The Flow Manifesto


New: See a simpler introduction to the thinking behind the Flow Programming Language on the following page: Solving the Multicore Dilemma


"Finding: There is no known alternative for sustaining growth in computing performance; however, no compelling programming paradigms for general parallel systems have yet emerged."
--The Future of Computing Performance: Game Over or Next Level?
Fuller & Millett (Eds.);  Committee on Sustaining Growth in Computing Performance,
National Research Council, The National Academies Press, 2010, p.81


INTRODUCTION

With the advent of the multicore era, the continuation of Moore's Law will depend upon efficient usage of multiple cores. However, most multiprocessing frameworks incur so much additional design and implementation overhead to shoehorn a programmer's design into the framework that they are impractical to apply for all but core inner processing loops. More importantly though, most human brains are inherently incapable of designing safe multithreaded code using many of the current multithreading tools. Therefore, without improving the state of the art, not only are we entering an era of slowed growth in computing performance, but we are potentially entering an era of significantly buggier software.

Flow is a new programming paradigm, as well as a design for a reference programming language that implements this paradigm, that aims to solve the multicore dilemma while providing certain strong and specific guarantees about program correctness and runtime safety.

Flow provides implicit parallelism by syntactically enforcing every program to be a lattice or partial ordering, such that the program itself is the data dependency graph of the program, making it trivial to find parallelizations.  Flow also draws from category theory to find optimal and provably correct reordering and grouping of computations to maximize parallel speedup and minimize interprocess communication (increasing the parallelizable portion of the program and therefore the effective speedup; see Amdahl's Law).


Goals of Flow
  1. Solve the multicore dilemma: Provide ubiquitous implicit parallelization with close to zero programmer effort
  2. Prevent programmers from shooting themselves in the foot by making it physically impossible to introduce race conditions or deadlocks into a typical parallel program.
  3. Further optimize implicit parallelization to achieve maximum speedup by data reordering and grouping through inference on function properties (associativity / commutativity etc.).  This allows for powerful partitioning of the code and the creation of an optimal threading plan.
  4. "Write once, parallelize anywhere" -- With very little modification, the same Flow code should be able to target hadoop / Google's MapReduce, the JVM using Java Threads, C using pthreads, MPI or a GPU using CUDA.  By understanding the big-Oh complexity of computations in Flow programs, The compiler will produce code that optimally partitions the workload in systems with nonuniform computing speeds and/or communication costs, e.g. CPU/GPU combinations.
  5. Precise automatic memory allocation and deallocation -- there is no need to explicitly malloc/free memory, and there is also no garbage collector (or long unpredictable pauses due to garbage collection, which is one of the reasons Google won't use Java for many critical parts of their infrastructure).  Memory is allocated precisely when it is needed and it is freed precisely when it is no longer needed, by looking at incoming and outgoing edges of each node in the program lattice respectively.
Flow is not just a single language, but a compiler platform that many language frontends can target and that will include backends for most parallel computing platforms. Ideally Flow will eventually hook into LLVM, so that with some limitations a variety of standard programming languages could target Flow and thereby make use of its parallelization functionality.

Flow will include a small reference language with very simple and extensible syntax running on top of the generic Flow compiler platform.  The reference language will focus on the use case of data analysis pipelines, and as such will provide powerful built-in support for string and data manipulation with collections.

You will notice there is very little discussion of actual syntax below -- this document almost exclusively covers semantics and philosophy.  A wide range of different programming language syntaxes could be mapped to these semantics.

Ultimately you will be able to use Flow without understanding any of the theory behind it, and it should "just work".  However the theory behind Flow is critical to how it works, so it is presented below.

BACKGROUND


The Multicore Dilemma

To continue to extend Moore's Law we have to make full use of the multiple cores available in modern CPUs, but software is not keeping up with hardware.  Implicit parallelization is becoming urgent because humans are notoriously bad at writing multithreaded code, and because it's a heck of a lot of (what should be unnecessary) work.  However it is only possible to correctly implicitly parallelize a program when the compiler has exact knowledge of the data dependencies of the program.


The ability to read the current value of a variable: "considered harmful"

The goto statement (and the flowchart, by direct analogy) has been considered harmful since the early days of computer programming languages, and both have since been superseded by structured programming.  However, while eliminating goto dramatically improved software safety in the single-core era, it is not enough for the multi-core era. The ability to read the current value of a variable should also be considered harmful, because it makes it impossible to completely and correctly determine the data dependency graph of a program at compiletime, which makes optimal implicit parallelization with safety and ordering guarantees impossible in the general case.  The notion of "the current value of a variable" is to the data dependency graph of a parallel program as goto was to the control flow of a serial program.  Most of the serious programmer errors that don't simply involve incorrect logic are due to incorrectly enforcing or misinterpreting data dependency constraints.

The harmfulness of the concept of current value can be justified somewhat esoterically by reference to Einstein's principle of the relativity of simultaneity: if there is greater than zero distance between independent observers, communication between them takes greater than zero time, and therefore there is no such thing as an absolute clock or an absolute concept of "now" that is valid in all reference frames.  Why should we expect to be able to get the value of a variable "now" in parallel programming without unexpected results?

An algorithm can always be diagrammed as a partial ordering (directed acyclic graph / DAG) of the data dependencies between operands and the values of operators, however that dependency graph is not immediately accessible through static analysis of source code written in most programming languages -- you don't know what the actual data dependencies are between specific values until you read those values at runtime with imperative languages.  Implicit parallelization attempts to recover the data dependency DAG from a serial program, effectively trying to undo the topological sort ordering that renders the true data dependency DAG into the original serialized program.  This is a very hard problem.

One of the most basic tenets of any programming language is that you should be able to read the value of a variable.  One of the most basic tenets of an imperative programming language is that you should be able to read the current value of a variable, with the understanding that that value can change over time, hence the term 'variable'.  However the ability to read the current value of a variable (as contrasted with being able to read a constant's final immutable value, or a variable's value at a specified timestamp, i.e. after a specific other timestamped computation completes) is precisely what prevents compilers of imperative languages from determining the exact data dependency graph of a program at compiletime: the actual dependencies between data can't in general be known until the variables are actually read at runtime. Notice that if you were to draw up a data dependency graph, the dependencies would be between values and not between variables, because values are computed from other values (which happen to be read in from the current value of variables in imperative languages).

If the concept of reading a variable's current value is allowed in a language, then the topology of the partial ordering of data dependencies is brittle: when a variable changes value, at that specific point in time you can think of the node in the data dependency graph splitting into two nodes, "old value" and "new value", and from that point on, anything that refers to that variable can only point to the new value, while already-computed dependents of the variable "still point to" the old value in the data dependency graph, since they already read from the variable before it changed.  This can lead to unpredictable results at runtime, since in referring to a variable, the programmer might mean the old value or might mean the new value, depending on exact timing, giving rise to race conditions and unexpected side effects.  Also, the fact that data dependencies can change at any time means that it is not possible to know precisely when memory needs to be freed and when it is possible to drop references to allocated memory. This means memory allocations have to be held onto until the programmer explicitly releases them or a garbage collector can't find any references to them.

The two alternatives to "current value" are (1) using constants instead, i.e. forcing "variables" to have single, final/immutable values, and (2) requiring references to variables to state the specific timestamp for which they want to read from the variable, because this represents a determinate, specific value (and corresponds to a specific node in the data dependency DAG). Both these alternatives to standard variable dereference semantics allow the data dependency graph to be known precisely at compiletime.
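
Flow itself has no fixed syntax, but alternative (2) can be sketched in ordinary Java.  In the sketch below, VersionedCell is a hypothetical illustration rather than part of any Flow API: every write produces a new immutable value with its own timestamp, and every read must name the timestamp it depends on, so each read corresponds to exactly one node in the data dependency graph.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: a "variable" whose reads must name a specific timestamp,
    // so every read corresponds to one fixed node in the data dependency graph.
    public class VersionedCell<T> {
        private final List<T> versions = new ArrayList<>();

        // Each write appends a new immutable value and returns its timestamp.
        public synchronized int write(T value) {
            versions.add(value);
            return versions.size() - 1;
        }

        // Reads must say *which* value they depend on -- there is no "current value".
        public synchronized T read(int timestamp) {
            return versions.get(timestamp);
        }

        public static void main(String[] args) {
            VersionedCell<Integer> x = new VersionedCell<>();
            int t0 = x.write(1);             // x@t0 = 1
            int t1 = x.write(2);             // x@t1 = 2
            int y = x.read(t0) + 10;         // y depends on x@t0, never on "x right now"
            int z = x.read(t1) * 3;          // z depends on x@t1
            System.out.println(y + " " + z); // prints "11 6"
        }
    }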


THE FUNDAMENTAL PRINCIPLE OF FLOW:
The program is the data dependency graph

The Fundamental Principle of Flow is that for a programming language to provide optimal implicit parallelization, its syntax should be restricted such that the program's source code directly and statically yields its data dependency graph, meaning that the compiler can actually verify whether or not there are any statically-ambiguous or cyclic data dependency paths in a program, and refuse to compile any programs that do have them. This semantic restriction allows the compiler to make direct inferences about parallelizability and memory allocation lifecycle.

Many design aspects immediately cascade from this Fundamental Principle:
  • Every Flow program is a directed acyclic graph or partial ordering. (This is equivalent to stating, as above, that the program's source code directly and statically yields its data dependency graph, since any other data dependency graph must be produced by an imperative program whose structure is not a DAG, because it refers to the current values of variables.)
  • By definition, nodes in a program's DAG represent the result of applying a function to produce a specific value or collection of values (as contrasted with variables), and that value or collection of values must be immutable: a value cannot be read until it is initialized and is immutable once it is initialized. Restated, at any given time every value has either exactly one writer and zero readers (before it is initialized) or zero writers and one or more readers (once it is initialized). This eliminates the harmful notion of "the current value of a variable" as discussed above.  There is more discussion below about how this principle of immutability is made less restrictive by using timestamps and scatter operations, allowing for "safe imperative programming".
  • The order of computation of a program is the partial ordering of the program's statements.  When there is no defined ordering relationship between two nodes (no path through the DAG from one node to the other), they can be computed in parallel.
  • The order of memory allocation and deallocation in the program is also directly induced by the data dependency graph of the program: the compiler knows exactly when memory needs to be allocated and when it can be freed.  Memory allocation is precise, implicit (there is no need for manual "malloc" or "free" calls) and there is no need for a garbage collector. (See notes on why garbage collection is bad below.)
  • Deadlocks are not possible because by definition Directed Acyclic Graphs do not have cycles.  
  • It is not possible to cause race conditions or access uninitialized memory because values are always initialized when they are read and they are immutable thereafter.
The partial ordering of a program's statements can be pruned to remove unused values (by finding the transitive closure), producing a lattice that represents the complete set of computations required from program start to program termination -- computations that produce unused values can automatically be pruned, and/or the user can be warned about unused code.

The resulting lattice can then be seen as a streaming pipeline that data is pushed through as it is transformed from input into output.  The same pipeline can be applied to each element of a (possibly infinite) list, blocking on list element availability, allowing for processing datasets as well as event streams.  This model of parallel computation is ideally suited for scaling from CPUs to GPUs all the way up to MapReduce clusters without changes to the source code.  In fact, Flow programs can be thought of as a large MapReduce pipeline, and this is true in a very deep sense (see "Category Theory Applied to Lattices" below).
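
To make the execution model concrete, here is a minimal Java sketch (not Flow syntax) of the kind of scheduling a Flow backend could emit for a three-node lattice: the two independent nodes run in parallel on a thread pool, and the dependent node starts automatically once both of its inputs are finalized.  The function names and values are arbitrary placeholders.

    import java.util.concurrent.CompletableFuture;

    // Sketch of how a Flow backend might execute a small dependency lattice on a
    // thread pool: nodes with no path between them run in parallel, and each node
    // starts as soon as all of its dependencies have produced their immutable values.
    public class LatticeSketch {
        public static void main(String[] args) {
            // a and b have no dependency path between them, so they may run in parallel.
            CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> expensive(2));
            CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> expensive(3));

            // c depends on both a and b; it is scheduled automatically once both are ready.
            CompletableFuture<Integer> c = a.thenCombine(b, (x, y) -> x + y);

            System.out.println(c.join());   // 4 + 9 = 13
        }

        // Stand-in for some costly, side-effect-free computation.
        static int expensive(int n) {
            return n * n;
        }
    }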

This is related to earlier paradigms such as "Data Flow Programming", however with Flow the single semantic restriction imposed on the language (the inability to read the current value of a variable) is as minimalistic and axiomatic as possible (while being both necessary and sufficient to provide the above guarantees), and restricts the syntax of a language as little as possible, so languages based on Flow should appear much more "normal" than Data Flow Programming languages, which typically look like circuit diagrams, and are hard to visually parse.  A lookalike version of many modern programming languages could be produced that replaces some unsafe constructs with safe versions, and is thereby able to target the Flow platform with minimum programmer retraining.

Aside: Why Garbage Collection is bad

The fact that Flow eliminates the need for garbage collection and also eliminates the need for manual memory management (malloc/free or new/delete) is actually a very big deal. Just as programmers are extremely bad at writing multithreaded code and having it be maximally efficient and bug-free, so too are programmers bad at managing their own memory -- but garbage collection is only a half-baked improvement over manual memory management, and it can turn into a crutch that programmers simply rely upon rather than understanding the nuances involved, which can lead to serious memory leak issues.

Garbage collection was a helpful small evolutionary step in programming language design. However we need to move on to something better -- and the only truly better solution is precise implicit memory allocation and deallocation, as provided by Flow.
  • Garbage collection does not solve memory leak issues: as any Java programmer knows, it is trivially easy to hold a reference to a large object that is no longer needed for an arbitrarily long period of time, but the JVM has no way of predicting whether any given referenced object might conceivably be used again at some point in the future.
  • In particular, if you allocate a large object in a certain scope, and then you're done using it, in Java and in a lot of other garbage-collected languages, the reference to the object is still in the local stack frame, so until that stack frame exits the garbage collector will not collect the value. This is particularly bad in data pipeline programming, where a lot of data transformations are typically implemented in the same scope, and memory hangs around a lot longer than is needed.
  • In fact if you are dealing with large data objects in Java, you have to manually set all references to them to null when you are finished with them if the references aren't immediately going to go out of scope. This may as well be manual memory management; it's basically the same as an explicit free/delete. However, just as a manually freed pointer stays in scope in C, a reference you have set to null also stays in scope in Java, and if you later go back and modify your code and accidentally add some code below the null assignment that uses the variable again, you'll get a NullPointerException. With Flow, an object is automatically freed right after the last thing that could ever use it has used it. (See the sketch after this list.)
  • Garbage collection also does not solve the need for manual memory allocation, which creates lots of boilerplate code -- e.g. when you create an ArrayList<ArrayList<String>> of specified dimensions, you need to iterate through and allocate the outer and each of the inner ArrayList objects yourself. Just adding garbage collection to a language does not obviate the need for manual memory management.
  • GC latency can be large and unpredictable -- on the order of tens of seconds per GC on a machine with tens of gigabytes of memory. Google refuses to use Java for mission-critical core infrastructure for this reason.
  • GC generally also locks up all threads in the VM while one thread runs the garbage collector, which is becoming a bigger and bigger problem on today's multicore machines. Multithread-capable garbage collectors in the JVM are still viewed as experimental and are not on by default.
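
The scope problem described in the list above can be illustrated with a short, self-contained Java example (the array size and the shape of the pipeline are arbitrary placeholders):

    import java.util.Arrays;

    // Illustration of the point above: in a garbage-collected language the lifetime of
    // an object is tied to reachability, not to its last use, so a large intermediate
    // result can stay live for the rest of the enclosing scope.
    public class PipelineScope {
        public static void main(String[] args) {
            int[] huge = new int[10_000_000];           // large intermediate buffer (~40 MB)
            long sum = Arrays.stream(huge).asLongStream().sum();

            // 'huge' is never used again, but it is still reachable from this stack
            // frame, so the collector must keep it alive until the method returns --
            // unless the programmer manually severs the reference:
            huge = null;                                // effectively a hand-written "free"

            // ... a long tail of further pipeline stages would run here ...
            System.out.println(sum);
        }
    }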

Curiosities of lattice-based programming

When a programming language describes programs as pipelines with lattice topologies, some interesting properties emerge:
  1. You can write program statements in any order -- they will be executed in the order of their data dependencies. In other words, you no longer need to perform your own mental topological sort of your program. We used to have to manually topologically sort the order of function declarations in our programs, because programming languages didn't allow forward references: we still have to sort function declarations in C, or at least provide forward declarations (function prototypes), but we don't have to topologically sort the order of methods in Java. Java is much easier to work with than C in this respect. Why should we have to topologically sort the statements within a method too?  Programming has become a process of not only figuring out what to compute in what order, but also what order to list the statements in the program.  Shouldn't the compiler decide on the actual order of execution (while respecting data dependencies) to maximize program throughput?  (See the sketch after this list.)
  2. There is no stack! Very few programming models don't have a stack (a couple of notable exceptions are Prolog and finite state machines). The stack is replaced in Flow by conditional hierarchical expansion of sub-lattices, which is computationally equivalent to a stack (and may be implemented using a stack by the compiler backend).
  3. It should also be possible to build graph visualizers for both the program lattice and the data that is flowing through it.  In fact it may even make more sense to program graphically rather than in textfile form, because the ordering of program lines in a text file are a somewhat arbitrary and artificial serialization of the true graph structure of the program.  An editor could offer the programmer a dual view of the program, graphical and topologically sorted.
  4. You can write out the value of any node or all nodes in the lattice to a file as you run the program (a process known as "checkpointing").  This will allow for freezing and restarting the computation at any point, as well as fine-grained debugging of a potentially huge data pipeline.  Also by recording not just the range values produced by each morphism but the corresponding domain values, it should be possible to offer a form of time-rewinding with the debugger: the ability to step backwards through time once a breakpoint is hit, not just forwards.
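
As a sketch of point 1, the following Java program (using a deliberately simplistic, hypothetical statement representation) recovers an execution order from statements listed in an arbitrary textual order by topologically sorting the data dependency graph -- roughly the job a Flow front end would do for the programmer:

    import java.util.*;

    // Sketch: recover an executable order for statements written in any textual order
    // by topologically sorting the data dependency graph (Kahn's algorithm).
    public class StatementOrder {
        public static void main(String[] args) {
            // Statements listed "out of order": each maps to the values it reads.
            Map<String, List<String>> reads = new LinkedHashMap<>();
            reads.put("d = b + c", List.of("b", "c"));
            reads.put("b = a * 2", List.of("a"));
            reads.put("c = a + 1", List.of("a"));
            reads.put("a = 5",     List.of());

            // Which statement defines which value.
            Map<String, String> definer = new HashMap<>();
            for (String stmt : reads.keySet()) definer.put(stmt.split(" = ")[0], stmt);

            // Count unsatisfied dependencies and record who depends on whom.
            Map<String, Integer> remaining = new HashMap<>();
            Map<String, List<String>> dependents = new HashMap<>();
            for (String stmt : reads.keySet()) {
                remaining.put(stmt, reads.get(stmt).size());
                for (String v : reads.get(stmt))
                    dependents.computeIfAbsent(definer.get(v), k -> new ArrayList<>()).add(stmt);
            }

            Deque<String> ready = new ArrayDeque<>();
            remaining.forEach((s, n) -> { if (n == 0) ready.add(s); });
            while (!ready.isEmpty()) {
                String stmt = ready.poll();
                // Statements in 'ready' at the same time (here b and c) have no path
                // between them and could be executed in parallel.
                System.out.println(stmt);
                for (String dep : dependents.getOrDefault(stmt, List.of()))
                    if (remaining.merge(dep, -1, Integer::sum) == 0) ready.add(dep);
            }
        }
    }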

    Making Flow Turing-Complete: The Structured Program Theorem

    Computation over a fixed-topology program lattice terminates in finite time assuming the operation at each node of the lattice takes finite time.  In fact the big-Oh complexity of the execution time of a fixed program lattice is quite easy to determine directly given certain input datasizes (a fact that will be used extensively in Flow for optimization purposes, described below).

    The fact that a fixed-topology program lattice terminates in finite time means that a programming language that only allows fixed lattice-based programs is not Turing-complete, because termination is decidable.  Therefore more elements must be added to the programming language to create a Turing-complete architecture.

    The Structured Program Theorem states that every computable function can be implemented by combining subprograms in only three ways:
    1. Executing one subprogram, and then another subprogram (sequence)
    2. Executing one of two subprograms according to the value of a boolean variable (selection)
    3. Executing a subprogram until a boolean variable is true (repetition)
    Condition 1 is a static condition; it can be satisfied by providing ordering features in the source code of the computational system in question.  Sequencing of computation is directly provided by the dependency nature of the lattice in Flow (you can chain together lattices into taller lattices such that one lattice is executed after another).

    Condition 2 is interesting: by changing the direction of the arrows in a data dependency diagram, lattices can be seen as either "push"/imperative models of computation (or as pipelines employing speculative execution, where all code paths are strictly evaluated or executed, and values only depended upon by unused branches of if-statements are discarded after they have been computed), or "pull"/functional models of computation (purely functional programs employing lazy evaluation, where values are not computed until they are needed).  Either way though, Flow supports the second condition.

    Condition 3 is the only condition that is not supported by the description of Flow so far as a fixed-topology lattice.  The way Flow implements the third condition is to expand sub-lattices within the larger program lattice until a desired condition is met (this is effectively a recursive version of the iterative loop logic described in Condition 3). Alternatively, in an iterative interpretation, this can be seen as a fixed-topology lattice being applied to each element of a (possibly infinite) list, with the value of the (n+1)th element in the list determined as the output of the lattice that depends on the (n)th element. Computation on the (n+1)th element blocks until the result has been computed from the (n)th element, so this enforces a strict ordering between the two lattices.  The program terminates automatically when there are no more elements in the list, i.e. when the last iteration does not write an (n+1)th element to the end of the list.
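
    A minimal Java sketch of this iterative interpretation of Condition 3 (the doubling rule and the cutoff are arbitrary placeholders): the loop state is an append-only list in which element n+1 is written exactly once as a function of element n, and the loop terminates when no further element is appended.

        import java.util.ArrayList;
        import java.util.List;

        // Sketch of Condition 3 as a self-extending list: each application of the loop
        // body reads element n and either appends element n+1 or stops, so iteration
        // becomes a chain of single-assignment values rather than a mutable loop variable.
        public class RepetitionAsList {
            public static void main(String[] args) {
                List<Integer> states = new ArrayList<>();
                states.add(1);                               // initial state (element 0)

                int n = 0;
                while (true) {
                    int current = states.get(n);             // read element n (immutable once written)
                    if (current > 1000) break;               // condition met: no (n+1)th element written
                    states.add(current * 2);                 // write element n+1 exactly once
                    n++;
                }
                System.out.println(states);                  // [1, 2, 4, ..., 1024, 2048]
            }
        }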

    Generalization of the Structured Program Theorem to Computational Lattices

    Since all three steps in the Structured Program Theorem deal fundamentally with single-threaded, deterministic computation, it is worth thinking about how the Structured Program Theorem could be extended to arbitrary parallel programming, and in particular lattice-based programming. We define the "Structured Programming Theorem for Computational Lattices" as follows: every deterministic but implicitly-parallelizable computable function can be implemented by combining subprograms in only three ways:

    1. Executing a series of subprograms in series or in parallel, in an order constrained by a partial ordering of the subprograms (partial ordering)
    2. Executing one of two subprograms according to the value of a boolean variable (selection)
    3. Conditionally expanding nested partial orderings within the overall partial ordering of the function until a boolean variable is true (repetition).
    A Further Generalization of the Structured Program Theorem to "Stored Computational Lattices"


    Condition 1 of the new Structured Programming Theorem for Computational Lattices implies something deeper about parallel computation.

    Computing took a huge step forward when, in the 1930s and 1940s, the design of computers advanced from fixed-form programming (where the hardware of a computer was configured to compute only a fixed function over some input data, and where switches or patch cables were used to change the hard-wired logic of the computer) to machines that implemented the "stored program" concept. Under this concept, the function that a computer is hardwired to compute is actually the interpretation of instructions that arrive in the input data stream (i.e. are stored in the computer's memory), and these instructions operate on other instructions in the data stream (or other data in the memory). Turing showed that such instructions could themselves be encoded as numbers, laying the foundation for the concept of the universal Turing machine. It is difficult today to fully appreciate what a brilliant breakthrough it was to store a computer's "program of activities" in memory alongside the data that those instructions acted upon.

    Note that the "stored program" concept takes the computing abstraction up one level, so that the exact computation performed (or the exact path that information takes through the computer, and the way information is transformed) becomes dependent not just upon the response of hard-wired but conditional logic as it responds to incoming data, but also to the exact list of instructions comprising the program stored in memory. In both cases, (with the exception of computations that do not require input data, such as listing the Fibonacci numbers,) the computation performed by the computer may be seen as "data driven", but in the second case, the program structure is also data.

    Everything covered so far about program lattices implies that the programming lattice has fixed form (with the exception that the 3rd condition of the newer version of the Structured Programming Theorem calls for expansion of sub-lattices until a given condition is met). However, from a parallel computing perspective, this could be seen as analogous to the fixed-form computing of the 1930s: the partial ordering itself that determines the ordering of computed operations is fixed, and cannot be changed at runtime, except by conditional logic (i.e. in response to input data). It is conceivable that making the topology of the computing lattice itself a stored structure will yield a similar increase in power of expression to the introduction of stored programs. Conceptually, this requires a "stored lattice" version of Flow to produce native code that can read not only instructions but also the data dependency graph associated with the instructions from memory, and respond by optimally scheduling work units and correctly ordering and synchronizing data access. This is in contrast to a fixed-lattice version of Flow that produces code with hard-wired ordering and synchronization logic (i.e. a fixed-form pipeline).

    A "stored lattice" version of Flow would be useful for computing over data that is intrinsically DAG-structured. For example, consider a Flow cross-compiler that examines a Flow program, and finds code in the program that is already implemented in a library routine, and then refactorizes the code to use the library routine instead of the long-winded version (e.g. the cross-compiler could convert some recursive routine into a fold operation). This cross-compiler will need to look at all sub-lattices of program structure in turn, comparing them to the lattices of all library routines. However, some library routines will be called by (or subsumed within) other library routines, so, optimally, the cross-compiler will need to compare first to the smaller library lattices and then to subsequently larger lattices. In fact, for any program or set of libraries, it would be possible to build a DAG data structure that shows which code snippets are called by which, or which code snippets are entirely subsumed within which. The cross-compiler's job is to then start with all the smallest library snippets and search within the user's programs for matches for these first, and to then search for larger library snippets that completely contain those smaller snippets. The ordering of these comparisons is DAG-structured, and Flow should be able to order its search strategy according to this structure without the user having to write any ordering or synchronization code.

    However, the most significant reason to implement a "stored lattice" version of Flow is that it will allow for Futamura Projection -- the distinction between static and dynamic program structure will be able to be blurred, and parts of both the compiler and an input program to be compiled will be able to be partially evaluated together without distinction.


    The Five Main Challenges of Parallelism

    The Future of Computing Performance: Game Over or Next Level?, p.82, states there are five main challenges to parallelism:
    1. Finding independent operations
    2. Communicating between operations
    3. Preserving locality between operations
    4. Synchronizing operations
    5. Balancing the load represented by the operations among the system resources.

    Solving challenges 1 and 4 is trivial with Flow, because the data dependencies are explicit and known at compiletime. Challenges 2, 3 and 5 can all be solved relatively easily by the compiler once challenges 1 and 4 are solved, and are easier to solve implicitly in Flow than in other programming languages because of Flow's link to Category Theory and because of Flow's knowledge of the big-Oh complexity of each morphism. This big-Oh complexity can be a function of architectural parameters and datasizes, changing the parallelization strategy dynamically at various data size thresholds so that the optimal strategy is always chosen regardless of the input data.


    IMPERATIVE PROGRAMMING


    Push vs. pull, scatter vs. gather, imperative vs. functional

    In the past, imperative programming has always been much harder to programmatically reason about than functional programming (in fact, static analysis of the parallelizability of imperative code is undecidable in the general case).  Imperative code has therefore always been much harder to implicitly parallelize than purely functional code, which is directly parallelizable.

    As described above in the treatment of Condition 2 of the Structured Program Theorem, programs can be thought of as push or pull.  This could also be termed "scatter or gather".  Broadly speaking, imperative and functional programs both gather values (operands), but imperative programs also scatter values by pushing them to specific locations.  All programs must implement both fanout and fanin in order to satisfy the space complexity requirements of computation: functional programs implement fanout indirectly by evaluating many different functions, each producing a new result, or by evaluating the same function many times with different parameters; imperative programs implement fanout directly.

    If-statements (as needed for Condition 2 above) act differently in two different situations:
    • In "pull"-computation, an if-statement simply selects the value it needs and ignores the other value.  Lazy evaluation can be used to pull computation through the lattice as needed so that unused values are never calculated.
    • in "push"-computation, a value is pushed to one location or another depending on the value of the if-statement's condition (or in other cases, e.g. with a "then" block but not an "else" block, it is pushed to a location or not pushed to the location depending on the condition).  However every node in the lattice must be initialized before it is read by its own dependents and can only be written once, so in the push model we must handle the case where a node does not receive a value at all, or receives values pushed by multiple other nodes.  In push computation, the target node of one or more push operations receives zero or more values from one or more dependencies, and therefore the value at the node is actually a collection of zero or more values.  The collection is finalized once all dependencies have completed their own computation (pushed all the values they are going to push), and when the collection of pushed values is finalized (immutable) at the recipient, then it may be read in the normal manner.  The receiving node can also be thought of as a channel.  The collection is unordered by default (a multiset), because no guarantee can be given about the order the values arrived at the receiver.

    "Safe" parallel imperative programming is possible in Flow

    Push and pull models of computation are both important in practical software engineering, and both are supported by Flow if the above constraint is imposed on the result of a push operation.  This distinguishes Flow semantics from those of other implicitly parallelizing languages, because "safe" parallel imperative programming is possible in Flow: safe in the sense that imperative statements cannot cause race conditions.

    This is immensely useful for an implicitly parallelizing language, because programmers are used to writing imperative code (and many programmers find it hard to get their head around using functional programming languages for real-world problems), and prior to Flow, most efficient, truly implicit parallelizing languages either forced the programmer to use a purely functional style, or could not offer safety guarantees about the ordering of computation.

    The syntax of any programming language based on Flow semantics can further restrict its imperative syntax to also eliminate side-effects, if desired.  However if these sorts of restrictions make the language harder to use, where possible the programming language should "give back" a safer close equivalent of the desired functionality. A lot of programmers are either not trained in functional programming, don't like the restrictions of functional programming, or simply want the freedom to break the rules of functional programming when they need to to get things done in the most practical way possible (it is just quite unsafe to do so in most programming languages).

    A lot of imperative-like functionality can be built on top of the Flow framework.  For example, it was stated above that values are immutable.  However if a variable needs to be written to multiple times, representing different values:
    • Each time the variable is written to, a timestamp is incremented.
    • The imperative concept of "get the current value of the variable" is replaced by "get the value of the variable with timestamp t".
    • The variable, because it is written multiple times, is actually therefore a list of timestamped values, with the index into the list acting as the timestamp.  A value cannot be read from the variable without specifying which timestamp you are reading from, which provides ordering guarantees.
    • It is possible to get the last value of the variable but not the current value of the variable: trying to read the last value blocks on finalization of the list of values (which happens after the last time the list is written to).
    • Compare this behavior to the recipient of a push operation above: the only difference is that push operations are not timestamped and so the result is an unordered multiset rather than an ordered list.
    • Also recall that to satisfy Condition 3 of the Structured Program Theorem and to allow execution of a loop until a boolean condition is true, the (n+1)th value of a list is produced from the (n)th value, until no more values are added to the list: effectively we are using the same timestamping mechanism to "update a boolean condition variable" until the condition is false, at which point subsequent values are no longer pushed onto the end of the list.
    In the context of the above model of imperative programming, it becomes clearer why side-effects (and variable rewriting in general) can wreak havoc: when you write to a variable, you update its timestamp, possibly unbeknownst to the rest of the program; each time you call a procedure with side effects it is referring to a value with a timestamp that is unknown a priori, or at least a timestamp that cannot in general be determined at compile time; and it is no longer possible to refer to the value with the old timestamp, so you may end up dropping references and leaking memory.

    A lot of compiler optimizations can be performed to make this timestamping mechanism optimal: for example, if only the (n)th and (n+1)th values of a list are ever read/written, then the "list" can simply be two variables or structs that are alternately used, removing the need for lots of malloc/free operations.
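
    A minimal Java sketch of that lowering (the doubling computation is an arbitrary placeholder): when only timestamps n and n+1 are ever live, the conceptual list of timestamped values can be compiled down to two alternating slots.

        // Sketch of the optimization described above: when only the (n)th and (n+1)th
        // timestamps of a value are ever live at once, the conceptual list of timestamped
        // values can be lowered to two alternating slots instead of a growing allocation.
        public class DoubleBuffer {
            public static void main(String[] args) {
                int[] slot = new int[2];
                slot[0] = 1;                                 // timestamp 0
                for (int n = 0; n < 10; n++) {
                    int cur = n % 2, next = (n + 1) % 2;
                    slot[next] = slot[cur] * 2;              // timestamp n+1 computed from timestamp n
                }
                System.out.println(slot[10 % 2]);            // value at timestamp 10: 1024
            }
        }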

    "Recommendation: Invest in research and development of programming methods that will enable efficient use of parallel systems not only by parallel systems experts but also by typical programmers" -- The Future of Computing Performance: Game Over or Next Level?, p.99

    CATEGORY THEORY APPLIED TO LATTICES

    Category theory: sets, functions and maps

    Category theory is the study of the mappings (morphisms) between mathematical objects.  In the category of sets, the mathematical objects are sets and the morphisms are functions.  Because mathematical functions are the building blocks of computable functions (again, as defined by the Structured Program Theorem above), category theory provides rich tools for the study of programming languages.

    Flow uses basic principles of category theory extensively.  Basically each node (value) in the program lattice represents a morphism that maps from the cross product of the ranges of its dependencies to its own range.

    In this sense, a Flow program is like a big MapReduce pipeline: a MapReduce operation is basically just a morphism. Any function can be decomposed into a surjection and an injection, as can one or both of the map and reduce steps of a MapReduce operation.

    Properties of domains and morphisms are tracked closely so that the properties can be used for inference of program properties (similarly to the Fortress programming language):
    • The domain of each node in a program lattice is tracked in as much detail as possible.  For example, if i is in the range [0,10) and j = i+1, then the range of j is [1,11).
    • The properties of morphisms are tracked in as much detail as possible.  For example, if a morphism is surjective and injective, then it is a bijection and therefore it is invertible.  Also the big-Oh complexity of morphisms is calculated where possible, and this is used for parallel partitioning purposes.
    Note that the word "domain" describes the set of values that can be computed for a given node of the program lattice.  However it is an overloaded word, "domain" is the input of a function and "range" (or sometimes "codomain", though that can have a slightly different meaning) is the range of values of its output.  Generally the term "domain of a function" will be used to differentiate it from the set of values a specific node in the program lattice can take.

    The following attributes of a domain are tracked:
    • Type of a single value in the domain: string, integer, float, tuple, struct, list, map, numeric interval, etc.
    • Are the actual values in the domain known or unknown a priori?  (Domains may consist of unknown values if the values in the domain depend on input data, or known values if the values are literals.)
    • Does the domain have a natural ordering or not?  (For collection types, is the specific collection instance ordered or not?)
    • For numeric interval types: is the range countable (iterable) or not?  Is it open or closed on each end?
    • Are duplicate values allowed to be present in the domain or not?
    • Is the domain sparse or non-sparse?
    • What is the basic set-theoretic relationship of this domain to other domains in the program?  e.g. one domain may be the union of two others.
    • Time complexity of reading/writing/mallocing/freeing values in the domain (note that values can themselves be collections and may have nontrivial time complexity).
    The following attributes of a morphism are tracked:
    • The domains (set of values, as above) of both the domain and the range of the morphism.
    • Whether the morphism is surjective and/or injective
    • Whether the function is partial (undefined for some domain values, or causes exception/infinite loop for some domain values) or total
    • Associativity and commutativity of the morphism, and distributivity over other morphisms where applicable
    • For morphisms that are binary relations: reflexivity, symmetry, transitivity (if all three are true, it is an equivalence relation)
    • The identity element of the morphism, if it has one
    • Time complexity of applying the morphism (potentially as a function of some attribute of values in the domain, such as the list length if the domain is a set of lists).
    See "Optimization of Implicit Parallelization" at the end for an idea of the sorts of ways these attributes are used to increase the parallelizability of a program. However, in essence, the idea of tracking the above attributes is that if you understand the freedom you have to group or reorder computations in different ways, and if you understand the mathematical meaning (not just the effect) of composing computations in different ways, it gives you more degrees of freedom over which to optimize implicit parallelization.  Tracking these properties of functions and their domains and ranges should open up an entirely new field of compiler optimization research wherein people will be able to apply the principles of category theory to write better optimizers.

    Exceptional domain values

    Domains are actually a union of their own values and a (possibly empty) set of exceptional values, for example #Null, #Nothing (from the Maybe type, different from #Null), #NumberFormatException, #NaN etc.  Programmers can return their own exceptional values from a morphism, which adds them to the set of exceptional values of the morphism's range.  Because domains are closely tracked through static analysis, exceptional values (like nulls) can't fall through the cracks, and have to either be handled by or passed through a morphism (passing an exception value through a morphism simply adds the exceptional value to the range of the morphism, so that anything depending on the range then has to handle the value as a special case, or further pass it along).

    This particular design consideration allows data to continue flowing through potentially massive Flow pipelines with minimal global disruption from a local exceptional event, and forces the programmer to deal with a problem as early as possible in the code before it can cascade and become difficult to debug.  An exception is just another value that flows through the pipeline like non-exceptional data.

    Because null values are tracked as an exceptional value #Null, there is no chance of NullPointerExceptions or segfaults, because anything that can take a null value must be handled explicitly.  In the same way, the possibility of generating division by zero errors can be statically checked for by seeing if the domain of the denominator of a division could contain the value 0.  If so, the programmer has to handle the zero case separately before the division will be allowed by the compiler (because the domain in all other cases no longer includes zero).
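
    The division-by-zero check can be approximated at the Java type level, as a rough stand-in for what Flow would verify statically through domain tracking (the NonZero wrapper and the #DivideByZero string are hypothetical):

        import java.util.Optional;

        // Approximation (at the Java type level) of what Flow would check statically through
        // domain tracking: division is only expressible on a denominator whose domain has had
        // zero excluded, so the zero case must be handled before the division can be written.
        public class SafeDivision {
            // Hypothetical wrapper standing in for "an integer whose domain excludes 0".
            record NonZero(int value) {
                static Optional<NonZero> of(int n) {
                    return n == 0 ? Optional.empty() : Optional.of(new NonZero(n));
                }
            }

            static int divide(int numerator, NonZero denominator) {
                return numerator / denominator.value();      // can never throw ArithmeticException
            }

            public static void main(String[] args) {
                int denom = Integer.parseInt(args.length > 0 ? args[0] : "0");
                String result = NonZero.of(denom)
                        .map(d -> String.valueOf(divide(100, d)))
                        .orElse("#DivideByZero");            // the exceptional value must be handled here
                System.out.println(result);
            }
        }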

    Morphisms and map/function duality

    A map (e.g. Java's HashMap) is a morphism, as is a regular function.  In fact a map can be used as a function and a function can be memoized into HashMap form.  Either incremental or whole-domain memoization can be triggered when the big-Oh analysis of a function call (number of calls and time complexity of the call vs. size of the domain, if the domain is countable) justifies one of the two approaches.
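
    A Java sketch of the memoized form such a morphism could be lowered to (the memoize helper and the slow function are hypothetical, not part of any Flow API):

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.function.Function;

        // Sketch of the map/function duality used for memoization: a pure morphism over a
        // countable domain can be transparently backed by a map, so repeated applications
        // become O(1) lookups.
        public class Memoize {
            static <A, B> Function<A, B> memoize(Function<A, B> f) {
                Map<A, B> cache = new ConcurrentHashMap<>();
                return a -> cache.computeIfAbsent(a, f);
            }

            public static void main(String[] args) {
                Function<Integer, Integer> slowSquare = n -> {
                    try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
                    return n * n;
                };
                Function<Integer, Integer> fast = memoize(slowSquare);
                System.out.println(fast.apply(7));   // computed (~100 ms)
                System.out.println(fast.apply(7));   // looked up from the map
            }
        }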

    Because morphisms and maps are basically the same thing, you use them the same way, so you can compose maps the same way you compose functions: (f o g)(x) = f(g(x)).  This maps the domain of g (of which x is a member) to the range of f, or if f and g are both maps, this "pipes" the range (values) of g through the keyspace of f, effectively mapping the domain of g to the range of f.  The 'map' function from functional programming and 'foreach' from imperative programming then basically become the same thing as composition.
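
    A small Java sketch of this duality (the example maps are arbitrary): a Map viewed through Map::get is just a function, so composing two maps is ordinary function composition, and applying the composite over the whole domain is exactly 'map'/'foreach'.

        import java.util.Map;
        import java.util.function.Function;

        // Sketch of map/function duality under composition: a Map can be viewed as a
        // morphism via Map::get, so composing two maps is ordinary function composition,
        // and 'map'/'foreach' over a collection is that composition applied to every
        // element of the domain.
        public class ComposeMaps {
            public static void main(String[] args) {
                Map<String, Integer> g = Map.of("a", 1, "b", 2, "c", 3);          // g : name -> id
                Map<Integer, String> f = Map.of(1, "one", 2, "two", 3, "three");  // f : id -> word

                Function<String, Integer> gFn = g::get;
                Function<Integer, String> fFn = f::get;
                Function<String, String> fog = fFn.compose(gFn);                  // (f o g)(x) = f(g(x))

                // Applying the composite morphism over g's whole domain is just 'map'
                // (iteration order of Map.of is unspecified).
                g.keySet().forEach(k -> System.out.println(k + " -> " + fog.apply(k)));
            }
        }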

    Also when morphisms are composed, it is possible to tell if the result of composition is undefined for some input values (by looking at the range of one function and the domain of the other).  If the aggregate composition is potentially undefined for some value, an #Undefined exceptional value is injected into the range, and that must subsequently be handled explicitly or passed through as described above.

    Because direct static analysis of the data dependency lattice is possible in Flow, various morphism-specific optimizations are possible that are generally not possible in most other programming languages, for example map fusion.

    Practical morphisms

    A few examples of how morphisms could usefully map to real-world programming constructs:
    • HashMaps are obviously morphisms, but both lists and arrays can also be modeled as morphisms, with an ordered, countable domain (the index) mapping to the range of actual values in the list or array.  Linked lists have O(list.length()) complexity for applying the morphism (i.e. for getting a value at a specific index), and arrays and ArrayLists have O(1) complexity.  For arrays, the domain of indices can be either sparse or non-sparse, but for lists, the indices are always non-sparse.
    • Array or ArrayList elements can be accessed individually, or the entire array contents can be processed in parallel through operations like morphism composition as described above.  The prevalence of this pattern of parallel function application in the Flow paradigm gives rise to an array programming style for Flow programs.
    • File input: reading a textfile can be modeled as producing a morphism mapping line number to the string representing each line.
    • File output: writing to a textfile can be modeled as writing to a special type of list item that happens to be a file.
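
    A short Java sketch of the file-input model from the list above (the file name is a placeholder): reading a text file yields a morphism from line number to line, represented here as an IntFunction backed by the list of lines.

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.List;
        import java.util.function.IntFunction;

        // Sketch of file input as a morphism: reading a text file yields a mapping from
        // line number (an ordered, countable domain) to the string contents of that line.
        public class FileAsMorphism {
            public static void main(String[] args) throws IOException {
                // Assumes the file exists and has at least one line.
                List<String> lines = Files.readAllLines(Path.of(args.length > 0 ? args[0] : "input.txt"));

                // The "file morphism": line number -> line, with O(1) application once read.
                IntFunction<String> fileLine = lines::get;

                System.out.println(fileLine.apply(0));        // first line of the file
                System.out.println("domain size: " + lines.size());
            }
        }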

    OPTIMIZATION OF IMPLICIT PARALLELIZATION

    Flow has several unique features that allow it to not only implicitly parallelize, but to do so in an unusually optimal way.
    • By tracking the big-Oh complexity of morphisms and lattices of morphisms, Flow is able to determine how to optimally partition the work so that the communication and synchronization overhead does not outweigh the speedup obtained through parallelization.  Different partitioning schemes can be chosen at compiletime and/or runtime depending on expected or actual data sizes.
    • Based on big-Oh analysis, memoization can be automatically enabled for a time-intensive function when it is known that the number of applications of the function exceeds the size of its domain (in other words, when values will be used multiple times, suggesting that the values should be cached).
    • In fact if an input file is modeled as a morphism from line number to text line with O(n) random access time, if the file is used in random access fashion rather than read linearly, the very same memoization machinery just described would automatically produce an O(1) caching layer for the file once the expected number of random accesses was greater than some threshold percentage of the file size.
    • Flow also tracks the associativity of morphisms: this can allow for lower inter-thread communication overhead, assuming the domain is ordered *and* it is addressable in O(1) time, e.g. the string concatenation "a" + "b" + "c" + "d" + "e" + "f" + "g" + "h" could be split into two separate work units and the results later concatenated, ("a" + "b" + "c" + "d") + ("e" + "f" + "g" + "h"), because the string concatenation function is associative.
    • Flow tracks the commutativity of functions: if a function is commutative then work can be processed as it comes in, so originally-ordered data can be reordered on the fly without affecting the result (potentially speeding up pipelining). In general, the more ordering constraints that can be relaxed, the more highly parallelized the code can become, because you move further away from a total ordering.
    • Also, by tracking commutativity, it is possible to report a compiletime error if a non-commutative function is being used to fold values in an unordered domain: the user has to somehow sort the domain first before applying the function.  Conversely, if the function is commutative, it is possible to fold values from an unordered domain (e.g. you can fold values from a set and not just a list).  Combined with tracking associativity as described above, if a fold function is known to be commutative and associative, you can have many threads working to fold values from a partitioned set in parallel, and then combine the results at the end. (Data Parallel Haskell also provides a log(N) undirected parallel fold (foldP) that folds in a similar way and requires an associative function, but it is up to the programmer to guarantee the function is associative, and a special function has to be provided as opposed to the compiler figuring out the parallelizability itself.) By understanding properties of morphisms, inherently serial functions like fold can be automatically parallelized in a provably safe way.
    • Other properties that are tracked, e.g. surjectivity / injectivity and properties of binary relations, allow for greater flexibility of operations on morphisms while guaranteeing type safety: e.g. if you try to invert a surjective morphism from element type A to element type B, you will get a morphism from element type B to element type Set<A>.
    The mathematically-motivated optimization possibilities for Flow are endless. The Flow platform provides a rich new environment for experimenting with various dependency-slicing algorithms that attempt to maximize throughput in a program lattice.
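
    As a concrete illustration of the associativity and commutativity points above, here is the kind of parallel fold a Flow backend could emit, sketched with Java's parallel streams (whose reduce contract likewise requires an associative combining function, but leaves proving associativity to the programmer):

        import java.util.stream.IntStream;

        // Sketch of the parallel fold described above: once the compiler has proven the
        // combining function associative (and, for unordered domains, commutative), a serial
        // fold can be emitted as a partitioned parallel reduction.
        public class ParallelFold {
            public static void main(String[] args) {
                // Integer addition is associative and commutative, so the work can be split
                // across threads and the partial sums combined in any grouping or order.
                long sum = IntStream.rangeClosed(1, 1_000_000)
                        .parallel()
                        .asLongStream()
                        .reduce(0L, Long::sum);
                System.out.println(sum);   // 500000500000
            }
        }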

    CONCLUSION

    The Flow programming paradigm requires the data dependency graph of a program to be directly determinable from its source code, which is directly equivalent to eliminating the concept of a variable's "current value".  The resulting dependency graph can be viewed as a lattice of morphisms, as defined in category theory.  This lattice is a partial ordering that directly yields a parallel execution plan for the different parts of the program, as well as a precise plan for allocation and deallocation of memory.  By tracking properties of domains and morphisms, and by applying basic category theory to these properties, Flow is able to safely parallelize some inherently-serial code, as well as modify parallel partitioning plans according to changing input data sizes.  The Flow programming paradigm will enable us to build the next generation of programming languages that will solve the multicore dilemma by enabling efficient and safe use of multiple cores with close to zero extra programmer effort.




