Solving the Multicore Dilemma


How bad is the multicore dilemma?

Very few people are aware of the fact that the world is headed towards a massive software train-wreck:
as per-core speeds plateau, we are not just headed towards a plateau in software speed: in our desire to continue the progress of Moore's Law, the human inability to write good multithreaded code is actually leading us towards an era of significantly buggier software.

The urgency of this problem is starting to become apparent:

"Finding: There is no known alternative for sustaining growth in computing performance; however, no compelling programming paradigms for general parallel systems have yet emerged." (p.81)

"Recommendation: Invest in research and development of programming methods that will enable efficient use of parallel systems not only by parallel systems experts but also by typical programmers" (p.99)

--The Future of Computing Performance: Game Over or Next Level? Fuller & Millett (Eds.); Committee on Sustaining Growth in Computing Performance, National Research Council, The National Academies Press, 2010

"6. Bring Back Moore's Law: The last 10 years have reminded us what Moore's Law actually says. Till about 2002 you could safely misinterpret it as promising that clock speeds would double every 18 months. Actually what it says is that circuit densities will double every 18 months. It used to seem pedantic to point that out. Not any more. Intel can no longer give us faster CPUs, just more of them.

"This Moore's Law is not as good as the old one. Moore's Law used to mean that if your software was slow, all you had to do was wait, and the inexorable progress of hardware would solve your problems. Now if your software is slow you have to rewrite it to do more things in parallel, which is a lot more work than waiting.

"It would be great if a startup could give us something of the old Moore's Law back, by writing software that could make a large number of CPUs look to the developer like one very fast CPU. There are several ways to approach this problem. The most ambitious is to try to do it automatically: to write a compiler that will parallelize our code for us. There's a name for this compiler, the sufficiently smart compiler, and it is a byword for impossibility. But is it really impossible? Is there no configuration of the bits in memory of a present day computer that is this compiler? If you really think so, you should try to prove it, because that would be an interesting result. And if it's not impossible but simply very hard, it might be worth trying to write it. The expected value would be high even if the chance of succeeding was low."

--Paul Graham, Frighteningly Ambitious Startup Ideas, March 2012

Paul Graham is right: we need "the sufficiently smart compiler". It's the only way to solve the multicore dilemma -- better parallelization frameworks (including java.util.concurrent, Hadoop, Erlang-style message passing, Futures etc.) buy us some time, but they are not the long-term solution, because they require shoehorning our design into the way the framework works. We need to be able to continue writing programs that specify what we want done, not how to break up the workload into pieces that can be run in parallel -- the compiler should worry about that. Furthermore, parallelization decisions should be able to be made at anything from very fine-grained to very coarse-grained levels in a manner transparent to the programmer: the unit of parallelization shouldn't be some framework data structure; the same approach to parallelization should apply to each and every part of a program.

To answer Paul Graham's question: no, writing "the sufficiently smart compiler" is not impossible. In fact, it is simple to prove mathematically that the sufficiently smart compiler can be built -- if and only if we minimally restrict the syntax and semantics of modern programming languages in one very specific way that makes it possible to construct such a compiler. That means, I claim, that the sufficiently smart compiler cannot be built to handle the general case if we don't do the one thing I describe below differently. Very, very smart compilers can no doubt be built, but they will handle only "most" cases, not all cases. I would hate to see the rat's nest of code required to intelligently parallelize most modern programming languages... and I would hate to trust my life to code that might fall over because the compiler, in all its complexity, gave incorrect guarantees about safety and a race condition inadvertently occurred.

Another name for "the sufficiently smart compiler" is a compiler that is capable of precise implicit parallelization. It would also be nice if we could write a compiler that could target many different models of parallel computation, from CPUs to GPUs to Hadoop clusters:

Holy Grail: "write once, parallelize anywhere"


Pure functional languages are implicitly parallelizable

Pure functional programming languages are implicitly parallelizable by nature, because two different invocations of a function cannot interfere with each other -- there are no side effects. There are several compilers for pure functional programming languages that make use of this fact to execute different non-interdependent parts of a program (or applications of a function to different elements of a list or collection) in parallel.

For example, consider the following call to the function 'map' in Haskell:

y = map sum [[1,2,3], [4,5,6], [7,8,9]]

This invocation of the function 'map' separately computes the sum of each of three sublists, i.e. applies 'sum' to the sublist [1,2,3], then the sublist [4,5,6], then the sublist [7,8,9], and then builds a list out of the three results, producing y = [6, 15, 24]. There is no reason why, for longer sublists, each of these invocations of 'sum' on a sublist couldn't be run in parallel with the others, because they don't depend upon each other, and can't interfere with each other.
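
For comparison, here is a rough Java analogue -- a sketch only, and one that uses the explicit parallel-streams framework rather than true implicit parallelization -- showing that because the per-element work (summing a sublist) is pure, the elements can be processed on different cores without changing the result:

import java.util.Arrays;
import java.util.List;

public class ParallelMapSum {
    public static void main(String[] args) {
        List<int[]> sublists = Arrays.asList(
                new int[] {1, 2, 3}, new int[] {4, 5, 6}, new int[] {7, 8, 9});

        // Summing a sublist has no side effects, so each sublist can safely be
        // handled by a different core; the result is the same either way.
        int[] y = sublists.parallelStream()
                          .mapToInt(sub -> Arrays.stream(sub).sum())
                          .toArray();

        System.out.println(Arrays.toString(y));  // [6, 15, 24]
    }
}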

Implicit parallelization works nicely in functional programming languages, and is provably safe -- but most programmers are not capable of being productive in pure functional programming languages, so they continue to get their work done in imperative programming languages like C/C++, Java or Python. This presents a problem if the solution to the multicore dilemma is supposed to be "just use a pure functional programming language".

Imperative languages are not, in general, implicitly parallelizable

Imperative languages have side effects, meaning in the simplest case that calling a method twice with the same parameters may not return the same result both times, depending upon the value of some external state (e.g. the member variable of a class) that the method has access to.

The following code will return a different value each of the two times addToTotal(5) is called. Furthermore, if the operation "+=" is not atomic, then in a multithreaded environment some values of "num" may be lost (never added to the total) if they are overwritten by another "+=" operation between the read, add and write steps -- and another potential race exists because "total" is read twice, once during "+=" and once by the return statement.

public class Counter {
    // External state => addToTotal() has side effects
    private int total;

    // Not a pure function -- and not threadsafe either
    public int addToTotal(int num) {
        total += num;
        return total;
    }
}
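
To make the race concrete, here is a hypothetical test harness (not part of the original class; it assumes the Counter class above is in scope) in which two threads each perform a million increments and some of them are usually lost:

public class CounterRace {
    public static void main(String[] args) throws InterruptedException {
        Counter counter = new Counter();
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++)
                counter.addToTotal(1);
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // If "+=" were atomic this would always print 2000000; in practice it
        // usually prints something smaller, because increments are lost when
        // the two threads interleave between the read and the write.
        System.out.println(counter.addToTotal(0));
    }
}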

Side effects make it hard for a compiler to predict ahead of time what the behavior of a function will be -- even in a single-threaded environment -- because the precise data dependencies in a program can't be known until runtime. More precisely, the only way of determining exactly what decisions an imperative program will make on a given input (again, even in the single-threaded case) is to run the program. Predicting what an imperative program will do without running it is uncomputable in the general case: predicting the precise control flow is as hard as the halting problem.

Figuring out what a program's behavior will be is made even more difficult when you introduce multiple threads of execution. In particular, if you don't enforce an ordering between the computations running in different threads, you can get race conditions, where a program may do two or more different things (one of which may be the intended behavior, and another of which may be a crash) depending on the exact order in which certain operations complete. This situation typically arises when the programmer confuses compiletime (static) and runtime (dynamic) data dependencies: they refer to a specific variable in the source and expect it to have a certain value at the time it is accessed, but the value of that variable changes underneath them before that time, or doesn't change to the expected value until after they have read it.

The real difference between imperative and functional programming languages: scatter vs. gather

In spite of the fact that functional programming languages are typically defined as languages with pure functions (functions that do not have side effects), and imperative programming languages as languages that have side effects, there is another way to look at the difference between the two: functional programming languages exhibit a "gather" or "pull" mode of computation, while imperative programming languages add to that an additional "scatter" or "push" mode of computation.

Functional programming languages pull values from the results of previous computations into the parameters of a function, compute something from those inputs, and present the results as the function's return value, which can subsequently be pulled into other computations. The return values of a specific function invocation are only computed once, and are only accessible for reading once they have been computed and finalized -- therefore functional programming languages deal with values, not variables. The simplest push operation in an imperative programming language is "write this value to this variable, and do it right now" -- and that write operation on one specific variable can be performed multiple times from within a single thread, or even from several different threads.

Consider the following Java code example:

/** Produce a histogram of the number of instances of
 *  each char. */
int[] histogram(char[] input) {
    int[] hist = new int[Character.MAX_VALUE + 1];
    for (char c : input)
        hist[c]++;
    return hist;
}

This builds a histogram of how many times each character occurs in the input array -- it effectively "scatters" incremental counts of +1 to the array location hist[c] for each character c in the input array.


This is a very natural way to program, but it leads to code that can't easily be implicitly parallelized: if we can't assume the increment operation "++" is atomic, then the compiler would have to do some tricky things here -- for example, giving each thread its own thread-local histogram to avoid lock contention -- if the input array were to be split into chunks and handled by different threads.
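
To make those "tricky things" concrete, here is a hand-written sketch (names and structure invented for illustration; roughly what a parallelizing compiler or a careful programmer would have to produce today) that splits the input into chunks, gives each worker a private histogram, and merges the partial counts at the end:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelHistogram {
    static int[] histogram(char[] input, int numWorkers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        List<Future<int[]>> partials = new ArrayList<>();
        int chunkSize = (input.length + numWorkers - 1) / numWorkers;
        for (int w = 0; w < numWorkers; w++) {
            final int start = w * chunkSize;
            final int end = Math.min(input.length, start + chunkSize);
            partials.add(pool.submit(() -> {
                // Each worker counts into its own private array: no locks, no contention
                int[] local = new int[Character.MAX_VALUE + 1];
                for (int i = start; i < end; i++)
                    local[input[i]]++;
                return local;
            }));
        }
        // Merge the per-worker counts into the final histogram
        int[] hist = new int[Character.MAX_VALUE + 1];
        for (Future<int[]> partial : partials) {
            int[] local = partial.get();
            for (int c = 0; c < local.length; c++)
                hist[c] += local[c];
        }
        pool.shutdown();
        return hist;
    }
}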

(Note that this particular operation could still be performed in a similar way in a functional programming language, without too much loss of speed, by effectively reversing the direction of each arrow and turning a scatter operation into a gather operation -- but this sort of inversion can be tricky to do, or tricky to do efficiently.)

The key point, though, from the point of view of parallelizability, is that as imperative code gets far more complicated than the simple example above, it becomes extremely difficult to write the heuristics that decide where to parallelize the code, and to know which operations are safe to run in parallel and which are not. In fact, as stated above, you can pick the low-hanging fruit and make a compiler that parallelizes some things safely, but in the general case for modern imperative languages this is uncomputable -- i.e., in fact, impossible. If it is impossible at the limit to completely implicitly parallelize modern imperative languages, what then can be done?

Imperative programming languages allow us to indulge in some specific programming freedom that kills our ability to implicitly parallelize. It turns out that this is not just the ability to make use of side effects. To figure out what the specific issue is, we need to look more closely at the data dependency graph of a program.

The data dependency graph of a program: DAGs, partial orderings and lattices

The data dependency graph of a program is a diagram that shows which values (not variables) are required to compute which other values, and in which order. Since some value X must be available before computing another value Y as Y=f(X), and since values can't be defined circularly (e.g. X=Y+1 while at the same time Y=X+1, where X and Y are values, not variables), the data dependency graph must be directed and acyclic -- it must have the structure of a directed acyclic graph (DAG) or a partial ordering.

For the pedantic: a DAG is actually a lattice once you remove unnecessary nodes (i.e. once you trim unused computations). The rest of this document will interchangeably refer to partial orderings, DAGs or lattices, or will simply refer to this mode of computing as lattice-based computing.

A DAG is a partial ordering, as opposed to a total ordering, because if there is no directed path (following the arrows) from node P to node Q passing through zero or more other nodes, then the graph says nothing about whether P needs to be computed before Q or vice versa -- P could be computed before Q or Q before P, and it would not change the final output. In the example below, there is no defined order among b, c and d, so they can be computed in any order, as long as a is computed before any of b, c and d, and e is computed after both b and c, etc.

a = 5
b = f(a)
c = f(a * 2)
d = g(a) + f(a)
e = b + c
r = d / e

The data dependency graph of the program can be depicted as follows. (Note that in this particular diagram, arrows indicate "depends upon", which is the opposite direction from the arrow of time. The lines could also be drawn reversed, indicating the direction that data flows):

    b → a      c → a      d → a
    e → b      e → c
    r → d      r → e

According to this data dependency graph, reordering the lines slightly can still produce a valid program that produces the same result:

a = 5
d = g(a) + f(a)
c = f(a * 2)
b = f(a)
e = b + c
r = d / e

In fact there are 3! = 6 valid single-threaded programs in total just based on permuting b, c and d, all of which produce the same exact result -- but also, note that d can be computed before or after e, not just before or after b and c (because there is no directed path from d to e or vice versa).

Each of these equivalent programs is a sequential series of lines consistent with a topological sort of the data dependency DAG (it is a linear extension of the partial ordering): for any edge X→Y in the DAG (meaning X depends upon Y), Y comes before X in the program ordering.
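
As a purely illustrative sketch of how a scheduler could exploit this freedom, the following hand-written Java code (class and method names invented; the dependency table is transcribed from the example above) runs the example DAG in "waves", launching in parallel every value whose dependencies have already been computed:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class WaveScheduler {
    // "Depends upon" edges from the example program above
    static final Map<String, List<String>> DEPS = Map.of(
            "a", List.of(),
            "b", List.of("a"),
            "c", List.of("a"),
            "d", List.of("a"),
            "e", List.of("b", "c"),
            "r", List.of("d", "e"));

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        Set<String> done = new HashSet<>();
        Set<String> remaining = new HashSet<>(DEPS.keySet());
        while (!remaining.isEmpty()) {
            // Every node whose dependencies are all computed forms one parallel
            // wave: {a}, then {b, c, d}, then {e}, then {r}
            List<String> wave = new ArrayList<>();
            for (String node : remaining)
                if (done.containsAll(DEPS.get(node)))
                    wave.add(node);
            List<Future<?>> running = new ArrayList<>();
            for (String node : wave)
                running.add(pool.submit(() -> System.out.println("computing " + node)));
            for (Future<?> f : running)
                f.get();  // wait for the whole wave to finish
            done.addAll(wave);
            remaining.removeAll(wave);
        }
        pool.shutdown();
    }
}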

The linear order of program lines is completely artificial

This is familiar to programmers: you typically can't use a value or variable before it is defined. However, this also illustrates an incredibly interesting point: the programmer often has a lot of freedom to move lines around inside a program, within the ordering constraints, and the final computed value is not affected. Therefore, when we write code, the serialized ordering of the program lines is in many cases completely artificial -- the programmer is merely being asked to perform a topological sort in their head.

This is not only pointless -- we should probably one day be editing code directly, graphically, as a DAG -- but it also obscures opportunities for parallelization.

Languages are implicitly parallel if and only if the data dependency graph of any program in the language can be determined statically

The true underlying data dependency graph of values in a program will always be a DAG because of the arrow of time, even if the structure of the graph of variable references that can be observed in a program's source is not in fact a DAG (as in the general case for imperative programming languages).

For imperative languages, the true underlying data dependency graph of a program cannot be determined at compiletime (i.e. statically) because it is impossible for a compiler to predict the value of a variable at some future time without actually running the program. For functional programming languages, the data dependency graph can be precisely known at compiletime, because the structure of the program is the data dependency DAG.

This is the actual, necessary and sufficient, underlying reason why functional programming languages are implicitly parallelizable and imperative programming languages are not. Implicit parallelizability only indirectly has to do with side effects.

Crossing the chasm: implicit parallelization without the pain of functional programming

Programmers don't actually need side effects per se, they just need to be able to do certain things in natural ways. Is it possible to build a programming language that feels imperative, but is just as implicitly-parallelizable as a functional programming language?

To attempt this, we need to figure out the minimal way to restrict an imperative programming language such that its data dependency graph can be statically determined.

Perhaps a little obliquely, it turns out that Einstein was right: when observers are physically separated (implying that communication between observers requires greater-than-zero time), simultaneity becomes a relative concept. Race conditions happen in imperative programming languages when correct partial orderings are not enforced, or when the ordering can't be guaranteed because the code is running on multiple, non-omniscient cores simultaneously. In multithreaded scenarios, simultaneity -- the concept of "now" -- becomes a relative thing. Therefore, data dependencies must be explicitly and statically encoded in the structure of the program itself to be able to guarantee that a compiler can optimally implicitly parallelize the program.

Following this logic, stating that a program's data dependency graph can be statically determined (and that, therefore, the program can be implicitly parallelized) is exactly the same as stating that there is no such thing as the concept of "now" in the program. Let me state that boldly:

The specific minimal way that an imperative programming language must be restricted to make programs written in that language precisely implicitly parallelizable is to make it impossible to read the current value of a variable.

Isn't reading the current value of a variable one of the most fundamental things you'd ever want to do in a programming language? What are the alternatives to reading the current value?
  1. Reading the only value that a variable will ever take. This implies immutability (dealing with constant values, not variables), which yields pure functional programming. A value can't be used until it is computed, and once it is computed, it can't change.
  2. Reading the value of a variable at a specific (local) timestamp. This implies recurrence relations, e.g. x' = x + 1 or x[t] = x[t-1] + 1. Adding recurrence relations to a functional programming core allows you to add imperative-style for-loops back into your language. Recurrence relations ensure a precise DAG ordering between operations, because you always have to specify which version of a variable (which specific value) you are talking about. (A small sketch of this style appears just after this list.)
  3. Reading the set of all values that a variable will ever take on. If you try to write a value to a variable from multiple places, the type system in your language must constrain that variable to be an unordered collection, and the write operations must add those values to the collection. Everything that tries to read from the collection is scheduled to run after the last item is written and the collection is finalized.
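
As promised above, here is a small hand-written Java sketch of alternative 2 (ordinary Java used purely for illustration): instead of repeatedly overwriting a single running total, each loop iteration defines a new value x[t] in terms of x[t-1], so the dependency of every value on the previous one is explicit in the program text:

public class RunningTotal {
    public static void main(String[] args) {
        int[] a = {3, 1, 4, 1, 5};

        // x[t] = x[t-1] + a[t-1], with x[0] = 0: a running total written as a
        // recurrence over named values rather than repeated writes to one variable
        int[] x = new int[a.length + 1];
        x[0] = 0;
        for (int t = 1; t <= a.length; t++)
            x[t] = x[t - 1] + a[t - 1];

        System.out.println(x[a.length]);  // 14 -- the "current value" is just the last value
    }
}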

Enabling the pushing or scattering of values into a collection (the third of the above alternatives) allows you to restore an imperative-like feel to your language. The simplest example of this is the histogram calculation code given above: in concept, for every character c, the value 1 is pushed into a bucket ones[c], then the sum of those 1s is computed to produce hist[c]. It actually doesn't matter what order the 1s come in, because addition is commutative -- so ones[c] can be an unordered collection of some sort (which is much more highly parallelizable than the original code hist[c]++). [Note the similarity here to MapReduce.] In some Java-esque language that implements a push operator, say '->', the histogram code might look something like the following:

/** Produce a histogram of the number of instances of each
 *  char, using a push operator '->'
 */
int[] histogram(char[] input) {
    // Create and initialize an array of bags (unordered multisets --
    // unlike a set, a bag keeps duplicate 1s rather than collapsing them)
    var ones :: Bag<Integer>[Character.MAX_VALUE + 1];

    // Push a 1 to the bag corresponding to each character
    // in the input array
    for (char c in input)
        1 -> ones[c];  // Basically the same as ones[c].add(1);

    // For each c, sum all the 1s in ones[c] to produce
    // the result, hist[c]:
    var hist = map(ones, sum);
    return hist;
}

This produces a histogram by pushing a 1 into the appropriate collection in an array of unordered collections (ones[]), then summing the contents of each collection using the sum function.


This may seem like an inefficient way to solve this particular problem, but expressing the problem this way gives the compiler a lot of freedom to decide how to parallelize this particular piece of code, and for more complicated cases this can in fact be the best (and most natural) way to express the idea of imperative computation. The compiler is also free to parallelize this code however it wants, including optimizing out the bags of ones and hoisting the sum function up past the push operator, to form the simple pattern of hist[c]++ inside a loop, if the compiler thinks parallelization is not worth it. In general, expressing imperative-style computation as scatter operations to unordered collections gives the compiler the freedom to apply a lot of neat algebraic transformations to the code to maximize the degree of parallelization.

Scatter operations also allow us to push values to a destination collection from several different places in a program. In this case, only after all possible writes to a given collection have completed can the contents of the collection be read by something else. This provides a sequencing of operations upstream and downstream of a collection node in a DAG. Anything that reads from an unordered collection must then expect the values to be read in an arbitrary order, i.e. it will be a compile-time error (caught by the type system) if you try to apply a non-commutative operator to the contents of an unordered collection in the order they are read out. Orderedness, therefore, along with algebraic properties of binary functions like commutativity and associativity, must be a first-class concept in the language's type system if the push operator is provided.

This is not a bad thing: it turns out that knowing these algebraic properties of functions in your language actually provides the compiler with new ways to parallelize your code that aren't even immediately obvious from the data dependency graph: the more degrees of freedom provided to the compiler by properties such as associativity and commutativity, the more the compiler can parallelize your code.
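
For instance (an illustration using today's Java streams rather than any new language feature), it is precisely because "+" is associative and commutative that a reduction can be split into independent partial sums and recombined in whatever order the pieces finish:

import java.util.stream.IntStream;

public class ReorderableSum {
    public static void main(String[] args) {
        // Serial left-to-right sum of 1..1,000,000
        long serial = IntStream.rangeClosed(1, 1_000_000).asLongStream().sum();

        // The same reduction split across cores into partial sums that are
        // combined in whatever order they finish -- legal only because "+"
        // is associative and commutative
        long parallel = IntStream.rangeClosed(1, 1_000_000)
                                 .asLongStream().parallel().sum();

        System.out.println(serial + " == " + parallel);  // 500000500000 == 500000500000
    }
}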

Note also the similarity between this push operator and the way that MapReduce operations work: in the example above, values (here just the value 1) are pushed to specific keys (here the character value), and then everything sharing a key is aggregated or reduced (here by the sum function). Lattice-based computing with push operations effectively turns an entire program (not just a few manually-parallelized operations) into a fine-grained MapReduce pipeline, with zero programmer effort. The resulting code can be parallelized locally across the available cores or on a Hadoop cluster -- it doesn't matter; the compiler would provide backends to generate code for both.
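
As a rough present-day approximation of that shape (a sketch using ordinary Java streams to stand in for the hypothetical push operator), each character is mapped to a key, and then everything sharing a key is reduced with a count -- the same map/shuffle/reduce pattern:

import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CharCount {
    public static void main(String[] args) {
        // "Push" each character to its own key, then reduce each key's group to a count
        Map<Character, Long> hist = "abracadabra".chars()
                .mapToObj(c -> (char) c)
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
        System.out.println(hist);  // a=5, b=2, c=1, d=1, r=2 (in some order)
    }
}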

Lattice-based computing yields several nice properties

Often, a sign of a sound design decision is that it causes a great many other things to fall neatly into place. If we statically restrict a program's data dependency graph to be a DAG or lattice, a whole slew of other nice language properties cascades into place:
  1. Precise implicit parallelization: As has been the focus of this document so far, lattice-based programming languages allow a compiler to figure out which branches of a program DAG can be executed in parallel. Note that Hadoop / MapReduce pipelines are actually themselves DAGs -- the parallelization strategy described here is much more fine-grained and general than MapReduce, and pervasive within the structure of the language, but a DAG program can be seen as a big MapReduce pipeline. However, in spite of the compiler being free to choose how and where to parallelize, in general there will be many different ways to slice a program such that the maximal use of CPU cores is obtained at runtime. It may be hard to choose a priori where to draw the line between what should run on each core to ensure optimal performance. Fortunately, the fact that the program's data dependency graph is available at compiletime also gives the compiler the following ability:
  2. Precise static computation of runtime and storage requirements (big-Oh complexity): for each node in the DAG, the runtime needed to compute the contents of the node and the space needed to store them can be pre-computed (or at least approximated) by the compiler as a function of input data sizes. This allows the compiler to try several different parallelization strategies, or even several different types of equivalent data structure or equivalent algorithm with different runtime and storage requirements, and to switch between them at compiletime or runtime depending upon input data sizes. The compiler can, for example, decide to run a serial version of an algorithm below a certain data size (when parallelization overheads are too high), and switch to a parallel version above that size (see the sketch after this list). Note that it is also the ability to calculate the big-Oh complexity of a compute operation and its associated communications that will allow the compiler to generate optimal code for a range of different parallel architectures (CPUs vs. GPUs vs. compute clusters), giving us the unique ability to write once, parallelize anywhere.
  3. Precise implicit memory allocation and deallocation: Even better, in a lattice-based programming language, there is no need for the programmer to explicitly call malloc/free (like in C), new/delete (like in C++) or new followed by relying on the garbage collector (like in Java): the compiler produces code that allocates the memory for a given node in the DAG right before it is due to be computed, and frees the memory for that node as soon as everything that depends upon that value has read from it. There is no wasted memory due to collections being held in-scope long after the last thing referring to them has read from them, and there are no unpredictable GC pauses (which is the reason Google won't use Java for a lot of mission-critical stuff).
  4. No NullPointerExceptions, dangling pointers etc.: NPEs and dangling pointers arise from the same roots as race conditions. Simple range-checking combined with static knowledge of the data dependency DAG completely eliminates these.
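
As a rough sketch of the data-size-based dispatch described in point 2 (hand-written, with an invented cutoff constant; a lattice-based compiler would derive the threshold from its cost model rather than hard-coding it):

import java.util.Arrays;

public class SizeBasedDispatch {
    // Invented threshold: below this, thread startup costs more than it saves
    static final int PARALLEL_CUTOFF = 1 << 16;

    static long sum(int[] data) {
        if (data.length < PARALLEL_CUTOFF) {
            // Small input: plain serial loop, no parallelization overhead
            long s = 0;
            for (int x : data)
                s += x;
            return s;
        }
        // Large input: fan the reduction out across the available cores
        return Arrays.stream(data).parallel().asLongStream().sum();
    }
}
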
Lattice-based computing solves many longstanding software engineering problems in one fell swoop

Examining these nice language properties that cascade from the single decision to statically enforce a DAG or lattice structure on the data dependency graph of a program, we can see that a large number of longstanding problems in software engineering are directly solved:
  • The multicore dilemma: this is solved in lattice-based programming by simply parallel-scheduling computation in compliance with the partial ordering of the program lattice.
  • Parallelization across heterogeneous architectures: in lattice-based programming, the compiler knows the big-Oh costs of different operations on each type of architecture, so it can generate an optimal parallelization strategy for each architecture.
  • Race conditions: a DAG is a partial ordering over values, not variables, so there is no ambiguity in timing or ordering -- i.e. the "D" in "DAG" ensures there are no race conditions.
  • Deadlocks: there can be no deadlocks, because there are no cycles -- i.e. the "A" in "DAG" ensures there are no deadlocks.
  • Uninitialized memory: it is impossible to access uninitialized memory, because an operation that wants to access a given value in a lattice isn't even scheduled to be run until the value is computed, finalized and available to be read.
  • Manual memory management: the hassle of managing your own memory completely goes away with lattice-based computing, because the compiler does it all for you. You just use values wherever you want to use them, and the compiler allocates and frees stuff for you in just the right places in the generated code.
  • Dangling pointers: it's impossible to try to refer to something after it has been freed, because it wouldn't have been freed if you still have a reference to it.
  • Multiple-free: it's impossible to try to free something twice, because the compiler frees things exactly once for you, as soon as a value or collection is no longer needed.
  • Memory leaks: it should be impossible to leak memory in the traditional sense, because you can't lose a reference to allocated memory if it still needs to be read in the future.
  • Memory inefficiency (keeping things in memory longer than needed): references -- even references defined in long-lived scopes -- are not held any longer than needed. Even garbage-collected languages can't solve this problem in the general case (stack frames may unnecessarily hold references long after an object is no longer needed), because garbage collectors cannot anticipate future usage or non-usage of references in a stack frame. In a lattice-based language such as Flow (described at the end of this document), objects can be freed as soon as they are no longer needed, even if the containing scope is still around, i.e. the ability to free an object is not limited to the point when references are popped off the program stack.
  • NullPointerExceptions / segfaults: these are normally runtime errors, but they can be caught at compile-time with proper typing and range checking. This should be easier to do with lattice-based computing than with a general imperative programming language, because the dataflow model is well-defined and statically determinable.
Lattice-based computing presents a completely general model of computing that should scale from CPUs to GPUs to massive compute clusters: there's no reason you shouldn't be able to generate pthreads code, GPU/CPU hybrid code and Hadoop code from the exact same source, potentially providing to programmers the Holy Grail ability to write once, parallelize anywhere.

Hopefully this plays out as well in practice as in theory :-)  A programming language that implements these ideas, currently named "Flow" ("flowlang") is in the early stages of prototyping. (Yes, it is currently vaporware -- but it is in active development.) For gory details, see the older document The Flow Manifesto.

--

Luke Hutchison
2012-03-10