Wednesday, May 16, 2012

The multicore dilemma (in the big data era) is worse than you think

I'm currently applying for a grant to work on the Flow programming language. I thought I would post the following excerpt from the abstract and the background section, particularly because it presents a number of aspects of the multicore dilemma that are not discussed frequently enough.

Abstract

The arrival of the big data era almost exactly coincided with a plateau in per-core CPU speeds and the beginning of the multicore era, and yet, as recently stated by the National Research Council, no compelling software model for sustaining growth in parallel computing has yet emerged. The multicore dilemma, therefore, presents the biggest near-term challenge to progress in big data. We argue that the multicore dilemma is actually a substantially worse problem than generally understood: we are headed not just for an era of proportionately slower software, but significantly buggier software, as the human inability to write good parallel code is combined with the widespread need to use available CPU resources and the substantial increase in the number of scientists with no CS background having to write code to get their job done. The only acceptable long-term solution to this problem is implicit parallelization -- the automatic parallelization of programmer code with close to zero programmer effort. However, it is uncomputable in the general case to automatically parallelize imperative programming languages, because the data dependency graph of a program can’t be precisely determined without actually running the code, yet most “mere mortal” programmers are not able to be productive in functional programming languages (which are, in fact, implicitly parallel). We need a new programming model that allows programmers to work in a subset of the imperative style while allowing the compiler to safely automatically parallelize our code.

Background

The arrival of the big data era has almost exactly coincided with a plateau in per-core CPU speeds [see figure below]. This may be the biggest near-term challenge in the big data era.

The intent of the background provided here is not to simply rehash what is already widely known about the apparent demise of Moore’s Law or the arrival of the big data era, but to describe an uncommonly-discussed set of issues that arise due to the co-arrival of these two phenomena, and to demonstrate the vital importance in this context of lattice-based computing, our specific proposed method for solving the multicore dilemma, which safely adds automatic parallelization to a maximal subset of the imperative programming paradigm in order to be accessible to the average programmer.

Processor performance from 1986 to 2008 as measured by the benchmark suite SPECint2000 and consensus targets from the International Technology Roadmap for Semiconductors for 2009 to 2020. The vertical scale is logarithmic. A break in the growth rate at around 2004 can be seen. Before 2004, processor performance was growing by a factor of about 100 per decade; since 2004, processor performance has been growing and is forecasted to grow by a factor of only about 2 per decade. Source: [Fuller & Millett 2010]


We can’t “solve” big data without solving the multicore dilemma

The only acceptable path forward for big data is multicore (and cluster) computing, therefore problems in big data cannot be separated from the multicore dilemma [1]: the only way to handle an exponential increase in data generated each year is to exponentially increase our efficient usage of parallel CPU core resources.

We can’t solve the multicore dilemma without solving automatic parallelization

A great many techniques have been developed for parallel computing, however all of them are too error-prone, too unnatural for the programmer to use without significant design effort, require too much boilerplate code, or require too much computer science expertise to use well. The recent National Research Council report, “The Future of Computing Performance: Game Over or Next Level?” by the Committee on Sustaining Growth in Computing Performance stated: "Finding: There is no known alternative for sustaining growth in computing performance; however, no compelling programming paradigms for general parallel systems have yet emerged." [Fuller & Millett 2010, p.81]

Ultimately, no matter what kinds of parallel computing paradigms may be created [2], they will still be paradigms, which will require programmers to adapt their own algorithms to properly fit within. Long-term, the only sustainable approach to parallelization is the one that requires close to zero programmer effort: “Write once, parallelize anywhere”.

The lead authors of the NRC report observe [Fuller & Millett 2011] that “The intellectual keystone of this endeavor is rethinking programming models so that programmers can express application parallelism naturally. This will let parallel software be developed for diverse systems rather than specific configurations, and let system software deal with balancing computation and minimizing communication among multiple computational units.”

Without finding the appropriate computing model for generalized automatic parallelization, some great challenges lie ahead in the big data era. In fact, if we plan to simply rely on explicit programmer parallelization, the problems that arise from simply trying to make use of the available cores may prove significantly worse than the lack of efficiency of single-threaded code, as described below.

The multicore dilemma meets big data: incidental casualties

It is widely accepted that we are currently facing not just the apparent imminent “demise of Moore’s Law” [3], but also the arrival of the big data era. However, there are several important ramifications of the confluence of these two phenomena that have been widely overlooked:

  1. We are facing an impending software train-wreck: Very few people have considered the fact that as per-core speeds plateau, we are not just headed towards a plateau in software speed, but in our desire to continue the progress of Moore's Law, the human inability to write good multithreaded code is actually leading us towards an era of significantly buggier software. If we do not solve the multicore dilemma soon, many of the 99.99% of programmers who should never be writing multithreaded code will need to do so due to the arrival of massive quantities of data in all fields, potentially leading to a massive software train-wreck.
     
  2. We are facing an impending software morass: The world may also be headed for an unprecedented software morass, as all programmers and software development teams that care about performance begun to get mired in the details of having to multithread or otherwise parallelize their code. All current platforms, libraries and techniques for making use of multiple cores introduce the following issues:
     
  1. significant overhead in terms of programmer effort to write code in the first place -- the data dependencies inherent in the problem must be determined manually by the programmer, and then the programmer’s design must be shoehorned into the parallelization model afforded by the framework or tools being employed;
     
  2. significant increased complexity to debug, maintain and extend a multithreaded application once developed, due to increased boilerplate code, increased logical complexity, increased opportunity for introducing very hard-to-trace bugs (race conditions and deadlocks), and due to the fact that the presence of a significant amount of parallelization code, structured in an unnatural way, obscures the original programmer’s intent and the design of the program;
     
  3. a significantly longer iterative development cycle time, which can cause large-scale projects to break down and ultimately fail; and
     
  4. significant extra economic cost for employing teams to write and maintain parallel code.
     
  1. We urgently need automatic parallelization for the hoardes of “mere mortal” programmers that are coming online: We urgently need a programming language for the masses that does not require a background in parallel and distributed systems. The world has not only a lot more data today, but since everybody now has a big data problem, a lot more programmers -- of all skill sets and skill levels, and from all fields -- are being required to parallelize their code. Furthermore, because almost all major breakthroughs in science today happen at the intersection between different disciplines, many programmers with little or no formal CS background (and therefore no training on the complex issues inherent in concurrency) are writing code, and need the benefit of automatic parallelization to make use of available CPU resources to decrease runtime. Most programmers in the life sciences end up cobbling together shell scripts as makeshift work queues to launch enough jobs in parallel to make use of CPU resources. It will soon no longer be possible to do biology without doing computational biology, or physics without doing computational physics, etc. -- and it will soon no longer be possible to do computational science of any form without making use of all available cores. The NRC report gave the following advice:

    "Recommendation: Invest in research and development of programming methods that will enable efficient use of parallel systems not only by parallel systems experts but also by typical programmers" [Fuller & Millett 2010, p.99]
     
  1. After the big data era will be “the messy data era”, specifically, the era when the focus of research effort is not how to deal with the quantity of data being generated, but on its quality or qualities. Data types are becoming increasingly diverse, increasingly varied in numbers and types of attribute, increasingly sparse, increasingly cross-linked with other disparate datasets, and increasingly social in nature. We will not be free to attack problems of data quality if the answer to our data quantity problems are “use this library”. The only acceptable answer to the multicore dilemma is, “Build a smarter compiler that will automatically parallelize your code and then get out of the way, allowing programmers get on with more important issues”.
     
  2. Slowdowns in the progress of Moore’s Law will cause bottlenecks in progress in other exponentially-progressing, data-intensive fields: Many fields of science and technology are experiencing exponential growth or progress of some form. Nowhere is this more obvious than in the cost of genome sequencing, which was already decreasing exponentially in cost a comparable rate to Moore’s Law, but began to drop at an even more precipitous rate with the advent of next-gen sequencing techniques [see figure below]. However, within the last couple of years, next-gen sequencing has begun to produce so much data that our ability to store and analyze it has become a huge bottleneck. This is likely to be a major factor in the fast return of the sequencing cost curve back to the usual Moore’s Law gradient, as pictured. All data-intensive fields will be rate-limited by our ability to process the data, and until a solution to the multicore dilemma is solved, the degree of rate limiting will be significantly worse than it should be or could be.
     
DNA Sequencing costs: Data from the NHGRI Large-Scale Genome Sequencing Program. [NHGRI 2012]


The solution to the multicore dilemma: “the sufficiently smart compiler”

It is clear that the multicore dilemma is one of the most pressing issues directly or indirectly impacting all of scientific research today, and that a robust solution must be found as quickly as possible. Paul Graham recently wrote [Graham 2012]:

"Bring Back Moore's Law: The last 10 years have reminded us what Moore's Law actually says. Till about 2002 you could safely misinterpret it as promising that clock speeds would double every 18 months. Actually what it says is that circuit densities will double every 18 months. It used to seem pedantic to point that out. Not any more. Intel can no longer give us faster CPUs, just more of them.

"This Moore's Law is not as good as the old one. Moore's Law used to mean that if your software was slow, all you had to do was wait, and the inexorable progress of hardware would solve your problems. Now if your software is slow you have to rewrite it to do more things in parallel, which is a lot more work than waiting.

"It would be great if a startup could give us something of the old Moore's Law back, by writing software that could make a large number of CPUs look to the developer like one very fast CPU. There are several ways to approach this problem. The most ambitious is to try to do it automatically: to write a compiler that will parallelize our code for us. There's a name for this compiler, the sufficiently smart compiler, and it is a byword for impossibility. But is it really impossible? Is there no configuration of the bits in memory of a present day computer that is this compiler? If you really think so, you should try to prove it, because that would be an interesting result. And if it's not impossible but simply very hard, it might be worth trying to write it. The expected value would be high even if the chance of succeeding was low."
[emphasis added]

Mary Hall et al. [Hall 2009] furthermore observe that “exploiting large-scale parallel hardware will be essential for improving an application’s performance or its capabilities in terms of execution speed and power consumption. The challenge for compiler research is how to enable the exploitation of the [processing] power of the target machine, including its parallelism, without undue programmer effort.”

The “sufficiently smart compiler” cannot actually exist (for modern imperative programming languages)

Functional programming languages are implicitly parallelizable, because calls to pure functions have no side effects. Many efforts to partially or fully automatically parallelize imperative programming languages have been attempted, such as OpenMP for C/C++, with varying success and varying levels of guarantee as to safety of generated code. Generally, the compiler must limit actual automatic parallelization to a small subset of operations that can safely be executed in parallel with a high degree of confidence, but there is a much larger subset of operations that could be parallelized at much greater risk of introducing race conditions or deadlocks into the code.

It turns out to be not just hard, but uncomputable to automatically parallelize an imperative programming language in the general case, because the actual data dependency graph of an imperative program (the graph of what values are used to compute what other values) can’t be known until runtime when values are actually read from variables or memory locations. Any static analysis can only guess at the origin or identity of specific values that might be read from a variable at runtime without actually running the code, hence the uncomputability of static data dependency analysis in the general case. (For example, reducing a program to Single Static Assignment form does not give any guarantees that values won’t change underneath your feet at runtime due to pointer aliasing etc.) Therefore, the sufficiently smart compiler cannot actually exist for modern imperative programming languages.

We propose a completely new programming language model for which the “sufficiently smart compiler” can be created

In spite of the fact that imperative languages cannot be automatically parallelized, the solution is not to focus our efforts on automatically parallelizing pure functional languages, because most “mere mortal” programmers are not able to be productive in functional programming languages. We need to create a new class of programming languages that represents the maximal subset of the imperative programming model that permits automatic parallelization, so that the language “feels” normal to average programmers, but so that it provides safely parallelized code with close to zero effort.

Restated, we must find the axiomatic reason why imperative programming languages are not automatically parallelizable, and produce a language that feels as imperative as possible, but that is properly and minimally constrained so that all valid programs may be automatically and safely parallelized by the compiler.

In this proposal we present a necessary and sufficient definition of a language model that satisfies these constraints, and we propose the development of a compiler to implement it.



References
  1. [Fuller & Millett 2010] -- Fuller & Millett (Eds.), The Future of Computing Performance: Game Over or Next Level? Committee on Sustaining Growth in Computing Performance, National Research Council, The National Academies Press, 2010
  1. [Fuller & Millett 2011] -- Fuller & Millett, Computing Performance: Game Over or Next Level? IEEE Computer, January 2011 cover feature.
  2. [Graham 2012] -- Paul Graham, Frighteningly Ambitious Startup Ideas, March 2012.
  3. [Hall 2009] -- Mary Hall, David Padua, and Keshav Pingali, 2009, Compiler research: The next 50 years, Communications of the AC 52(2): 60-67.
  4. [NHGRI 2012] -- DNA Sequencing costs: Data from the NHGRI Large-Scale Genome Sequencing Program. NHGRI, last updated January 2012.



[1] The multicore dilemma can be defined as the combination of three factors: (1) the so-called “end of Moore’s Law” (as we hit up against several limits of the laws of physics in trying to increase per-core speeds), (2) the wide availability of multi-core CPUs as we begin “the multicore era”, as Moore’s Law continues as an exponential increase in the number of cores, rather than per-core speed, and (3) the lack of good tools, languages and techniques for reliably and easily making use of multiple CPU core resources without significant effort or difficulty.

[2] Examples of current paradigms include MapReduce/Hadoop; the abstractions in a concurrency framework like java.lang.concurrent; various forms of message passing (such as in Erlang) / actors; channels; co-routines; futures; software transactional memory; array programming; etc.

[3] Meaning the demise of what is called Moore’s Law in the common vernacular (that CPU speeds double approximately every two years) rather than the original form of Moore’s Law, which holds that the number of transistors that can be placed inexpensively on a chip doubles approximately every two years. Moore’s Law in its original form should continue to advance like clockwork for the next few years at least, but the same increase in the number of transistors on a die is now primarily due to an increase in the number of cores.

Wednesday, May 9, 2012

Flowlang and Static Single Assignment (SSA) form

I got the following email from a friend today:

I was noodling some source code transforms the other day, and I remembered about flowlang. As I was reviewing it, I wondered: what's the difference between flowlang and Static Single Assignment form for general languages? http://en.wikipedia.org/wiki/Static_single_assignment_form  They seem identical, but I'm probably missing something...

Here's my response:

Good question. SSA form is similar, but it makes only a weak assumption about disambiguation of variables and values, whereas Flow makes a strong (even guarantee-able) assumption about the same.

Basically, in a multithreaded environment, SSA form makes no guarantees that the rug won't be pulled out from under you, i.e. that you know at compiletime what specific value any given variable in the SSA expression refers to. In my opinion, this almost completely demolishes the usefulness of SSA form, because you are reasoning about assumed data dependencies that may, at runtime, turn out to be completely wrong.

SSA is the right way to think, and the nice thing is that all algorithms that have been developed for optimization of code in SSA form can be directly applied to Flow -- the difference is that Flow will always do the right thing at runtime.

It is impossible in the general case to generate SSA form that reflects the runtime data dependencies (rather than the static best-guess approximation of runtime data dependencies) for any general-purpose imperative programming language.

Perfectly determining runtime data dependencies without actually running the program is uncomputable for imperative languages, and even if you run the program, you only get one of a possibly large or infinite number of data dependency graphs, depending on how many race conditions might be present in your code.

Sunday, March 11, 2012

Solving the Multicore Dilemma

The new page Solving the Multicore Dilemma was just added. This explains the thinking behind Flow in a much more readable way than the original Flow Manifesto.

Friday, February 24, 2012

Flow and KNIME

Long-term I want to build an IDE for Flow that lets you build data pipelines graphically, and runs the code incrementally in realtime as you edit, re-running nodes only downstream of any changes you make. I also want to make Flow support user-definable views for different data types, so that you can visualize data moving through the pipeline in different ways. I had a big plan for how this was all going to fit together, but it turns out somebody has already beaten me to it in the form of KNIME (at least for the "big picture" of how the IDE would work) -- and they've done a beautiful job of it. (Note that Flow still has a much broader goal of solving the implicit parallelization problem in the general case, but KNIME at least implements a very flow-like IDE for handling and visualizing big data pipelines.)

http://www.knime.org/
http://www.knime.org/features
http://www.knime.org/screenshots

KNIME lets you build a data analysis pipeline, complete with data normalization and filtering, inference/classification and visualization steps. It caches data at each node in the workflow (so changes to the pipeline only result in the minimum necessary recalculation), and keeps track of which experimental variables produced which results. It intelligently makes use of multiple cores on your machine wherever possible. It incorporates the entire Weka machine learning framework. It lets you add your own visualizers for different data types. It cross-links the highlighting of data points between different tables and views, so that if you select a data point in one view, it selects it in all other views. It reads and writes a large number of different data formats and can read from / write to a live database. You can call out to R at any point if you have existing R code you need to run on a piece of data.

i.e. KNIME basically does everything that anybody who works with data does every day, and keeps everything tied together in a nice workflow framework with built-in data visualization, smart caching, smart parallel job launching etc. etc.

Thursday, May 26, 2011

DAGs, partial orderings, program lattices

Program lattices, partial orderings and directed acyclic graphs are *the* critical foundational constructs that Flow is built upon, so I will attempt to describe here what these things are, why they matter to programming languages, and how they fit in the design of Flow.

Difference between dataflow programming languages and Flow

Dataflow programming languages (where you can build a program as a "circuit diagram") are overly complicated. Flow is very different from most other dataflow programming languages, you can't just connect a dataflow graph however you want in the style of old-school flow charts. Flow charts died out for a reason: they produce spaghetti code. Flowcharts -- and, by derivation, most dataflow languages, allow the user to make almost arbitrary connections between any part of the chart/graph and any other part of the graph, including introducing overlapping inter-nested loops of arbitrary complexity. This is directly correlated with the GOTO keyword in BASIC that died a just and horrible death.

For the purposes of describing the control flow of a program, old-school flowcharts were briefly superceded in the 90s (at least at my university) by a second generation of control flow diagramming tools: hierarchical program flow models, which were like a UML hierarchical call chart, only oriented vertically: you would show the program structure in a series of nested function calls, drawn as a tree. In effect, this enforced pure hierarchical containment of control flow, i.e. it forced you to design a control flow for your code that was in essence purely functional in style.

The Flow model introduces a third paradigm, that of program lattices. A lattice is a partial ordering and a DAG by definition, as described below.

(1) Directed Acyclic Graphs (DAGs)
http://en.wikipedia.org/wiki/Directed_acyclic_graph

Definition

Loosely defined, these are basically like a forest of rooted trees (meaning trees with arcs/edges directed away from the root) whose nodes may branch out but whose branches may also join together (coalesce) as you move away from the root.

More formally, and by very definition, a DAG is an arbitrary directed graph that does not possess a cycle: there is simply no way to follow a path from any start node and end up back at the same start node again after a nonzero number of moves. (Note that if you take away the directionality of an arc, turning it into an undirected edge, the graph may indeed possess cycles, but in the directed form.)

Implication for programming language design

Most imperative programming languages conflate the concept of variables and values: the action of reading from a variable returns its *current value*, and because (except when working with constants) imperative languages primarily concern themselves with variables, the actual values those variables hold cannot be known at compiletime. Consequently, the data dependency graph in an imperative language (what needs to be computed before something else can be computed from it) is also a graph of dependencies between variables, not values, and it is entirely possible to create directed data dependency cycles in your program, for example if you have the statement, "i = i + 1" inside a loop.

The problem with this is that underlying the variable data dependency graph (the "compiletime data dependency graph") is something that might be called the true "data operand graph" (or the "runtime data dependency graph"), and this is defined by values, not variables. If you were to record exactly what happens during a specific run through the program, every time a variable is updated, you could create a new "value node" in the graph, and when the current value of a variable is read, you would add an edge to a graph that points to the specific value node that corresponds to the current value of the variable.

In this model, it is easy to see where race conditions come from in a multithreaded context: if another thread pushes a value into a variable at an unexpected time, this actually changes the target node in the underlying "data operand graph" for the edge that corresponds to the the next read from the variable. In other words, the entire connectivity of the true underlying data dependency graph is at the mercy of the relative timing of reads and writes.

However if we remove the concept of "the current value of a variable" from the programming language, and replace it with "the one and only value this variable will ever take on" (immutability) or "the value of this variable at a specific timestamp" (syntactic sugar that allows aliasing of variables with a parameter that means something to the user and the compiler, e.g. "x(t) = x(t-1) + 1"), then this problem disappears.

In fact it turns out that the only compilable programs in a language with this constraint have compiletime and runtime data dependency graphs that are exactly the same, and that are DAGs, i.e. the language is syntactically constrained such that all valid programs have variable reference graphs that *are* the precise runtime data dependency graphs.

Functional programming languages already satisfy this, because all variables actually represent immutable values, but the requirement that the program's structure must exactly describe the runtime data dependency graph is actually in some ways a weaker condition than that imposed by functional programming, because it's possible to support a subset of imperative-style "scatter" computation in the DAG model whereas the functional style only really describes "gather" computation.

(2) Partial orderings
http://en.wikipedia.org/wiki/Partial_ordering

Definition

A partial ordering is a DAG where the directions on the arcs/edges follow the arrow of time, in other words where A -> B indicates "A happens before B" (or, as a dependency graph, A <- B indicates "A must happen before B can happen").

Assuming that time moves downwards, and given the following graph,

  A
 / \
B   C
 \ /
  D


we have that A must happen before B and C, and B and C have to happen before D. However, this graph says *nothing* about the relative order in which B and C must happen: B can happen before C, or C can happen before B, or both can happen concurrently.

Implication for programming language design

Direct examination of a partial ordering can yield a concurrent execution plan for a parallel program: if there is no directed path between two nodes that passes through zero or more other nodes in a partial ordering of work units to be completed, then the work units corresponding to the two nodes can be completed in parallel.

If your programming language has syntactically enforced that every program's reference structure is a DAG, then every program is also a partial ordering, and by examining the source alone, a simple parallelization plan can be determined in many cases.

(3) Lattices
http://en.wikipedia.org/wiki/Lattice_(order)

Definition

A lattice is a partial ordering with a single unique "top element" (least upper bound) and a single unique "bottom element" (greatest lower bound). Every node in the lattice is reachable by following some directed path from the single least upper bound, and there is a directed path from every node in the lattice to the single least lower bound.

Implication for programming language design

In a programming language, the "top element" could be thought of as the start state of the program, and any non-unique greatest lower bound indicates some value that was computed but not used before the program terminates at the "bottom element", and therefore may be pruned as an unused value. Program DAGs pruned of unused computation become lattices. Hence if the compiler is doing its job, a Flow program will really be a lattice and not just a DAG.

Lattices in Flow

In Flow, a node of a DAG typically represents a collection, and the arcs coming into a node represent the data dependencies required to compute the contents of that collection (as well as, collectively, the operation that actually computes the contents of the collection based the values of its dependencies).

Arbitrary conditional control flow is specified in Flow by conditionally nesting/embedding sub-lattices within a bigger lattice (i.e. nodes in lattices can be recursively expanded into entire nested lattices -- basically this is like recursive function application).

Because, with program lattices, there are no cycles or back-links that violate the arrow of time, it is impossible in Flow to create spaghetti "circuit diagrams", or there are no arbitrarily inter-nested loops (i.e. no GOTO equivalent).  However, you are also not forced into the pure functional mindset that emerges from requiring control flow to be purely the result of hierarchical/tree-like calls.

Structuring programs as lattices is not new per se, it's implicitly used in a weak form in "array programming" (performing elementwise operations on collections in parallel), but lattice-based computational models are not syntactically enforced elsewhere, and (it appears) are not yet used as the most important principle driving the design of programming languages other than Flow.

Saturday, May 7, 2011

On ordering and properties of domains and maps

Here is some of the thinking behind the "push" operator "->".  It basically maps to a "parallel-add-to-collection", ConcurrentCollection.add() or similar. The question was raised on the flowlang mailing list recently as to what the type was of this operator. There's quite a lot going on with type inference in Flow so it's worth highlighting those specific concepts in one place.  Also perhaps by rewording this, it may come across clearer than in the Flow Manifesto.

To recap relevant background:

Morphisms, domains and co-domains

Algorithms can basically be broken down into morphisms (elemental function applications) and control flow (loops/conditionals: the logic that connects data flowing into and out of morphisms together in some sort of fixed or dynamic graph). If the data dependency graph is viewed as a pipeline with DAG topology, then data can be pushed through the pipeline continuously but data that is all somehow mutually associated needs to be pushed through in "waves", and this gives rise to the need for collections at the nodes, not just single values.

[A morphism is fundamentally a map from a domain to a range, or a domain to a co-domain if you are talking about only the values in the range that are actually mapped to from the domain.  A specific co-domain is basically a collection (set or list, not a HashMap or other map type), and a morphism is basically the same thing as a function if it maps from a domain to an unknown range, or a map (HashMap or similar) if it maps from a finite domain to a specific co-domain.]

Flow tries hard to unify functions and maps -- and this makes a lot of sense, because in many functional programming languages, functions are "memoized" which basically means you cache a function's values in a HashMap to avoid having to re-compute them where possible. (Eventually we'll blur the line between what's computed at runtime and what's computed at compiletime, and some functions may be partially or wholely evaluated at compiletime and be transparently turned into a precomputed map, but that's a subject for another post.)

Tracking properties of variables

Every typed language keeps track of some information about variables (whether a variable represents an integer or a float, etc.), but Flow will take this much further, e.g. if a variable x can only represent an integer in the range [0..49] inclusive because it is the index of an item in an ordered collection of fixed constant size 50, then everywhere x is used, other things that are calculated from x have their own domains calculated relative to this number range.  e.g. if you set y = x/2 (using integer division), then y's range will be between [0..24] inclusive.

Tracking domains in close detail allows for some really powerful language features: for example, the isqrt() integer square root function might require a parameter that is an integer in the range [0..], so if you pass in an integer with unknown range or with known range that includes negative numbers, you'll get a compiletime error. No more division by zero etc. -- and you can even be warned about possible overflow, underflow, NaN situations etc. at compiletime.  (It will take some experimentation to figure out how pedantic to make this without annoying the programmer in some cases with variables whose value isn't known at compiletime.)  Also this can be extended to other types to eliminate NullPointerExceptions: you will know if it's possible for a reference's value to ever possibly be null just by looking to see if the "exceptional value" #Null is in the domain.

Tracking properties of collections and maps

The really powerful things that Flow will track are the properties of collections (co-domains, i.e. the specific collections that are produced by morphisms) and maps (morphisms).  This is where a lot of the magic will happen in Flow.

Some of the properties that will be tracked for collections include: orderedness; countability/iterability; sparseness; finiteness; whether or not duplicate objects are allowed.

Some of the properties that will be tracked for maps include: surjectivity; injectivity; bijectivity.

A couple of examples of traditional data representations that map to these:

(1) An array is a map from an ordered, dense, finite, iterable integer domain (the array indices) to a set of unordered values (the set of values in the array). (It is possible to iterate through the keys and pull out the values in array index order, but only if needed; if you don't need the values in a specific order, you can just consider the values in unordered set form).

(2) A sparse named array is a map from an ordered or unordered, sparse, finite, possibly iterable domain to some set of values.

(3) A map or morphism type could be created to read in lines from a text file, and the domain would be line numbers and the range would be lines of text.

Orderedness of domains

In most languages, sorting a collection is a manually-initiated operation: you start with an ArrayList<String>, for example, and you call Collections.sort() if you want that collection sorted.  There are also usually special collections that keep their elements sorted: for example, priority queues and TreeMaps.

In flow, orderedness is a first-class attribute of a collection. Sorting is not something you "do" in almost any case, it's simply the property of a domain or collection. And wherever possible, if orderedness is not required, collections are left unordered.  Collections are treated as unordered by default.

The reason for this is that there are a few basic algorithmic patterns that you see arising again and again: maps, folds and filters to name a few. Map operations are elementwise so don't impose any ordering constraints.  Folds only impose ordering constraints if the function being applied is not commutative (and they only impose "divide-and-conquer"-type grouping constraints if it's a parallel fold and the function being applied is not associative).  And, interestingly, filter operations only require their input to be ordered if something that depends on their output requires that the output is ordered relative to the input -- so the need for orderedness propagates lazily back through filter operations.  (The need for orderedness also propagates lazily back through map operations, if the output of the map is (key,value) pairs and not just values.)

Orderedness of domains and implicit parallelization

In fact, most sorting incurs overhead of some sort, but, worse than that, saying that you want a sorted collection when you don't ever apply some sort of non-commutative operator to the collection just unduly constrains the compiler from being able to find good parallelizations of your code.  So Flow will find the places that really need sorted data, and back-propagate the requirement as far as necessary up the data dependency graph to find other things that need to be ordered to produce output in the right order.

For example, if you are writing data out to a file, the method for writing lines of data will require an ordered collection, because the act of serializing elements of a collection is a non-commutative operation.  If you try to pass in an unordered collection of lines to be written to a file, you will get a compiletime error. This is how the compiler statically guarantees that operations like this are threadsafe. (Note however that it is possible for an operation to be non-commutative but still associative, for example string concatenation -- in this case, lets say you have 80 strings to concatenate into one string, and you have 8 threads, the compiler knows it has to keep all the strings in order relative to each other, so it requires the input be ordered, but it doesn't have to run the concatenation in series, it can split the list of 80 strings into 8 pieces containing 10 strings each, one per thread, and combine the results at the end to produce a single string. The compiler can transparently generate skip-lists or similar data structures under the hood to allow this sort of work splitting to be performed quickly.)

There could be several ways to specify that a non-commutative operator should first order a given unordered collection in an appropriate way.  One way would be to specify a comparator or "order by" clause, and this would produce a "view" of the collection that is remembered by the compiler, and reused without re-generating if the same ordering is used elsewhere in the program.  Another method would be to specify that the natural ordering of the elements in the collection should be used (e.g. lexicographic order for strings, increasing size for numbers).  Another method would be to have a function that takes a map with naturally-orderable keys and unordered values, and produces a list of values in the corresponding key order (this would be useful for dumping out values in arrays in order, for example) -- but I want to avoid this where possible because it implies launching a sort operation manually. I think the compiler should almost always make the decision as to what should be sorted and when, to minimize sorting wherever possible, again because this not only incurs overhead but it reduces parallelizability.  *Everything* in Flow should be designed with maximum parallelizability as THE driver.

Back to the "->" push operator

Back to "->": The reason specified in the previous post that "->" would push values into an unordered collection by default is mostly that it allows for maximum degrees of freedom in parallelizing.

Also hopefully the above description begins to answer questions about type inference for push operations.  The type of the target of a push operation is simply a set of elements whose type is the same as the type of the values being pushed into the collection.

Note that you can push into the same collection from multiple places in your source, and the compiler will try to unify the types that are being pushed from different places. If they are not unifiable, you will get a compiletime error.

As mentioned above, any requirement for orderedness is back-propagated by the compiler from operations that require orderedness because of non-commutativity.  So it is possible that ultimately the push target will end up ordered.  In that case, a sort or "shuffle" phase (to use Google's MapReduce terminology) will automatically be inserted by the compiler, or the collection will be turned into an ordered collection like a TreeSet, depending on what is most efficient.

Monday, May 2, 2011

Flow syntax and semantics

I'm cross-posting the following from the "flowlang" Google Group, with minor modification.

--

It might be helpful to think about the creation of a syntax for Flow that maps onto the semantics described in the Flow Manifesto as follows:
  • Take a very simple purely functional programming language syntax -- maybe a subset of Haskell, or a functional subset of Python.
  • Add one operator, a scatter/push operator, say "->", that gives the language more of an imperative feel.
  • [You can also add timestamps, e.g. x(t) = x(t-1) * 2. These also give the language more of an imperative feel, but I won't talk about those here for brevity. They're just basically a convenient way of aliasing immutable variables that has semantic value to both the user and the compiler.]
Then assignment, "=", is a pull / gather operation: y = f(x) pulls a value from x and applies f then pulls the computed value into y as expected.

The way that "->" operates is that it pushes / scatters values to locations, which is typical of the imperative style of programming. For example, take the following Java code:

int[] ys = new int[] {1, 5, 2, 9, 1, 1, 4, 2};
int maxVal = 0;
for (int i = 0; i < ys.length; i++)
    maxVal = Math.max(maxVal, ys[i]);
int[] hist = new int[maxVal + 1];
for (int y : ys)
    hist[y]++;


You would do something like this in Flow (making up the syntax):

ys = {1, 5, 2, 9, 1, 1, 4, 2}
for y in ys
    1 -> ones[y]      // each entry ones[y] is a set of integers (with value 1)
hist[i] = sum ones[i] // implicit iteration: "for all i", i.e. hist = map sum ones


This histogram example is one I keep coming back to in my own mind because it's a common and minimal testcase for the "push" model of programming. Basically you're scattering counts into a histogram at indices corresponding to list values, which is not in general threadsafe and is therefore not parallelizable in imperative languages without extra work.

In the Java case I just incremented the histogram values directly; in the Flow case I output a stream of 1s and they were then collected by running a "map" operator over the collection ones[i].  The key thing to realize here is that ones[i] is constrained (by being the target of the "->" operator) to be an unordered collection, because otherwise you would get a race condition if you were pushing different values that had to stay in order relative to the input. It doesn't matter here whether it is ordered or not, of course, because everything that is getting pushed is a 1 -- but in more complicated cases it can matter. (If you force the target of a "->" operator to be ordered, then the compiler can still parallelize, but it will have to do some extra work to keep things in order.)

For smaller lists, i.e. ys.length < L, it can just run this as a single-threaded program.  For larger lists, the compiler is free to parallelize this in a lot of different ways, for example:
  1. It put a lock on each bin, ones[i], to prevent race conditions.
  2. By realizing that the "sum" operator is just folded addition, and by realizing that addition is associative and commutative, the compiler can build one copy of the "ones" array-of-sets in each thread's TLS (thread local storage), and then combine these separate copies at the end.
(The second version has higher memory requirements but is lock-free until the end. It incurs a bit more time at the combine stage. This can all be calculated as a Big-Oh complexity function of the input size, input value distribution, etc., and the compiler can switch between implementations as needed.)

There are other optimizations that can be performed, e.g. the compiler can perform something similar to "hoisting" by realizing that you're just pumping the ones through a set collection and then into a sum function, and that can be converted to an increment operation.

UPDATE:

The "->" operator should be able to support arbitrary MapReduce-style mapping, scattering, shuffling, grouping by key and reducing.  For example, let's say the "y in ys" actually represents an id for a Person record of some form, and you want to group everybody together who has the same first name, you should be able to do something like the following:

for y in ys
    p = persons[y]
    p -> firstNameGroups[p.firstName]

It might be more helpful in general to extend "->" to take (key,value) pairs, and group by keys:

for y in ys
    p = persons[y]
    (p.firstName, p) -> firstNameGroups

...or, for the histogram example:

for y in ys
    (y, 1) -> ones