[cpp-threads] Asynchronous Function Proposal

Doug Lea dl at cs.oswego.edu
Wed Jun 3 14:48:36 BST 2009


Robison, Arch wrote:
> You have to pick your poison.  Cilk and TBB make different choices on that.
> I call them "steal parent" versus "steal child".
> 
> In Cilk, spawning a child task causes the current thread to immediately
> execute the child, and the parent is put in a suspended state where it
> can be stolen by an idle thread.
> 
> In TBB, spawning a child task puts the child in a suspended state where it
> can be stolen by another thread, and the parent continues execution.  TBB can
> simulate Cilk "steal parent" if the programmer writes in explicit
> continuation-passing style, which is not easy.

Roughly the same holds for Java7 ForkJoin. A few of the internals
are described in some talk slides at
   http://gee.cs.oswego.edu/dl/cpjslides/fj-mar09.pdf
The main difference from TBB is the streamlining enabled
by reliance on high-performance generational GC. (It spews
massive amounts of garbage.)
I think the .Net Task Parallel Library is also roughly similar.
See http://research.microsoft.com/apps/pubs/default.aspx?id=77368
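To pin down the spawn-site difference, here is a minimal sketch, using
std::async/std::future (as later standardized) as a stand-in for the
proposed facility; the subtasks f and g are hypothetical, just to give
the spawn something to call. Only the "steal child" flavor can be
written directly; "steal parent" is described in comments because it
cannot be expressed with ordinary call/return:

    #include <future>

    // Hypothetical subtasks, only to give the spawn site work to do.
    static int f(int x) { return x * 2; }
    static int g(int y) { return y + 1; }

    int parent(int x, int y) {
        // "Steal child" (the TBB choice, and the natural shape for a
        // library-level async call): f(x) is packaged as a separate
        // task -- under a work-stealing scheduler, the piece an idle
        // thread could take -- while this thread keeps running the
        // parent and evaluates g(y).
        std::future<int> child = std::async(f, x);
        int mine = g(y);
        return mine + child.get();   // join; may block if f not yet run

        // "Steal parent" (the Cilk choice) cannot be written this way:
        // at the spawn the current thread would run f(x) immediately,
        // and the remainder of parent() -- g(y) and the join -- would
        // become the stealable piece, which under conventional calling
        // conventions means rewriting it as an explicit continuation.
    }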


> 
> The big wins from "steal parent" in Cilk are:
> * Threads never wait for completion of tasks, because the thread that
>   finishes the last child picks up execution of the joining parent.
> * Total space is (number of threads)*(space for sequential execution).
> The big drawback of Cilk is that it requires a non-standard calling
> convention that has cactus stacks and decouples stacks from threads.
> As Hans points out, "steal parent" causes weird things to happen with
> existing implementations of thread-local storage.

Right.

> I suggest that anyone prototyping the proposed asynchronous call
> functionality try the parallel_sum example over 10^9 values.  That
> yields a recursive tree of about 10^6 nodes, with a height of 20.  With
> Cilk, the space demands will be on the order of about P*(20 nodes).  In
> TBB without the heuristics, it could go quadratic in the tree depth,
> e.g. about 400 nodes.  The TBB heuristics keep it down to P*(20 nodes),
> but by sometimes limiting parallelism.  With any prototype of the
> proposal, I would want to know: how much space does it consume, and
> how well does it scale?
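For scale, a deliberately naive futures-based sketch of that benchmark
is below. The 1024-element cutoff is an assumption, but any similar
grain gives the numbers above (about 10^6 leaves, spawn-tree height
about 20). It shows only the recursion shape, not a real scheduler,
which is exactly why the space and scaling questions need measuring:

    #include <cstddef>
    #include <future>
    #include <numeric>
    #include <vector>

    // Naive sketch only: the default launch policy leaves it to the
    // implementation whether each child gets its own thread or is
    // deferred; a real prototype would plug its task scheduler in here.
    double parallel_sum(const double* a, std::size_t n) {
        if (n <= 1024)                                // assumed cutoff
            return std::accumulate(a, a + n, 0.0);    // serial leaf
        std::size_t half = n / 2;
        // The left half becomes the handed-off child; this thread
        // recurses on the right half itself, so its own live recursion
        // depth stays proportional to the tree height (~20 for 10^9).
        auto left = std::async(parallel_sum, a, half);
        double right = parallel_sum(a + half, n - half);
        return right + left.get();                    // join
    }

    int main() {
        std::vector<double> v(1u << 16, 1.0);   // scaled-down input
        return parallel_sum(v.data(), v.size()) == double(v.size()) ? 0 : 1;
    }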

The parallel-prefix operations in our (optional, non-Java7 "extra166y")
ParallelArray application classes use the cool-but-esoteric
manual-continuation approach in part for this reason, but also
to allow a bit more asynchrony, which helps a lot on machines
with relatively small numbers of processors.
(For the curious, these are currently hiding in class FJScan deep inside
http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/extra166y/PAS.java?view=log)
FWIW, my general attitude is that any algorithm likely to hit
quadratic bounds is also likely to be suboptimal on other grounds.
Even given the years of existence of lightweight task frameworks,
there is still lots of room for discovering better algorithms using
them.
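The manual-continuation trick, stripped of everything FJScan actually
does, amounts to reifying the join: no thread blocks waiting for
children; whichever child arrives last runs the parent's remaining
work. A minimal, purely illustrative sketch, using plain threads and
an atomic countdown (the real code is far more involved):

    #include <atomic>
    #include <cstdio>
    #include <functional>
    #include <thread>

    // Join state as an object: two children, a continuation, and a
    // countdown.  The last child to complete invokes the continuation,
    // so neither the parent nor any worker blocks at the join.
    struct JoinTwo {
        std::atomic<int> pending{2};
        double left = 0, right = 0;
        std::function<void(double)> k;       // continuation: gets the sum

        void finish_left(double v)  { left = v;  complete(); }
        void finish_right(double v) { right = v; complete(); }
        void complete() {
            if (pending.fetch_sub(1) == 1)   // last child to arrive
                k(left + right);             // run the parent's "rest"
        }
    };

    int main() {
        JoinTwo j;
        j.k = [](double total) { std::printf("total = %g\n", total); };
        std::thread a([&] { j.finish_left(1.5); });
        std::thread b([&] { j.finish_right(2.5); });
        a.join();
        b.join();
    }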

> 
> It's unfortunate that we are stuck with calling conventions that arose in a
> sequential milieu.
> 

And, in turn, with zillions of things relying on them (like
thread-locals). Given all this, it is not at all clear to me that
adding support for continuation-passing style is a big win unless
you fully isolate it from other thread-based programming.
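As one tiny illustration of the thread-local point (the names here are
purely illustrative): a lot of existing code quietly assumes that the
statements before and after a spawn/join execute on the same thread,
which a blocking join preserves but a steal-parent or continuation-based
scheduler does not:

    #include <future>

    thread_local int depth = 0;   // e.g. a per-thread recursion counter

    static int work(int x) { return x + 1; }

    int parent(int x) {
        ++depth;                              // touch per-thread state
        std::future<int> child = std::async(work, x);
        // With a blocking join the same `depth` is still visible here,
        // because the continuation runs on the spawning thread.  Under
        // a steal-parent scheduler, everything after the spawn may
        // resume on a different worker, whose `depth` is unrelated.
        int r = child.get() + depth;
        --depth;
        return r;
    }

    int main() { return parent(1) == 3 ? 0 : 1; }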

-Doug
