[cpp-threads] Asynchronous Function Proposal

Robison, Arch arch.robison at intel.com
Wed Jun 3 02:09:23 BST 2009


You have to pick your poison.  Cilk and TBB make different choices on that.  I call them "steal parent" versus "steal child".  

In Cilk, spawning a child task causes the current thread to immediate execute the child, and the parent is put in a suspended state where it can be stolen by an idle thread.

In TBB, spawning a child task puts the child in a suspended state where it can be stolen by another thread, and the parent continues execution.  TBB can simulate Cilk "steal parent if the programmer writes in explicit continuation-passing style, which is not easy.

The big wins from "steal parent" in Cilk are:
* Threads never wait for completion of tasks, because the thread that finishes the last child picks up execution of the joining parent.
* Total space is (number of threads)*(space for sequential execution)
The big drawback of Cilk is that is requires a non-standard calling convention that has cactus stacks and decouples stacks from threads.  As Hans points out, "steal parent" causes weird things to happen with existing implementations of thread-local storage.

TBB's "steal child" avoids the drawbacks.  Thread local storage works and standard calling conventions can be employed.  But this comes at the expense of not having the Cilk guarantees on space and incurring waiting.  TBB uses heuristics to control space demands, and the waiting sometimes artificially constraints parallelism.  These heuristics usually work, but like any heuristics, they occasionally bite us.  TBB also sometimes launches tasks on threads that are executing a waiting task, which helps with efficiency in common cases, but has its own risks.  There's no free lunch here.

I suggest that for anyone prototyping the proposed asynchronous call functionality, that they try the parallel_sum example over 10^9 values.  That yields a recursive tree of about 10^6 nodes, with a height of 20.  With Cilk, the space demands will be on the order of about P*(20 nodes).  In TBB without the heuristics, it could go quadratic in the tree depth, e.g. about 400 nodes.  The TBB heuristics keep it down to P*(20 nodes), but by sometimes limiting parallelism.  With any prototype of the proposal, I would want to know how much space does it consume, and how well does it scale?

It's unfortunate that we are stuck with calling conventions that arose in a sequential milieu.

- Arch

-----Original Message-----
From: Boehm, Hans [mailto:hans.boehm at hp.com] 
Sent: Tuesday, June 02, 2009 7:08 PM
To: C++ threads standardisation; Lawrence Crowl; Bjarne Stroustrup
Cc: Pablo Halpern; Robison, Arch; Doug Lea
Subject: RE: Asynchronous Function Proposal 

[Adding Pablo, Arch, and Doug:  This is a continuing discussion of Lawrence Crowl's proposal for a simple C++ facility to support asynchronous function call execution.  Reattaching Lawrence's proposal for newcomers.]

> -----Original Message-----
> From: Herb Sutter
> 
> > But I'm still worried about whether we have enough of the details 
> > right here that this facility will actually be usable.  I 
> don't have 
> > the necessary experience that I feel I can evaluate a 
> design.  As far 
> > as I can tell:
> > 
> > - I believe this doesn't support a Cilk-like (or TBB-like?) 
> > implementation.  I believe Cilk effectively allows colder 
> call frames 
> > to be stolen by another thread, for example.
> [...]
> 
> A primary reason I've been advocating an "async()" facility, 
> besides as a way to make async calls easy, was specifically 
> to make sure that an async() with a default "could execute on 
> this or another thread" policy could be implemented to run on 
> a work stealing implementation, such as Cilk, TBB, or our own 
> TPL/PPL/ConcRT. I haven't seen anything that would prevent 
> that -- all you need to say is exactly "could execute on this 
> or another thread" and IIUC that's all we need.
I'm not an expert on this stuff.  (Copying Pablo in case he can help and hasn't seen this.) But my understanding is that when Cilk is about to execute a spawned task, it pushes a work item on the task queue that represents the parents' continuation.  That pushed item may get stolen by another worker thread.  Thus the parent thread may appear to have migrated between threads.  My vague intuition is that this is necessary if you want to both:

- Efficiently execute child tasks on the parent thread in the common case, and
- Be reasonably sure that when you steal a task, you're likely something reasonably large, from closer to the root of the tree.

If I understand this correctly, this doesn't mesh particularly well with a bunch of other thread facilities in the current draft, e.g. thread-local variables.  (There are more details on Cilk in their PLDI 98 paper http://supertech.csail.mit.edu/papers/cilk5.pdf .)

TBBs tasks are more complicated to use, since they don't rely on language support.  They may avoid this kind of issue; I'm not sure. (Copying Arch in case he can help.)

My impression is that for both Cilk and TBB, interactions with other C++0x thread facilities haven't really been worked out.  (Not to surprising, since we're having enough problems with interactions between C++0x facilities, not to mention weird interactions within older systems we're building on.)

Hans




More information about the cpp-threads mailing list