[cpp-threads] Asynchronous Function Proposal

Robison, Arch arch.robison at intel.com
Thu Jun 4 02:51:13 BST 2009


I agree with Hans' summary of the situation, except for the bit about "only spawns one of the subcomputations".  I've interspersed my comments below.

>-----Original Message-----
>From: cpp-threads-bounces at decadentplace.org.uk [mailto:cpp-threads-
>bounces at decadentplace.org.uk] On Behalf Of Boehm, Hans
>Sent: Wednesday, June 03, 2009 5:53 PM
...
>I just want to be sure we're not designing a white elephant here.  If we
>include something that inherently works less well than Cilk or TBB (or
>Microsoft's library) for nested parallelism, I suspect that those
>alternatives will be used instead for most applications.  If that's the
>case, I think we either shouldn't bother at all, or limit ourselves to the
>simplest possible (always runs in another thread) variant.

It is important to get this right.  It's impractical to build good higher level concurrency/parallelism on top of a foundation that is poor.  It seems to me that the issue really is parallelism, because if there were no multi-core processors, I doubt that the issue would be getting nearly as much attention as it is getting.

>- We have a pretty clear answer that we cannot handle a Cilk-style
>implementation without a lot more work; stealing callers effectively means
>that tasks have to migrate between threads, implying that you may lose your
>thread-locals, etc.  

There is, in theory, a way to get the effect of "steal parent" without migrating callers: for P hardware threads, use P^2 software threads, carefully managed so that only P are running at any moment.  Choosing which subset to run is the tricky part, and the overhead for spawning is probably not cheap.  The basic gimmick is that "spawn" is implemented by suspending the parent thread and running the child on a different thread; "steal parent" is implemented by resuming the parent thread.  The hard part is reusing suspended threads to do other work in just the right way to get the P^2 bound while retaining the Cilk properties.  I don't know of any practical implementations of this theoretical result, but it might provide some useful leverage for implementations, particularly for small P.
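To make the suspend/resume mechanics concrete, here is a minimal C++ sketch (all names hypothetical, and much simplified: it spawns a fresh software thread per call rather than managing a bounded pool of them):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical sketch of "spawn suspends the parent": spawn() runs the
// child on a different software thread and blocks the parent; "steal
// parent" is then just resuming the suspended parent via steal_parent().
// A real scheme would reuse suspended threads to stay within the P^2
// bound; here every spawn naively creates a new thread.
class ParentSuspendingSpawn {
public:
    void spawn(std::function<void()> child) {
        std::thread worker(std::move(child));           // child runs elsewhere
        {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return resumed_; });  // parent suspended
            resumed_ = false;                           // allow reuse
        }
        worker.join();
    }

    // Invoked by whichever worker "steals the parent".
    void steal_parent() {
        { std::lock_guard<std::mutex> lk(m_); resumed_ = true; }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool resumed_ = false;
};
```

In this toy version the child itself would call steal_parent() when the parent's continuation becomes available to run.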

>If I understand correctly, committing to not doing
>that is fine, since code has to be written somewhat differently in the two
>styles.  For example, a fib function that only spawns one of the
>subcomputations and runs the other inline is potentially parallel in Cilk,
>but not the others.  Thus a standard that left the execution method
>completely unspecified seems to give too little guidance anyway.

The "steal child" systems do permit parallelism in that situation.  The child runs concurrently while the parent does the inline evaluation.
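As a rough C++ approximation (std::async with its default launch policy, which lets the implementation defer the child until get(), stands in here for the spawn primitive under discussion):

```cpp
#include <future>

// fib spawns only one subcomputation and evaluates the other inline.
// Under "steal child" semantics the spawned child is available to run
// concurrently while this thread computes fib(n-2).  std::async is
// only an approximation of the proposed facility.
long fib(long n) {
    if (n < 2) return n;
    std::future<long> child = std::async(fib, n - 1);  // may run elsewhere
    long other = fib(n - 2);                           // runs inline here
    return child.get() + other;
}
```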

There are definitely differences in how things can be written.  In Cilk, it is okay to write something like:
	for( int i=0; i!=HUGE; ++i )
	    cilk_spawn f(i);
and yet it will not create a HUGE number of tasks simultaneously (indeed, for P cores, it will have only P tasks in existence simultaneously).  With "steal child" systems, the implementation must do some kind of throttling as mentioned below.
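One way a "steal child" implementation might throttle such a loop is sketched below (single-threaded, with illustrative names; a real runtime would use per-worker stealing deques with atomics or locks):

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// "Last chance" throttling sketch: spawn() defers work into a bounded
// deque, where a thief could take it; once the bound is reached,
// further spawns run inline immediately, so a loop spawning HUGE tasks
// never has more than max_pending deferred tasks in existence.
class ThrottlingPool {
public:
    explicit ThrottlingPool(std::size_t max_pending)
        : max_pending_(max_pending) {}

    void spawn(std::function<void()> task) {
        if (pending_.size() < max_pending_)
            pending_.push_back(std::move(task));  // available for stealing
        else
            task();                               // out of space: run now
    }

    // Drain the deferred tasks (what thieves, or the parent at a sync
    // point, would otherwise do).
    void wait_all() {
        while (!pending_.empty()) {
            std::function<void()> t = std::move(pending_.front());
            pending_.pop_front();
            t();
        }
    }

private:
    std::size_t max_pending_;
    std::deque<std::function<void()>> pending_;
};
```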

>- We could possibly use an implementation like TBB or Doug's Java one?  

I think this is doable, though given some of the differences in the spec from TBB (particularly with respect to thread reuse), I'm uncertain what the performance characteristics will be.  A prototype would be needed for that.

>I think this REQUIRES Lawrence's original formulation in which the decision
>to run a task in the same thread is postponed until the result is needed,
>so that it's available for stealing in the interim.  I don't see how to
>make this work if the subtask is run at the point of call.

Right, deciding at the point of call is generally a poor policy.  Strictly requiring that the decision be postponed until the result is needed also creates problems, because the system might need to force evaluation if it runs out of space to store spawned tasks.  Of course the thread can block until another thread steals a task, but then the thief may face the same problem.  

I would suggest permitting latitude to either postpone the decision to run on the same thread until the result is used (thus permitting theft), or immediately evaluate the call on the same thread (as a "last chance" way to throttle space).  
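In code terms, that latitude might look like the following single-threaded sketch (names hypothetical; a real implementation would need atomics to coordinate the spawning thread with thieves):

```cpp
#include <functional>
#include <utility>

// The decision to run the task on the spawning thread is postponed
// until the result is needed, so it stays available for stealing in
// the interim; the runtime may also force it early as a "last chance"
// way to reclaim task space.
template <typename T>
class DeferredTask {
public:
    explicit DeferredTask(std::function<T()> fn) : fn_(std::move(fn)) {}

    // A thief, or the runtime when it runs out of space for spawned
    // tasks, may evaluate the task early.
    bool try_steal() {
        if (done_) return false;
        run();
        return true;
    }

    // The spawning thread decides only here: if nobody stole the task,
    // it is evaluated on this thread.
    T get() {
        if (!done_) run();
        return result_;
    }

private:
    void run() { result_ = fn_(); done_ = true; }

    std::function<T()> fn_;
    T result_{};        // requires default-constructible T in this sketch
    bool done_ = false;
};
```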

>- One difference between these two is that the latter probably requires the
>programmer to more explicitly manage granularity.  Cilk naturally tends to
>run small tasks near the leaves as function calls, minimizing overhead for
>small granularity.  It still seems to me that the other alternatives are
>probably less robust against too small a granularity?

It is true that "steal parent" has better granularity properties, though my impression is that it's still not so fine that programmers can always ignore granularity issues.  Granularity seems to be an inherent issue in parallel programming in general.  For that matter, it's inherent in serial programming.

- Arch Robison



