[cpp-threads] Asynchronous Function Proposal

Robison, Arch arch.robison at intel.com
Wed Jun 3 15:47:31 BST 2009


[FYI, I'm on the mailing list, but I'm out of office until Friday, so my replies have long latency.]

I agree that having a benchmark that avoids compiler optimizations that go beyond O(1) improvements is important.  I would drop Fib the moment I suspected a compiler was linearizing it.  (For that matter, I wrote an FP interpreter a long time back that would linearize it via memoization.)  Doug's numerical integration example looks like a good benchmark for this purpose.
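(For concreteness, memoization along these lines -- a minimal sketch, not the interpreter's actual mechanism; names are illustrative:)

```cpp
#include <vector>

// Naive doubly recursive Fibonacci: O(2^n) calls.
unsigned long fib_naive(unsigned n) {
    return n < 2 ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

// Memoized version of the same recurrence: each fib(k) is computed
// once and cached, so the whole computation is O(n).  This is the
// transformation an interpreter (or a sufficiently clever compiler)
// can apply automatically, which is what spoils fib as a benchmark.
unsigned long fib_memo(unsigned n, std::vector<unsigned long>& memo) {
    if (n < 2) return n;
    if (memo[n] != 0) return memo[n];
    return memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo);
}
```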

- Arch  

-----Original Message-----
From: cpp-threads-bounces at decadentplace.org.uk [mailto:cpp-threads-bounces at decadentplace.org.uk] On Behalf Of Doug Lea
Sent: Wednesday, June 03, 2009 7:46 AM
To: C++ threads standardisation
Cc: Pablo Halpern; Bjarne Stroustrup; Lawrence Crowl
Subject: Re: [cpp-threads] Asynchronous Function Proposal

Andrei Alexandrescu wrote:
> Robison, Arch wrote:
>> Instead of parallel_sum, the usual naive parallel Fibonacci
>> recurrence would make a better benchmark.  On typical machines,
>> parallel_sum runs out of memory bandwidth.  Naive parallel Fibonacci
>> makes a good benchmark for recursive tasking because it is not sensitive to
>> machine bandwidth and generates unbalanced trees of work.

Fibonacci has a problem as a benchmark: optimizing
compilers are allowed to discover the recurrence and
produce an O(n) (not O(2^n)) solution.
I've seen some compilers do the beginnings of this.
Just unfolding one layer linearizes the fib(2) calls,
which makes results of parallel versions hard to interpret.
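(The benchmark in question has roughly this shape -- sketched here with std::async as it later appeared in C++11, which is an assumption, not the proposal's actual API; the sequential cutoff value is arbitrary:)

```cpp
#include <future>

// Naive doubly recursive Fibonacci, run sequentially below the cutoff
// so that task-creation overhead doesn't swamp the tiny per-call work.
long fib_seq(long n) {
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Parallel version: fork one branch as an asynchronous task, recurse
// on the other in the current thread, then join and combine.
long fib_par(long n, long cutoff) {
    if (n < cutoff) return fib_seq(n);
    std::future<long> left =
        std::async(std::launch::async, fib_par, n - 1, cutoff);
    long right = fib_par(n - 2, cutoff);
    return left.get() + right;
}
```

The unbalanced recursion tree (the n-1 branch is much deeper than the n-2 branch) is exactly what makes it a useful stress test for recursive tasking -- unless the compiler linearizes it first.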

When trying to do some cross-language/compiler/framework
comparisons once, I threw together a stripped down
divide-and-conquer numerical integration test.
I just put some versions at
   http://gee.cs.oswego.edu/dl/code/integrateTests/
Numerical integration has the same overall virtues
as fibonacci -- no big data arrays, and quick (although
not as quick) per-call computations -- but is less amenable
to compiler optimizations that make results uninterpretable
when comparing across languages, compilers, and frameworks.
(The integrate.cpp version includes an optional comparison of
a partial-accumulation loop vs pure recursion to help determine
whether compiler-based unfolding is making a big contribution.
It seems not to be, with gcc or suncc.)

It would be better yet to use a function without a
known analytic integral, but doing this adds enough
per-task overhead to cloud results, so this one just
integrates a cubic function.
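(For readers without the tarball, a minimal sketch of the divide-and-conquer shape -- this is not the actual integrate.cpp; the adaptive error test and tolerance handling here are my own assumptions:)

```cpp
#include <cmath>

// f(x) = x^3; its analytic integral over [0,1] is 1/4, which makes
// results easy to check -- the reason for choosing a cubic.
double cubic(double x) { return x * x * x; }

// Recursive divide-and-conquer trapezoidal integration.  Split the
// interval at its midpoint; if the refined two-piece estimate agrees
// with the one-piece estimate to within eps, accept it, otherwise
// recurse on both halves.  Function values at the endpoints are
// passed down so each point is evaluated only once.
double integrate(double (*f)(double), double l, double r,
                 double fl, double fr, double eps) {
    double m      = 0.5 * (l + r);
    double fm     = f(m);
    double whole  = 0.5  * (r - l) * (fl + fr);
    double halves = 0.25 * (r - l) * (fl + 2.0 * fm + fr);
    if (std::fabs(halves - whole) <= eps)
        return halves;
    return integrate(f, l, m, fl, fm, eps * 0.5)
         + integrate(f, m, r, fm, fr, eps * 0.5);
}
```

The two recursive calls are independent, so this is the natural fork point for a parallel version.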

I didn't include a c++0x parallel version. Or TBB.

The Java version needs jsr166y.jar (containing
upcoming ForkJoin stuff) available at
http://gee.cs.oswego.edu/dl/concurrency-interest/index.html
http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166y.jar
BTW, as is almost always the case, the version (run with "d")
that uses demand-sensing to decide whether to fork or recurse
is generally fastest.

-Doug

--
cpp-threads mailing list
cpp-threads at decadentplace.org.uk
http://www.decadentplace.org.uk/cgi-bin/mailman/listinfo/cpp-threads
