[cpp-threads] Visibility question

Herb Sutter hsutter at microsoft.com
Wed Aug 2 20:23:34 BST 2006


I also think Hans' example should be a race, and I'll explain why while
responding to this other subthread below:

Hans wrote:
> > From:  Peter Dimov
> > Do you know of an architecture/platform where <ordered>
> > atomics aren't (going to be) fences?

I assume this question is asking whether being a fence is a global
property (i.e., two fences in two threads cause some synchronization
between those threads), as opposed to applying only to a write-read of
the same variable?

My current bias is that only a write-read of the same variable should
induce an ordering edge, and that two threads that read and write
disjoint sets of variables should have no relative ordering. My "gut
feel" is that this local vs. global distinction is a fundamental
property you'd want, or need, for good composability. Of course it
doesn't matter in the uniprocessor case, but it will matter increasingly
with the number of cores, and I suspect global properties like this will
be increasingly unlikely to be supported by real hardware as core counts
grow. The headaches start even at low counts, as soon as 4 and 8 cores,
where designers will naturally start to think about private shared
caches for pairs or quads of cores.
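
To make the write-read pairing concrete, here is a minimal sketch in the
atomics notation that later became C++11 (which postdates this thread,
so take it as illustration only). The release store and acquire load of
the same variable `flag` are what create the ordering edge; nothing here
implies any ordering between threads touching disjoint variables:

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int>  data{0};
    std::atomic<bool> flag{false};

    void writer() {
        data.store(42, std::memory_order_relaxed);
        flag.store(true, std::memory_order_release);  // release store
    }

    void reader() {
        // The acquire load pairs with the release store of the *same*
        // variable, inducing the ordering edge.
        while (!flag.load(std::memory_order_acquire)) { /* spin */ }
        assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }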

I would be very interested in understanding any arguments for why
threads' reads/writes of different variables should imply any ordering
between the threads, if that's what is being proposed. Hardware might or
might not happen to do it (today), and I doubt hardware can be relied
upon to do it tomorrow. But most of all, I'm not sure why it should be a
guarantee the programmer is allowed to rely upon... my "gut feel" again
is that any code that is correct iff we give that guarantee will at best
be extremely fragile, and more likely the programmer doesn't actually
understand the code and/or didn't intend to write it in the first place
(e.g., forgot a lock).
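
The canonical example of code whose correctness hinges on such a global
guarantee is independent-reads-of-independent-writes (IRIW). A sketch,
again in later C++11-style notation and purely illustrative:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2, r3, r4;

    void w1() { x.store(1, std::memory_order_release); }
    void w2() { y.store(1, std::memory_order_release); }

    void reader1() {
        r1 = x.load(std::memory_order_acquire);
        r2 = y.load(std::memory_order_acquire);
    }

    void reader2() {
        r3 = y.load(std::memory_order_acquire);
        r4 = x.load(std::memory_order_acquire);
    }

    int main() {
        std::thread a(w1), b(w2), c(reader1), d(reader2);
        a.join(); b.join(); c.join(); d.join();
        // With only pairwise write-read edges, the outcome
        //   r1==1 && r2==0 && r3==1 && r4==0
        // is permitted: the two readers may observe the two independent
        // writes in opposite orders. Forbidding it requires exactly the
        // kind of global ordering property questioned above.
    }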


> I don't know of any such hardware platforms.
>
> They probably wouldn't be fences on a software DSM platform.  (There
> are arguments about whether we should care.  Given that memory
> latencies and minimum packet latencies over a network seem to be
> getting surprisingly close, I'd be inclined to say that we should.)
>
> The other general concern, recently also expressed to me by David
> Callahan, is that under the right conditions, it should be possible
> for the compiler to merge excessively fine-grained threads, and
> remove the synchronization between them.  This argues that
> synchronization operations should not have global implications beyond
> the threads that access the synchronization object.

This is going to be an essential property. In particular, I believe
that most successful next-generation concurrency programming models will
rely on letting the programmer express lots of latent concurrency that
can be efficiently mapped down to the actual number of cores (else
CPU-bound apps won't be able to scale with hardware). Some of this will
be done using regular SIMD styles like OpenMP and map-reduce that hide
the details, but some will come from explicitly expressing lots of
small, distinct asynchronous tasks (e.g., fine-grained MIMD, active{}
blocks that could be as small as active{ new T }, the fine-grained
concurrency in pure functional languages). It is important that the
system be able to combine these to run well on low-core systems and let
them run more independently on high-core systems. If we can't combine
them easily, we'll have to somehow tell programmers not to write
fine-grained asynchronous operations, for some definition of
"fine-grained" that will be hard for programmers to get right.
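
The active{} blocks above are a hypothetical notation. As a rough
analogue, here is a sketch using std::async from later-standard C++,
whose default launch policy (async | deferred) explicitly leaves the
runtime free to run each small task on its own thread or to fold it
into the calling thread on demand:

    #include <future>
    #include <iostream>
    #include <vector>

    // Stand-in for a tiny unit of work, analogous to a small active{} task.
    int square(int v) { return v * v; }

    int main() {
        // Express lots of latent concurrency: 1000 small tasks.
        std::vector<std::future<int>> tasks;
        for (int i = 0; i < 1000; ++i)
            // The default policy is async | deferred: the implementation
            // may run the task on a new thread (high-core machine) or
            // defer it to run lazily in the calling thread at get() time
            // (low-core machine).
            tasks.push_back(std::async(square, i));

        long long total = 0;
        for (auto& t : tasks)
            total += t.get();
        std::cout << total << '\n';
    }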

Herb



