[cpp-threads] Alternatives to SC

Fri Jan 12 20:53:36 GMT 2007

> -----Original Message-----
> From: Paul E. McKenney [mailto:paulmck at linux.vnet.ibm.com] 
> 
> This is in fact what I understood.  I have deep reservations 
> about legislating SC even for specially marked atomics.
> 
> To see the basis of my concerns, please consider a 
> generalization of the example that Peter Dimov recently 
> posted to this thread (with your
> corrections) from 4 threads to N+2 threads.  Please 
> especially consider the case where the two readers are 
> running on different dies, but each sharing dies with 
> different writers.
As Sarita has pointed out in other discussions, this basically requires
that a reader not return another processor's write until all processors
have seen the write.  Many current cache coherence protocols, including
NUMA ones, do ensure this.  We have also seen a bunch of papers arguing
that you can largely hide any additional latencies from this sort of
thing with sufficient hardware speculation.  And, at least
theoretically, this only has to apply to specially marked loads and
stores.
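
For concreteness, the four-thread case of the example under discussion
(two independent writers, two readers) might be sketched as follows.
This is only an illustration, not code from the thread: it assumes an
atomics interface along the lines of what was later standardized as
std::atomic, and the names run_iriw_once, readers_disagree, and
IriwResult are made up for the sketch.  With sequentially consistent
loads and stores, the two readers must not see the two writes in
opposite orders.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names throughout; x and y are the two independent variables.
struct IriwResult { int r1, r2, r3, r4; };

// Runs the four-thread test once: two writers, two readers.
IriwResult run_iriw_once() {
    std::atomic<int> x{0}, y{0};
    IriwResult res{0, 0, 0, 0};

    std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread rd1([&] {               // reads x, then y
        res.r1 = x.load(std::memory_order_seq_cst);
        res.r2 = y.load(std::memory_order_seq_cst);
    });
    std::thread rd2([&] {               // reads y, then x
        res.r3 = y.load(std::memory_order_seq_cst);
        res.r4 = x.load(std::memory_order_seq_cst);
    });

    w1.join(); w2.join(); rd1.join(); rd2.join();
    return res;
}

// True if reader 1 saw x written before y while reader 2 saw y written
// before x.  Sequential consistency forbids this outcome.
bool readers_disagree(const IriwResult& r) {
    return r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0;
}
```

Repeated runs can only show that the forbidden outcome was not
observed; whether a given machine rules it out is exactly the hardware
question raised above.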

I will certainly admit that I don't fully understand the hardware issues
here.  But I haven't seen particularly clear, confident, or consistent
answers from the experts as to the costs involved.

> 
> > We do clearly have sufficient hardware support to support such as SC
> > atomics adequately on some platforms.  (Amusingly, Alpha is probably
> > one of those?) We have issues, or at least lack of clarity, on others.
> 
> I don't have access to an Alpha to test this out.  Can Alpha 
> really be made to fully order independent writes to 
> independent variables across all CPUs?  Running Peter's 
> example requires four CPUs, and is greatly eased given a 
> fine-grained low-overhead time source that is synchronized 
> across all CPUs.
I believe Alpha in fact guarantees a total order on all stores, and the
example can be made to work with sufficient fences.  I didn't explore
this in a lot of depth, and it's not all that relevant in my mind.  (For
C++, we're talking about a late 2009 standard, at best.)

> 
> I believe that I have a couple examples thus far from a 
> couple of companies of current machines (with different major 
> CPU families) that do -not- provide SC atomics without great 
> pain (e.g., having the compiler intuit where to insert 
> locking), but am still tracking this down.
> I may have to bite the bullet and wean my test program off of 
> its current dependence on the aforementioned time source.  :-(
> 
> Finally, a word of caution.  A number of HW architects I have 
> talked to thus far did not initially appreciate SC's 
> requirement for ordering of independent writes from 
> independent processors.  You might want to double-check your 
> conversations in case the HW architects you were talking to 
> are also missing this twist of SC.
Several of us have had very explicit discussions about this issue.

> 
> > (As far as I know, there even seem to be a few platforms that provide
> > full SC for all loads and stores, namely current hardware PA-RISC
> > processors, and possibly MIPS.  This suggests to me that at least
> > empirically the hardware tradeoffs are not obvious.  But I don't think
> > anyone wants to restrict us to fully SC hardware, certainly not I.)
> 
> But isn't it the case that all companies making 
> multiprocessor machines based on PA-RISC and MIPS are 
> transitioning to other CPU types, and have been for quite a few years?
I suspect we will see multiprocessor MIPS machines again in the
embedded space.  Otherwise, yes.  But in both of those cases, they're
transitioning to Itanium, which also provides a total order for release
stores, and can fairly easily implement SC atomics.  (I did have a long
discussion of this with one of the Itanium architects, and the decision
was quite intentional.)

I haven't looked at the statistics, but it seems to me that a large
fraction of the largest shared memory systems use either SPARC or
Itanium processors, both of which effectively provide the property we're
after.

> 
> At first glance, Doug Lea's proposal looks pretty good to me, 
> though I certainly cannot claim to fully understand it yet.  
> As I currently understand it, his model captures the most 
> valuable aspects of SC while imposing only a minimal 
> straightjacket for compiler writers and CPU/system designers.
> 
> What problems does his proposal pose from your viewpoint?
> 
I need to reread it, which won't happen until next week.  (Thanks
for posting the update, Doug.) My main concern is that sequential
consistency for race-free programs is understandable by most
programmers, and the alternative characterizations are not.

I entirely agree with you and Doug that whether or not we guarantee
sequential consistency for independent reads of independent writes
almost never matters in practice.  But I think that's only part of the
issue.  If we want more programmers to be able to write reasonably
correct multithreaded code, we need a consistent story that's easy to
teach.  Based on what I've seen so far, many programmers are hopelessly
confused when it comes to threads, in large part because they've been
taught rules that either don't make sense or are too complicated.  I'm
not yet convinced that's fixable with the non-SC approach.

We clearly still get SC behavior for locks.  But there seems to be lots
of empirical evidence that that's not quite enough.  The various
experiences with double-checked locking are probably the strongest
example of several.
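
As an illustration of the double-checked locking case, here is a
minimal sketch of the idiom written with a sequentially consistent
atomic pointer.  It assumes the std::atomic and std::mutex interfaces
that were standardized later; the names LazyConfig and Config are
hypothetical.  The well-known failures arise when the pointer is an
ordinary variable, so that the unlocked first check races with the
initializing store.

```cpp
#include <atomic>
#include <mutex>

struct Config { int value = 42; };  // hypothetical payload

class LazyConfig {
public:
    // Returns the shared Config, creating it on first use.
    Config* get() {
        Config* p = ptr_.load();            // first check, no lock (seq_cst)
        if (p == nullptr) {
            std::lock_guard<std::mutex> guard(mu_);
            p = ptr_.load();                // second check, under the lock
            if (p == nullptr) {
                p = new Config();
                ptr_.store(p);              // publish the fully built object
            }
        }
        return p;
    }

    ~LazyConfig() { delete ptr_.load(); }

private:
    std::atomic<Config*> ptr_{nullptr};
    std::mutex mu_;
};
```

With SC atomics, the reasoning a programmer needs is just the obvious
interleaving argument: a thread that sees a non-null pointer also sees
the object it points to fully constructed.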

I do think that this is a sufficiently long-term and important solution
that we should think about where we really want to end up, from both a
software programmability and hardware performance perspective.  (We may
then need to worry about how to get there from here ...)

Hans