[cpp-threads] Alternatives to SC

Paul E. McKenney paulmck at linux.vnet.ibm.com
Fri Jan 12 22:51:46 GMT 2007


On Fri, Jan 12, 2007 at 02:53:36PM -0600, Boehm, Hans wrote:
> > -----Original Message-----
> > From: Paul E. McKenney [mailto:paulmck at linux.vnet.ibm.com] 
> > 
> > This is in fact what I understood.  I have deep reservations 
> > about legislating SC even for specially marked atomics.
> > 
> > To see the basis of my concerns, please consider a 
> > generalization of the example that Peter Dimov recently 
> > posted to this thread (with your
> > corrections) from 4 threads to N+2 threads.  Please 
> > especially consider the case where the two readers are 
> > running on different dies, but each sharing dies with 
> > different writers.
>
> As Sarita has pointed out in other discussions, this basically requires
> that a reader not return another processor's write until all processors
> have seen the write.  Many current cache coherence protocols, including
> NUMA ones, do ensure this.  We have also seen a bunch of papers arguing
> that you can largely hide any additional latencies from this sort of
> thing with sufficient hardware speculation.  And, at least
> theoretically, this only has to apply to specially marked loads and
> stores.

Which has the effect of preventing local-speed interaction of a pair
of tasks sharing the same CPU core (or die or NUMA node or whatever).
Yes, in some cases, you -can- hide the latency, but if the operation is
on the critical path for an aggressive realtime workload, life is hard.
(By "aggressive", I mean deadlines in the tens of microseconds for
operations that include scheduling latency, IPC, etc.)
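
For concreteness, the pattern at issue is the "independent reads of
independent writes" litmus test.  A minimal sketch in the sort of
atomics syntax being proposed (my reconstruction, not Peter's exact
posting; the names are mine) looks like this:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0), y(0);
int r1a, r1b, r2a, r2b;

void writer_x() { x.store(1); }                      /* SC store */
void writer_y() { y.store(1); }                      /* SC store */
void reader_1() { r1a = x.load(); r1b = y.load(); }  /* SC loads */
void reader_2() { r2a = y.load(); r2b = x.load(); }  /* SC loads */

int main()
{
	std::thread t1(writer_x), t2(writer_y), t3(reader_1), t4(reader_2);
	t1.join(); t2.join(); t3.join(); t4.join();

	/* Under SC, the two readers must agree on the order in which
	   the two independent writes became visible, so this outcome
	   is forbidden: */
	assert(!(r1a == 1 && r1b == 0 && r2a == 1 && r2b == 0));
	return 0;
}

Of course, a single run proves nothing -- one would loop the four
threads many times, on four separate CPUs, to have any hope of
catching the forbidden outcome.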

Again, full-up SC requires that -every- atomic write-to-read interaction
incur a delay corresponding to the electrical diameter of the full
computer system -- even if the reader and the writer are on different
hardware threads within the same core.

This is a Very Bad Thing, especially for large systems.

In contrast, my admittedly incomplete understanding of Doug Lea's proposal
leads me to believe that an atomic write-to-read interaction between
a pair of hardware threads within the same core would incur a delay
corresponding to only the electrical diameter of that core -- which is
much smaller than that of the full computer system, particularly for
large systems that have multiple dies, let alone even larger systems
that have multiple motherboards (though the big-system guys usually
don't call them motherboards).
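
To make the contrast concrete, here is a minimal sketch of the sort of
pairwise write-to-read handoff I have in mind, using release/acquire in
place of full SC.  (This is only an illustration of pairwise ordering --
it is not Doug's formalism, and the names are mine.)

#include <atomic>

int payload;                            /* ordinary data */
std::atomic<bool> ready(false);

void producer()                         /* hardware thread 0 of a core */
{
	payload = 42;
	ready.store(true, std::memory_order_release);
}

void consumer()                         /* hardware thread 1, same core */
{
	while (!ready.load(std::memory_order_acquire))
		continue;               /* spin until the flag is set */

	/* payload is guaranteed to be 42 here, yet no other CPU in
	   the system need have observed the store to "ready". */
}

The ordering obligation here is only between this pair of threads, so
the handoff can complete at core-local speed.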

> I will certainly admit that I don't fully understand the hardware issues
> here.  But I haven't seen particularly clear, confident, or consistent
> answers from the experts as to the costs involved.

Part of the problem may be that different hardware designers have
different concerns.  A hardware designer focused solely on a multithreaded
single-core CPU is likely to have absolutely no problem implementing
full SC, and for some designs might even get SC for free.  A hardware
designer doing a single-die multi-core CPU will pay a penalty for SC,
but depending on the number of cores and on the layout of the chip,
this penalty might be nonfatal, at least for non-realtime workloads.
Larger systems requiring multiple dies on an MCM or even multiple circuit
boards with interconnects are going to have a -lot- more trouble with SC.

Of course, it has become popular of late to assert that all systems
will eventually have huge numbers of CPUs on a single die.  However,
my experience indicates that the endgame will instead be dies with a
modest number of CPUs, but also including all the other components
required to make a viable computer system (I/O, memory, cache) --
inexorable economics will continue to drive the big-system designers to
build with commodity components, and the same economics will continue
to drive commodity systems to cram everything on as few chips as possible.

I believe that this future has sufficient probability that it would
be unwise to exclude it from the standard.

> > > We do clearly have sufficient hardware support to support such
> > > SC atomics adequately on some platforms.  (Amusingly, Alpha is
> > > probably one of those?)  We have issues, or at least lack of
> > > clarity, on others.
> > 
> > I don't have access to an Alpha to test this out.  Can Alpha 
> > really be made to fully order independent writes to 
> > independent variables across all CPUs?  Running Peter's 
> > example requires four CPUs, and is greatly eased given a 
> > fine-grained low-overhead time source that is synchronized 
> > across all CPUs.
>
> I believe it in fact guarantees a total order on all stores, and the
> example can be made to work with sufficient fences.   I didn't explore
> this in a lot of depth, and it's not all that relevant in my mind.  (For
> C++, we're talking about a late 2009 standard, at best.)

Understood on the timing.  I honestly don't know if DEC's large Alpha
systems kept to those guarantees.  I -do- know that it is much easier
to force non-SC failure in larger systems than in smaller ones.

> > I believe that I have a couple examples thus far from a 
> > couple of companies of current machines (with different major 
> > CPU families) that do -not- provide SC atomics without great 
> > pain (e.g., having the compiler intuit where to insert 
> > locking), but am still tracking this down.
> > I may have to bite the bullet and wean my test program off of 
> > its current dependence on the aforementioned time source.  :-(
> > 
> > Finally, a word of caution.  A number of HW architects I have 
> > talked to thus far did not initially appreciate SC's 
> > requirement for ordering of independent writes from 
> > independent processors.  You might want to double-check your 
> > conversations in case the HW architects you were talking to 
> > are also missing this twist of SC.
>
> Several of us have had very explicit discussions about this issue.

With architects responsible for large systems?  SGI?  IBM?

> > > (As far as I know, there even seem to be a few platforms that
> > > provide full SC for all loads and stores, namely current hardware
> > > PA-RISC processors, and possibly MIPS.  This suggests to me that
> > > at least empirically the hardware tradeoffs are not obvious.  But
> > > I don't think anyone wants to restrict us to fully SC hardware,
> > > certainly not I.)
> > 
> > But isn't it the case that all companies making 
> > multiprocessor machines based on PA-RISC and MIPS are 
> > transitioning to other CPU types, and have been for quite a few years?
>
> I suspect we will see multiprocessor MIPS machines again in the
> embedded space.  Otherwise, yes.  But in both of those cases, they're
> transitioning to Itanium, which also provides a total order for release
> stores, and can fairly easily implement SC atomics.  (I did have a long
> discussion of this with one of the Itanium architects, and the decision
> was quite intentional.)
>
> I haven't looked at the statistics, but it seems to me that a large
> fraction of the largest shared memory systems use either SPARC or
> Itanium processors, both of which effectively provide the property we're
> after.

Does SGI's Altix really provide the Itanium guarantee of global ordering
on release operations on its 1024-CPU machines?

> > At first glance, Doug Lea's proposal looks pretty good to me, 
> > though I certainly cannot claim to fully understand it yet.  
> > As I currently understand it, his model captures the most 
> > valuable aspects of SC while imposing only a minimal 
> > straightjacket for compiler writers and CPU/system designers.
> > 
> > What problems does his proposal pose from your viewpoint?
> > 
> I need to reread it again, which won't happen until next week.  (Thanks
> for posting the update, Doug.) My main concern is that sequential
> consistency for race-free programs is understandable by most
> programmers, and the alternative characterizations are not.
> 
> I entirely agree with you and Doug that whether or not we guarantee
> sequential consistency for independent reads of independent writes
> almost never matters in practice.  But I think that's only part of the
> issue.  If we want more programmers to be able to write reasonably
> correct multithreaded code, we need a consistent story that's easy to
> teach.  Based on what I've seen so far, many programmers are hopelessly
> confused when it comes to threads, in large part because they've been
> taught rules that either don't make sense or are too complicated.  I'm
> not yet convinced that's fixable with the non-SC approach.
> 
> We clearly still get SC behavior for locks.  But there seems to be lots
> of empirical evidence that that's not quite enough.  The various
> experiences with double-checked locking are probably the strongest
> example of several.

Double-checked locking is certainly an example crying out for a
higher-level primitive.  It makes no more sense for programmers to
continually re-invent broken double-checked locking than it would for
them to continually re-invent broken binary-tree update algorithms,
wouldn't you agree?  Using appropriate abstractions will deliver much
better results -- in both usability and in performance -- than SC can
ever hope to.
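
For example, one-time initialization can simply be delegated to an
existing primitive such as pthread_once (boost::call_once makes the
same point).  A minimal sketch, with a purely illustrative singleton
type:

#include <pthread.h>

struct singleton { int value; };

static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static singleton *instance;

static void do_init()
{
	instance = new singleton();     /* runs exactly once, properly
	                                   ordered with respect to all
	                                   callers of get_instance() */
}

singleton *get_instance()
{
	/* The memory-ordering subtleties that break naive
	   double-checked locking are hidden inside pthread_once. */
	pthread_once(&init_once, do_init);
	return instance;
}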

> I do think that this is a sufficiently long term and important solution
> that we should think about where we really want to end up, from both a
> software programmability and hardware performance perspective.  (We may
> then need to worry about how to get there from here ...)  

It certainly is hard to argue with this last paragraph!  ;-)

							Thanx, Paul


