[cpp-threads] Alternatives to SC

Boehm, Hans hans.boehm at hp.com
Mon Jan 29 20:11:46 GMT 2007


> From:  Doug Lea
> 
> Hans Boehm wrote:
> 
> > 
> > 3) The current N2052 memory model proposal disallows all 
> data races on 
> > ordinary variables.
> 
> 
> It's taken me a while to see how the stance of not providing ANY
> semantics for programs with races makes the C++ memory model so much
> harder and more controversial than Java's.

Doug - 

I'm not at all sure that that's the core of the problem here.  We
do effectively require that all operations that may potentially be
involved in races be tagged.  Ordinary Java operations that may be
involved in races correspond to raw atomics in our discussion.

The details are different, in that we've tried to take advantage of that
distinction in order to avoid the Java "causality" discussion, which
doesn't seem to be tremendously popular.
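
To make the correspondence concrete, here is a minimal sketch.  I'm
writing the raw atomics in a C++0x-style std::atomic spelling with
explicitly unordered ("relaxed") accesses; that spelling is only my
approximation, not settled syntax:

    #include <atomic>

    std::atomic<int> hits(0);   // tagged: may be involved in races
    int scratch;                // ordinary variable: must stay race-free

    void worker() {
        // The moral equivalent of reading and writing a plain
        // (non-volatile) Java field: atomic access, no ordering intended.
        int h = hits.load(std::memory_order_relaxed);
        hits.store(h + 1, std::memory_order_relaxed);
        scratch = 42;           // fine only while no other thread touches it
    }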

> Backing up to explain:
> 
> The atomic classes perform two different roles.
> 
> The first role is to allow some variation in how you can achieve
> correct "strong" semantics. For the main example, some correct uses of
> atomic variables rely on  the combination:
>    {writes/CASes followed by StoreLoad barrier,
>     reads use only local ordering barrier}
> While others rely on the combination:
>    {writes/CASes use only local ordering barrier,
>     all reads are "confirmed" via acquire-mode CAS, entailing barrier}
> You often want the former, but most spinlocks (although not
> blocking locks), some flavors of double-check, and some other common
> constructions are more efficient using variations of the latter
> (including cases involving variables that, by virtue of monotonicity
> constraints, don't actually need to be confirmed). But both are valid.
> However, some other usage combinations do not actually provide
> "strong" (SC or CCCC) semantics even though they are by definition
> data-race free.
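
[For concreteness, here is roughly how I read those two combinations,
written with a std::atomic<int> flag and explicit fences; the
spellings below are my approximation, not settled syntax:]

    #include <atomic>

    std::atomic<int> flag(0);

    // Combination 1: the writer pays.  Store, then a full (StoreLoad)
    // fence; readers need only a locally-ordered (acquire) load.
    void writer_pays(int v) {
        flag.store(v, std::memory_order_release);
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }
    int cheap_read() {
        return flag.load(std::memory_order_acquire);
    }

    // Combination 2: the reader pays.  Writes use only local (release)
    // ordering; each read is "confirmed" by an acquire-mode CAS, whose
    // read-modify-write supplies the missing ordering -- the shape of a
    // typical spinlock acquire.
    void cheap_write(int v) {
        flag.store(v, std::memory_order_release);
    }
    int confirmed_read() {
        int seen = flag.load(std::memory_order_relaxed);
        while (!flag.compare_exchange_weak(seen, seen,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
            // 'seen' was refreshed by the failed CAS; retry until confirmed.
        }
        return seen;
    }
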
> 
> The second role is to provide tools for dealing with variables that
> are sometimes used in thread-local mode (i.e., as normal variables),
> but may at other times be read or written across threads. This is
> experts-only territory.  To be used correctly, the programmer must
> have deep appreciation of the invariants governing local vs shared
> modes. In most cases, a compiler will have no chance itself of proving
> correctness of locality assumptions. Although one would hope that they
> get better at this sort of thing, they will never be perfect at it.
> This is not much different than proving correctness of programs that
> mix atomic/volatile and normal variables, where there are races
> involving the normal ones, but the ultimate result is correct.  In
> Java, one can write programs like this.  For example, I have a
> component in Java that is correct wrt the JMM with probability 1
> because it relies on randomized processing that eventually
> converges.
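
[The local-then-shared handoff I picture here is something like the
sketch below: ordinary writes while the object is still thread-local,
with only the handoff itself tagged.  Names and spellings are made up
for illustration.]

    #include <atomic>

    struct Stats { long hits; long misses; };

    std::atomic<Stats*> published(0);   // the only tagged access path

    void producer() {
        Stats* s = new Stats();
        s->hits = 0;                    // ordinary writes: s is still
        s->misses = 0;                  // thread-local here
        published.store(s, std::memory_order_release);   // handoff
        // From this point on s must be treated as shared; further plain
        // writes would race with readers.
    }

    void consumer() {
        Stats* s = published.load(std::memory_order_acquire);
        if (s) { /* read s->hits, s->misses */ }
    }
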
> 
> I'm not sure I could write such a program in C++ under the no
> semantics for programs with data-races approach. And some of the
> linux-kernel and related examples seem to fall under this category
> too. (Hmm; maybe they will start writing such code in Java instead :-)

I think that you should be able to write these using "raw" atomics in
C++.  The intent is for that to give you atomicity at low cost, which is
kind of what you get from normal Java variables.  (Actually, I think the
Java approach of guaranteeing this for certain types and not others
wouldn't work well for C++, since it doesn't interact well with either
templates or narrow bus embedded processors.  I have no idea how
important the latter actually are for C++.)
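
For example, a monotonic flag (or a statistics counter) that is read
and written racily but converges would look something like the sketch
below; again I'm borrowing the std::atomic/"relaxed" spelling purely
for illustration:

    #include <atomic>

    // A monotonic "done" flag: readers may see the store arbitrarily
    // late, but the computation converges regardless.  The accesses
    // must still be tagged, but should cost essentially nothing.
    std::atomic<bool> done(false);

    void stop() { done.store(true, std::memory_order_relaxed); }

    void work_loop() {
        while (!done.load(std::memory_order_relaxed)) {
            // ... one step of a randomized / convergent computation ...
        }
    }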

> 
> While I was among those who agreed with the data-race-free-only
> approach for C++, it strikes me now that there is a significant
> portion of the C/C++ user base for which this is not likely to be
> desirable. So maybe it is time to revisit the underlying
> issues. Starting with your:
> 
> 
> > 
> > 	r1 = x;
> > 	if (r1 < 2) {
> > 	    ...
> > 	    switch(r1) {
> > 	      case 0:
> > 	      case 1: ...
> > 	    }
> >         }
> > 
> > Under suitable conditions, a compiler may
> > 
> > - Spill r1 during the first ellipses.
> > - Reload r1 from x in the switch expression.
> > - Implement the switch statement using a branch table, and omit the
> > range check, since r1 is known to be in range.
> > 
> > This means that a concurrent change to x, even if it is just used
> > for statistical purposes, will result in a wild branch.
> > 
> 
> 
> 
> In this and related cases, all that a compiler needs to do is assign
> an activation-stack-frame location to hold the spill.  (That's what's
> done in Java VMs). I have trouble believing that this is not a better
> option anyway -- in part because a later read from the frame is very
> unlikely to encounter a cache miss.
> 
> I think the other issues along these lines involve invented
> writes. These might be more controversial. Do you have a list of
> them handy?

I don't, and that's really what would scare me about trying to change
this.  I think this assumption is pervasive in all kinds of compiler
analyses, and nobody remembers where it was made.  Note that the above
example still has the same problem if x is mentioned explicitly
everywhere that r1 is used (and the r1 = x assignment is removed).
Conventional flow analysis will still prove (assuming that x is
unsigned) that x has to be either 0 or 1 when the switch statement is
entered, and hence the bounds check for the branch table can be elided.
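
Concretely, the variant I have in mind is something like the
following, with x an ordinary unsigned global and no r1 at all:

    unsigned x;                       // ordinary, untagged global

    void f() {
        if (x < 2) {
            // ...
            switch (x) {              // second load of x
            case 0: /* ... */ break;
            case 1: /* ... */ break;
            }
        }
    }

    // For a single-threaded execution, flow analysis proves x < 2 still
    // holds at the switch, so a branch table with no range check is a
    // legal translation; a concurrent store to x between the two loads
    // then produces an out-of-range index -- the same wild branch.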

It seems to me that essentially any flow- or context-sensitive compiler
analysis makes this sort of assumption by default.  For Java, compiler
implementers were forced to think about this from the start.  (And a
smaller number probably actually did :-) )  For C++, the tradition is
mostly to think about single-threaded executions.  And Posix basically
tried to say that was OK.  And I suspect that at least until very
recently, Microsoft largely followed those rules as well.

The fact that C++ performance tends to rely on heavyweight static
optimization, whereas Java performance relies more on dynamic information
and lighter-weight optimizations, also doesn't help.

Among the other consequences of trying to "fix" this:

- We need to specify a set of types that are always atomically loaded
and stored.  My guess is that this would not fly in the C/C++ embedded
systems community.
- We sometimes need fences after vtable-pointer initialization in
stack-allocated objects (see the sketch below).
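
The vtable case I have in mind looks roughly like the sketch below;
the names are made up, and the point is only that once races on
ordinary variables are given some semantics, the constructor's
vtable-pointer store has to be ordered before the racy publication of
the stack object:

    struct Base { virtual void run() = 0; };
    struct Derived : Base { void run() { /* ... */ } };

    Base* leaked = 0;        // ordinary pointer, written and read racily

    void thread1() {
        Derived d;           // constructor stores d's vtable pointer
        leaked = &d;         // racy publication of a stack-allocated object
        /* ... d must stay live while another thread might use it ... */
    }

    void thread2() {
        Base* p = leaked;    // racy read
        if (p) p->run();     // for this virtual call to behave at all
                             // sanely, the vtable-pointer store in the
                             // constructor needs a trailing fence
    }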

Hans


