[cpp-threads] Alternatives to SC (prohibiting data races, etc.)

Hans Boehm Hans.Boehm at hp.com
Tue Jan 30 05:55:10 GMT 2007


[Response to the rest of Paul's message]
On Mon, 29 Jan 2007, Paul E. McKenney wrote:

> > 3) The current N2052 memory model proposal disallows all data races
> > on ordinary variables.  This again is consistent with the Posix rules,
> > and there are strong reasons for doing so.  Reading a concurrently
> > modified data location, as in the counter example, is unsafe with
> > current compilers that may correctly perform analyses that assume such
> > changes do not occur.  Canonical example:
> >
> > 	r1 = x;
> > 	if (r1 < 2) {
> > 	    ...
> > 	    switch(r1) {
> > 	      case 0:
> > 	      case 1: ...
> > 	    }
> >         }
> >
> > Under suitable conditions, a compiler may
> >
> > - Spill r1 during the first ellipses.
> > - Reload r1 from x in the switch expression.
> > - Implement the switch statement using a branch table, and omit the
> > range check, since r1 is known to be in range.
>
> Assuming that r1 is local, this should be no problem.  If r1 is shared,
> these would indeed not be good.  But marking r1 to indicate that it is
> shared might make sense.  Note that current compilers can emit
> heavyweight code for volatile variables, so current volatile is not
> sufficient.
It is potentially a problem even if r1 is local and x is shared; that
was the intent.  If r1 needs to be spilled,
it is faster to reload it from the global than to spill to the stack, since
that saves the store.  We have been told that gcc can potentially
do this, though I haven't managed to get it to actually do so.
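
To make the hazard concrete, here is a rough sketch (mine, not from the
earlier message; handler0/handler1/do_work and branch_table are invented
names, and the function-pointer table merely stands in for the compiler's
internal jump table) of what the generated code can effectively amount to
once the compiler reloads x and drops the range check:

    /* x is an ordinary (non-volatile, non-atomic) shared global. */
    unsigned x;

    static void handler0() { /* ... */ }
    static void handler1() { /* ... */ }
    static void do_work()  { /* ... */ }

    typedef void (*case_fn)();
    static case_fn branch_table[2] = { handler0, handler1 };

    void reader()
    {
        unsigned r1 = x;        /* the only load of x in the source */
        if (r1 < 2) {
            do_work();          /* r1 is live across this call and gets spilled */
            /* Rather than reloading the spilled copy from the stack, the
             * compiler may reload x itself (no store needed), and it may
             * drop the range check because r1 < 2 was already established.
             * If another thread has meanwhile stored, say, 7 to x, this
             * indexes branch_table[7]: a wild branch. */
            branch_table[x]();  /* effective code for the switch statement */
        }
    }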

>
> > This means that a concurrent change to x, even if it is just used
> > for statistical purposes, will result in a wild branch.
>
> Yep!!!
>
> > For Java, there are strong reasons to prevent such compiler optimizations.
> > For C++, I believe there are not, especially since they are currently
> > fairly standard.  And there is no reason not to require programmers
> > to label races.
> >
> > (I think there have been Linux kernel discussions about dropping
> > volatile qualifiers for such loads.  Perhaps that's even happened.
> > I think that's not only technically wrong, it's dangerous.)
>
> The problem with volatile is that some compilers go overboard.  To handle
> the situations above, one need only enforce single-access discipline;
> the memory barriers that many compilers emit are not needed.
That is one reason we decided to introduce "atomic" and leave
volatile, broken as it is, alone.
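
To illustrate the direction, here is a minimal sketch of how the racy
statistical read gets labeled with atomics rather than volatile.  I'm
writing it with the std::atomic<> / memory_order spelling these proposals
are evolving toward; the interface in the actual proposals (e.g. the
load_raw() I mention below) is spelled differently, so treat the exact
syntax as approximate:

    #include <atomic>

    std::atomic<int> x{0};      // shared, concurrently modified counter

    int sample()
    {
        // Exactly one labeled load; the compiler may not invent a second
        // one, so the branch-table hazard above cannot arise.
        int r1 = x.load(std::memory_order_relaxed);
        if (r1 < 2) {
            switch (r1) {
            case 0: /* ... */ break;
            case 1: /* ... */ break;
            }
        }
        return r1;
    }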

>
> Please note also that the Linux kernel uses the barrier() primitive
> to handle many of these situations -- the proposals are -not- simply
> to drop "volatile", but rather to replace it with other primitives,
> including barrier().  In your example, if x is shared and r1 is local
> (not shared), then in Linux one might write:
>
>  	r1 = x;
> 	barrier();
>  	if (r1 < 2) {
>  	    ...
>  	    switch(r1) {
>  	      case 0:
>  	      case 1: ...
>  	    }
>          }
>
> The barrier() primitive is '__asm__ __volatile__("": : :"memory")',
> which tells gcc that it cannot assume that x is unchanged.  If both
> x and r1 were shared (though this does not make much sense to me for
> this code fragment), one might introduce a local variable as
> follows:
>
> 	{
> 	  int r2;
>
>  	  r1 = x;
> 	  barrier();
>  	  r2 = r1;
> 	  barrier();
>  	  if (r2 < 2) {
>  	      ...
>  	      switch(r2) {
>  	        case 0:
>  	        case 1: ...
>  	      }
>            }
>          }
>
> > (Of course all of this also breaks trivially if the load of x is
> > not atomic, as might be the case on some embedded hardware.  Thus
> > even a volatile qualifier isn't guaranteed to work, but is probably
> > sufficient in practice for Linux.)
But for our purposes, this would seem to mean that either

- every load of a shared variable has to be followed by barrier(),
which seems to be both less convenient and less optimization-friendly
than load_raw(), or

- the user has to guess where the compiler might misoptimize.

Neither seems like the right approach for us.
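
For concreteness, a sketch of the two options side by side (barrier() is
the gcc clobber quoted above; load_raw() is approximated here by a relaxed
atomic load, which is roughly the intended semantics):

    #include <atomic>

    #define barrier() __asm__ __volatile__("" : : : "memory")

    int x_plain;                    /* option 1: ordinary global, fenced by hand */
    std::atomic<int> x_atomic{0};   /* option 2: the racy access is labeled */

    int read_with_barrier()
    {
        int r1 = x_plain;
        barrier();      /* must follow every racy load, and also keeps the
                           compiler from caching or reordering unrelated
                           memory accesses across this point */
        return r1;
    }

    int read_raw()
    {
        /* Only this one access is pinned down; surrounding code can still
           be optimized freely. */
        return x_atomic.load(std::memory_order_relaxed);
    }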

>
> Yep, Linux mandates that hardware atomically load and store properly
> aligned ints and pointers.
And for Linux that's perfectly reasonable.  For us, I suspect it isn't,
though I don't have great data.

>
> > 4) I have very mixed feelings about load or store fences at this level.
> > I think they do make sense for strict RCU and a few similar applications.
> > I also think they tend to lead to subtly buggy code, especially in the
> > absence of dependence-based ordering for applications that involve
> > handing off an object to another thread, which may then write it.
> > I currently haven't seen enough evidence that adding this kind of
> > ordering constraint (or fence) actually improves performance a lot
> > on major platforms.  If it does, I could live with adding it to
> > low-level atomics, though it would make the interface even wider.
>
> OK...  What do you mean by "low-level atomics"?  We do need to allow
> people to create new interfaces from low-level primitives, even if we
> strongly discourage general use of those low-level primitives.  And
> just for the record, I agree that use of explicit memory barriers should
> be minimized -- this is in fact one reason why they are subsumed into
> the RCU API -- most users of the RCU API need not concern themselves
> with memory barriers.  But the people creating RCU implementations
> -do- need to work with explicit memory barriers.  And yes, there is
> more than just me doing this for the Linux kernel.  ;-)
I'm talking about atomic operations with ordering constraints weaker
than sequential consistency here, as described in N2145.
I don't have a strong objection to also including explicit fences at this
level, if we can define precisely what they mean.
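
For the record, here is roughly what I mean, again written with the
memory_order spelling rather than the proposal's exact interface (so the
names are approximate).  The hand-off case is the release/acquire pair,
and the commented-out variant shows the explicit-fence formulation that
would widen the interface further:

    #include <atomic>

    struct Node { int payload; };

    std::atomic<Node*> slot{nullptr};

    void publish(Node* n)
    {
        n->payload = 42;
        // Release store: the payload write becomes visible to any thread
        // whose acquire load observes this pointer.
        slot.store(n, std::memory_order_release);
    }

    Node* consume()
    {
        Node* n = slot.load(std::memory_order_acquire);
        // Explicit-fence formulation of the same ordering constraint:
        //   Node* n = slot.load(std::memory_order_relaxed);
        //   std::atomic_thread_fence(std::memory_order_acquire);
        return n;       // if n != nullptr, n->payload reads as 42
    }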

Hans


