[cpp-threads] Alternatives to SC (prohibiting data races, etc.)

Paul E. McKenney paulmck at linux.vnet.ibm.com
Tue Jan 30 18:41:56 GMT 2007


On Mon, Jan 29, 2007 at 09:55:10PM -0800, Hans Boehm wrote:
> [Response to the rest of Paul's message]
> On Mon, 29 Jan 2007, Paul E. McKenney wrote:
> 
> > > 3) The current N2052 memory model proposal disallows all data races
> > > on ordinary variables.  This again is consistent with the Posix rules,
> > > and there are strong reasons for doing so.  Reading a concurrently
> > > modified data location, as in the counter example, is unsafe with
> > > current compilers that may correctly perform analyses that assume such
> > > changes do not occur.  Canonical example:
> > >
> > > 	r1 = x;
> > > 	if (r1 < 2) {
> > > 	    ...
> > > 	    switch(r1) {
> > > 	      case 0:
> > > 	      case 1: ...
> > > 	    }
> > >         }
> > >
> > > Under suitable conditions, a compiler may
> > >
> > > - Spill r1 during the first ellipses.
> > > - Reload r1 from x in the switch expression.
> > > - Implement the switch statement using a branch table, and omit the
> > > range check, since r1 is known to be in range.
> >
> > Assuming that r1 is local, this should be no problem.  If r1 is shared,
> > these would indeed not be good.  But marking r1 to indicate that it is
> > shared might make sense.  Note that current compilers can emit
> > heavyweight code for volatile variables, so current volatile is not
> > sufficient.
>
> It is potentially a problem even if r1 is local and x is shared; that
> was the intent.  If r1 needs to be spilled,
> it is faster to reload it from the global than to spill to the stack, since
> that saves the store.  We have been told that gcc can potentially
> do this, though I haven't managed to get it to actually do so.

In Linux we indeed assume (rightly or wrongly) that gcc might reload
from a global unless told not to via barrier() or similar primitive.

And if the code were single-threaded (either inherently or because the
snippet of code was under a lock that prevented changes to x), then
I am perfectly OK with gcc reloading x.  ;-)
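
To make the hazard concrete: without barrier(), the optimizer would
be entitled to transform Hans's example into something like the
following (a sketch of the transformation he describes, not actual
gcc output):

	r1 = x;			/* first load of shared x */
	if (r1 < 2) {
	    ...			/* r1 spilled; its register reused */
	    switch (x) {	/* "reload": a second load of x, which
				   may have changed concurrently, so the
				   unchecked branch table can be indexed
				   out of range -- the wild branch */
	      case 0:
	      case 1: ...
	    }
	}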

> > > This means that a concurrent change to x, even if it is just used
> > > for statistical purposes, will result in a wild branch.
> >
> > Yep!!!
> >
> > > For Java, there are strong reasons to prevent such compiler optimizations.
> > > For C++, I believe there are not, especially since they are currently
> > > fairly standard.  And there is no reason not to require programmers
> > > to label races.
> > >
> > > (I think there have been Linux kernel discussions about dropping
> > > volatile qualifiers for such loads.  Perhaps that's even happened.
> > > I think that's not only technically wrong, it's dangerous.)
> >
> > The problem with volatile is that some compilers go overboard.  To handle
> > the situations above, one need only enforce single-access discipline;
> > the memory barriers that many compilers emit are not needed.
>
> That is one reason we decided to introduce "atomic" and leave
> volatile, broken as it is, alone.

Understood.

> > Please note also that the Linux kernel uses the barrier() primitive
> > to handle many of these situations -- the proposals are -not- simply
> > to drop "volatile", but rather to replace it with other primitives,
> > including barrier().  In your example, if x is shared and r1 is local
> > (not shared), then in Linux one might write:
> >
> >  	r1 = x;
> > 	barrier();
> >  	if (r1 < 2) {
> >  	    ...
> >  	    switch(r1) {
> >  	      case 0:
> >  	      case 1: ...
> >  	    }
> >          }
> >
> > The barrier() primitive is '__asm__ __volatile__("": : :"memory")',
> > which tells gcc that it cannot assume that x is unchanged.  If both
> > x and r1 were shared (though this does not make much sense to me for
> > this code fragment), one might introduce a local variable as
> > follows:
> >
> > 	{
> > 	  int r2;
> >
> > 	  r1 = x;
> > 	  barrier();
> > 	  r2 = r1;
> > 	  barrier();
> > 	  if (r2 < 2) {
> > 	      ...
> > 	      switch(r2) {
> > 	        case 0:
> > 	        case 1: ...
> > 	      }
> > 	  }
> > 	}
> >
> > > (Of course all of this also breaks trivially if the load of x is
> > > not atomic, as might be the case on some embedded hardware.  Thus
> > > even a volatile qualifier isn't guaranteed to work, but is probably
> > > sufficient in practice for Linux.)
>
> But for our purposes, this would seem to mean that either
> 
> - every load of a shared variable has to be followed by barrier(),
> which seems to be both less convenient and less optimization-friendly
> than load_raw(), or
> 
> - the user has to guess where the compiler might misoptimize.
> 
> Neither seems like the right approach for us.

If load_raw() can be used as follows:

 	r1 = load_raw(x);
 	if (r1 < 2) {
 	    ...
 	    switch(r1) {
 	      case 0:
 	      case 1: ...
 	    }
         }

and can prevent x from being reloaded without issuing extra locks or
memory barriers, then I -think- it does at least part of what is
needed.  It essentially makes the compiler forget where it got r1
from, correct?
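
For what it is worth, one way to get that effect from current gcc is
a volatile cast, something like the following (a hypothetical sketch
using gcc's __typeof__ extension -- the real load_raw() would
presumably be a typed library function with defined semantics, not a
macro):

	/* Force exactly one access to x; because the access is
	 * volatile, the compiler may not re-derive r1's value by
	 * reloading x later. */
	#define load_raw(x)  (*(volatile __typeof__(x) *)&(x))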

> > Yep, Linux mandates that hardware atomically load and store properly
> > aligned ints and pointers.
>
> And for Linux that is perfectly reasonable.  For us, I suspect it isn't,
> though I don't have great data.

Well, a good part of this discussion has indeed circled around exactly
what requirements we should impose on the underlying hardware.  ;-)

But even the old 8080 would meet this requirement -- you would use
the lhld instruction to get an atomic 16-bit load of a pointer, which
suffices for this 8-bit CPU.  The 8080 would -not- be able to atomically
load a 32-bit quantity -- the memories are quite faded, but I vaguely
remember longs being 32 bits in Whitesmiths C.  Nor would it be able
to atomically fetch a composite pointer that might be used to bank-switch
more than 64K of memory.  However, in the project I was working on, it was
the caller's responsibility to make sure that the correct memory was
mapped before dereferencing the pointer -- data was segregated by type
of structure.  So an atomic 16-bit fetch/store would suffice.  Other
projects might well have made different choices.

So here are the choices I can see:

1.	Pure least-common-denominator.  This will likely result in
	atomics being restricted to char.  I believe that this would
	be unacceptable.

2.	The machine must be able to load and store all basic types
	atomically; otherwise, it cannot claim support of atomics.
	Some older machines might need to fetch double-precision
	floating point in two pieces -- but then again, there are
	machines that simply refuse to support floating point in
	the first place.  I have heard rumors about some sort of
	new decimal floating point type, but know very little about it.

3.	Define different levels of support.  All machines support
	atomic char.  All the popular machines I am aware of support
	atomic integer types.  All the mainstream machines I am aware
	of support atomic binary floating point.  The usual tricks
	could be used to support atomic structures that fit into one of
	these basic types, e.g., bitfields (see the sketch following
	this list).  Very few machines support
	large atomic aggregates (there are some research prototypes that
	support transactional memory, but this does not seem ready for
	standardization).

	This is the de facto situation today with floating point.
	Most machines and environments support it, but the language
	remains usable in environments that do not.  For example, many
	operating system
	kernels forbid use of floating point in order to optimize
	context-switch latency.  Similarly, many embedded CPUs don't
	support floating point in any real way (yes, there might be
	library functions, but forget about having any memory left over
	should you be so foolish as to cause them to be included in
	your program).

4.	Require that "atomic" be applicable to -all- data structures.
	To make this work, the compiler has to know quite a bit about
	the environment.  Does it emit cli/sti instructions around
	access to an atomic aggregate?  spin_lock()/spin_unlock()?
	pthread_mutex_lock()/pthread_mutex_unlock()?  And so on...

	And what constitutes an "access"?  A single load or store
	to a field in an atomic structure?  Loading and storing the
	aggregate in toto?  A member function in C++?  If the latter,
	what about mutually recursive member functions in different C++
	structures/classes?

	I don't believe that this last is workable.  If the transactional
	memory guys are on the right path, then perhaps this can appear
	in the future.	However, I do have some questions for those guys
	about member functions that do RPCs to other transactional-memory
	machines -- and that is assuming that I am willing to cut them
	a break on programs accessing MMIO device registers!

Any other options?  Refinements of the above options?  At this point, my
guess is that #3 is the only workable option.
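
To illustrate the "usual tricks" mentioned under #3, here is a
minimal sketch of packing a small structure into a single machine
word so that an ordinary aligned load covers it atomically.  The
union, the field names, and the assumption that aligned 32-bit
accesses are atomic are all illustrative:

	/* Two 16-bit counters packed into one 32-bit word, so that a
	 * single aligned load snapshots both of them together. */
	union counter_pair {
		struct {
			unsigned short hits;
			unsigned short misses;
		} f;
		unsigned int word;	/* assumed atomic when aligned */
	};

	union counter_pair counters;	/* shared, updated concurrently */

	void snapshot(unsigned short *h, unsigned short *m)
	{
		union counter_pair snap;

		/* volatile cast forces a single 32-bit load */
		snap.word = *(volatile unsigned int *)&counters.word;
		*h = snap.f.hits;
		*m = snap.f.misses;
	}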

> > > 4) I have very mixed feelings about load or store fences at this level.
> > > I think they do make sense for strict RCU and a few similar applications.
> > > I also think they tend to lead to subtly buggy code, especially in the
> > > absence of dependence-based ordering for applications that involve
> > > handing off an object to another thread, which may then write it.
> > > I currently haven't seen enough evidence that adding this kind of
> > > ordering constraint (or fence) actually improves performance a lot
> > > on major platforms.  If it does, I could live with adding it to
> > > low-level atomics, though it would make the interface even wider.
> >
> > OK...  What do you mean by "low-level atomics"?  We do need to allow
> > people to create new interfaces from low-level primitives, even if we
> > strongly discourage general use of those low-level primitives.  And
> > just for the record, I agree that use of explicit memory barriers should
> > be minimized -- this is in fact one reason why they are subsumed into
> > the RCU API -- most users of the RCU API need not concern themselves
> > with memory barriers.  But the people creating RCU implementations
> > -do- need to work with explicit memory barriers.  And yes, there is
> > more than just me doing this for the Linux kernel.  ;-)
>
> I'm talking about atomic operations with ordering constraints weaker
> than sequential consistency here, as described in N2145.
> I don't have a strong objection to also including explicit fences at this
> level, if we can define precisely what they mean.

OK.  We might be on nearly the same page, then.
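
For the record, here is roughly how the Linux RCU API hides those
barriers from its users (a simplified sketch of the current
in-kernel macros -- the real definitions include additional
type-checking):

	/* Publisher: order initialization of the pointed-to structure
	 * before the pointer assignment becomes visible to readers. */
	#define rcu_assign_pointer(p, v) \
		({ smp_wmb(); (p) = (v); })

	/* Reader: a plain load on most architectures, but Alpha also
	 * needs a data-dependency barrier after the load. */
	#define rcu_dereference(p) \
		({ typeof(p) _p = (p); smp_read_barrier_depends(); (_p); })

Callers see only rcu_assign_pointer() and rcu_dereference(); the
underlying smp_wmb() and smp_read_barrier_depends() never appear in
their code.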

							Thanx, Paul


