[cpp-threads] Alternatives to SC (prohibiting data races, etc.)

Wed Jan 31 17:54:33 GMT 2007

On Tue, Jan 30, 2007 at 04:16:26PM -0600, Boehm, Hans wrote:
> > From: Paul E. McKenney [mailto:paulmck at linux.vnet.ibm.com] 
> > 
> > If load_raw() can be used as follows:
> > 
> >  	r1 = load_raw(x);
> >  	if (r1 < 2) {
> >  	    ...
> >  	    switch(r1) {
> >  	      case 0:
> >  	      case 1: ...
> >  	    }
> >  	}
> > 
> > and prevent x from being reloaded without issuing extra locks 
> > or memory barriers, I -think- that it does at least part of 
> > what it needs to.  It is essentially making the compiler 
> > forget where it got r1 from, correct?
>
> Right. And that was always the intent.  With load_raw the wild branch is
> no longer possible.  And I think we've imposed considerably fewer
> compiler constraints than barrier().  And the code remains annotated
> with a warning that there is an "intentional race" here (though we no
> longer call it a data race, since it involves atomics).

OK -- but I didn't see load_raw() in N2145.  This is something you
are proposing adding -- or did I get the wrong document?
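
For concreteness, here is a minimal sketch of how I read the proposal.
It assumes load_raw() behaves like a relaxed load on a std::atomic<int>
as later standardized in C++11; that mapping is my assumption, not
something taken from N2145, and classify() is a hypothetical name used
only for illustration.

```cpp
#include <atomic>

// Hypothetical stand-in for the load_raw() discussed above: a single
// relaxed atomic load, with no lock and no memory barrier.
inline int load_raw(const std::atomic<int>& x) {
    return x.load(std::memory_order_relaxed);
}

// Returns 10 for 0, 11 for 1, and -1 for any other value of x.
int classify(const std::atomic<int>& x) {
    int r1 = load_raw(x);    // x is read exactly once
    if (r1 < 2) {
        switch (r1) {        // same r1 as the range check above
          case 0: return 10;
          case 1: return 11;
        }
    }
    return -1;
}
```

Because r1 comes from one atomic load, the range check and the switch
are guaranteed to see the same value, so a jump table indexed by a
second, different read of x cannot arise.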

If I understand it correctly, your load_raw() would allow developers
to implement per-thread counters, which is good.  However, it would
not suffice for rcu_dereference() on platforms like Alpha.  Adding
the explicit memory barrier required for Alpha penalizes other platforms.
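
The per-thread counter case might look something like the sketch below.
The names (kMaxThreads, Slot, count_event, read_total) are hypothetical,
the 64-byte padding is a guess at a typical cache-line size, and the
relaxed loads and stores stand in for load_raw() and a matching raw
store.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kMaxThreads = 4;   // hypothetical fixed thread count

// One slot per thread, padded so that slots do not share a cache line.
struct alignas(64) Slot {
    std::atomic<long> n{0};
};

Slot counters[kMaxThreads];

// Called only by thread `tid`: updates its own slot with a relaxed load
// followed by a relaxed store, so no atomic read-modify-write is needed.
void count_event(std::size_t tid) {
    long v = counters[tid].n.load(std::memory_order_relaxed);
    counters[tid].n.store(v + 1, std::memory_order_relaxed);
}

// Called by any thread: sums all slots. The result is only approximate
// while other threads are still counting.
long read_total() {
    long sum = 0;
    for (std::size_t i = 0; i < kMaxThreads; ++i)
        sum += counters[i].n.load(std::memory_order_relaxed);
    return sum;
}
```

Each slot has a single writer, so the only "intentional race" is
between that writer and the readers summing the slots.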

One possibility would be load_raw_dep() or some such that causes the
compiler to "forget" where the argument came from (same as load_raw()),
and also to issue an rmb() (LoadLoad) on platforms such as Alpha that
do not respect ordering of dependent loads.

Does this seem reasonable?
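
A rough sketch of what I have in mind, under stated assumptions:
load_raw_dep() is spelled here as a memory_order_consume load, which is
the dependency-ordered load that C++11 later standardized, and which
most compilers currently strengthen to an acquire load. Foo, gp,
publish() and read_a() are hypothetical names for illustration only.

```cpp
#include <atomic>

struct Foo { int a; };

std::atomic<Foo*> gp{nullptr};   // hypothetical RCU-protected pointer

// Hypothetical stand-in for load_raw_dep(): the compiler "forgets" where
// the value came from, and later loads that depend on the returned
// pointer are ordered after this load, even on Alpha.
inline Foo* load_raw_dep(const std::atomic<Foo*>& p) {
    return p.load(std::memory_order_consume);
}

// Updater: initialize the object, then make it reachable.
void publish(Foo* f, int value) {
    f->a = value;
    gp.store(f, std::memory_order_release);
}

// Reader, in the style of rcu_dereference(): returns the field, or -1 if
// nothing has been published yet.
int read_a() {
    Foo* p = load_raw_dep(gp);
    return p ? p->a : -1;
}
```

On platforms that already respect dependency ordering, the reader side
should need no barrier instruction at all, which is the point of keeping
this separate from load_raw().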

> > > > Yep, Linux mandates that hardware atomically load and store properly
> > > > aligned ints and pointers.
> > >
> > > And for Linux that's perfectly reasonable.  For us, I suspect it isn't,
> > > though I don't have great data.
> > 
> > Well, a good part of this discussion has indeed circled 
> > around exactly what requirements we should impose on the 
> > underlying hardware.  ;-)
> > 
> > But even the old 8080 would meet this requirement -- you 
> > would use the lhld instruction to get an atomic 16-bit load 
> > of a pointer, which suffices for this 8-bit CPU.  The 8080 
> > would -not- be able to atomically load a 32-bit quantity -- 
> > the memories are quite faded, but I vaguely remember longs 
> > being 32 bits in Whitesmiths C.  Nor would it be able to 
> > atomically fetch a composite pointer that might be used to 
> > bank-switch more than 64K of memory.  However, in the project 
> > I was working on, it was the caller's responsibility to make 
> > sure that the correct memory was mapped before dereferencing 
> > the pointer -- data was segregated by type of structure.  So 
> > an atomic 16-bit fetch/store would suffice.  Other projects 
> > might well have made different choices.
> > 
> > So here are the choices I can see:
> > 
> > 1.	Pure least-common-denominator.  This will likely result in
> > 	atomics being restricted to char.  I believe that this would
> > 	be unacceptable.
> > 
> > 2.	The machine must be able to load and store all basic types
> > 	atomically, otherwise it cannot claim support of atomics.
> > 	Some older machines might need to fetch double-precision
> > 	floating point in two pieces -- but then again, there are
> > 	machines that simply refuse to support floating point in
> > 	the first place.  I have heard rumors about some sort of
> > 	new decimal floating point type, but know very little about it.
> > 
> > 3.	Define different levels of support.  All machines support
> > 	atomic char.  All the popular machines I am aware of support
> > 	atomic integer types.  All the mainstream machines I am aware
> > 	of support atomic binary floating point.  The usual tricks
> > 	could be used to support atomic structures that fit into one of
> > 	these basic types, e.g., bitfields.  Very few machines support
> > 	large atomic aggregates (there are some research prototypes that
> > 	support transactional memory, but this does not seem ready for
> > 	standardization).
> > 
> > 	This is the de facto situation today with floating point.
> > 	Most machines and environments support it, but things work in
> > 	environments that do not.  For example, many operating system
> > 	kernels forbid use of floating point in order to optimize
> > 	context-switch latency.  Similarly, many embedded CPUs don't
> > 	support floating point in any real way (yes, there might be
> > 	library functions, but forget about having any memory left over
> > 	should you be so foolish as to cause them to be included in
> > 	your program).
> > 
> > 4.	Require that "atomic" be applicable to -all- data structures.
> > 	To make this work, the compiler has to know quite a bit about
> > 	the environment.  Does it emit cli/sti instructions around
> > 	access to an atomic aggregate?	spin_lock()/spin_unlock()?
> > 	pthread_mutex_lock()/ pthread_mutex_unlock()?  And so on...
> > 
> > 	And what constitutes an "access"?  A single load or store
> > 	to a field in an atomic structure?  Loading and storing the
> > 	aggregate in toto?  A member function in C++?  If the latter,
> > 	what about mutually recursive member functions in different C++
> > 	structures/classes?
> > 
> > 	I don't believe that this last is workable.  If the transactional
> > 	memory guys are on the right path, then perhaps this can appear
> > 	in the future.	However, I do have some questions for those guys
> > 	about member functions that do RPCs to other transactional-memory
> > 	machines -- and that is assuming that I am willing to cut them
> > 	a break on programs accessing MMIO device registers!
> > 
> > Any other options?  Refinements of the above options?  At 
> > this point, my guess is that #3 is the only workable option.
> > 
> That's essentially what N2145 does.  It promises to provide atomic_xyz
> for various (currently non-fp) builtin types xyz, and then provides a
> way to ask whether the available implementation is actually lock-free,
> or emulated with locks.  It is understood that the emulated version is
> not universally useful, but I think that for portable user level code,
> it's often sufficient to make the code work.  The generic atomic<T>
> behaves the same way, though it's more likely to be emulated with locks
> on a particular platform.

So atomic<char>::lock_free() would normally (always?) return true, while
atomic<struct foo>::lock_free() would normally return false, right?
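
To make the question concrete, here is a small sketch using the
spelling that was eventually standardized: the per-object member
function is_lock_free() (C++11) and the per-type compile-time constant
is_always_lock_free (C++17). The lock_free() name above is from this
thread; struct foo and its 64-byte size are hypothetical, chosen only to
be wider than any native atomic access I know of.

```cpp
#include <atomic>

struct foo { char bytes[64]; };   // hypothetical aggregate

// Compile-time answers (C++17): true only if every object of the type
// is guaranteed to be lock-free on this implementation.
constexpr bool char_lock_free = std::atomic<char>::is_always_lock_free;
constexpr bool foo_lock_free  = std::atomic<foo>::is_always_lock_free;
```

I would expect char_lock_free to be true on essentially any mainstream
platform and foo_lock_free to be false, with atomic<foo> emulated with
locks. The runtime query is_lock_free() answers the same question for a
particular object; on some toolchains, large types may need an extra
atomic support library at link time.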

						Thanx, Paul