[cpp-threads] Yet another visibility question

Paul E. McKenney paulmck at linux.vnet.ibm.com
Sat Dec 23 03:11:54 GMT 2006


On Fri, Dec 22, 2006 at 03:28:55PM -0600, Boehm, Hans wrote:
> > From:  Paul E. McKenney
> > > I think currently that's violated only for implementations of
> > > foo() that exhibit undefined semantics.  If we allow out of
> > > thin air values, foo() could return &x, and legitimate programs
> > > could return nonzero from f(), which argues that some tools
> > > might have to consider the possibility.  Presumably, none of
> > > this could happen in a real implementation, which argues that
> > > this would be a weird behavior to standardize.
> > 
> > If foo() makes use of primitives similar to gcc's (admittedly 
> > nonstandard)
> > __builtin_frame_address() primitive, then foo() might well be 
> > able to affect the value of x, and thus the value returned 
> > from f().  I would certainly hope that people would avoid 
> > doing this sort of thing except for debugging, but you did ask...
> I agree that we can already get a hold of the address of the local using
> platform extensions and the like.  But I still think there is currently
> no standard way to do so, and I think it's fairly clear that compiler
> optimization currently gets to ignore aliases created like this.  If we
> allowed out-of-thin-air results, we'd at least be seriously muddying the
> water.

Fair enough.

> > FWIW, here is a quick summary of the ordering approach taken 
> > by the Linux kernel.
> > 
> > There are two types of directives:
> > 
> > o	Compiler control, done either through the "barrier()" directive
> > 	or via the "memory" option to a gcc asm directive.
> > 
> > o	CPU control, via explicit memory-barrier directives:
> > 
> > 	o	smp_mb() and mb() are full memory barriers.
> > 
> > 	o	smp_rmb() and rmb() are memory barriers that
> > 		are only guaranteed to segregate reads.
> > 
> > 	o	smp_wmb() and wmb() are memory barriers that
> > 		are only guaranteed to segregate writes.
> > 
> > 	The "smp_" prefix causes code to be generated only in
> > 	an SMP build.  So, for example, smp_wmb() would be used
> > 	to order a pair of writes making up some sort of
> > 	synchronization primitive, while wmb() would be used
> > 	to order MMIO writes in device drivers.
> > 
> > 	This distinction is due to the fact that a single CPU
> > 	will always see its own cached references to normal
> > 	memory in program order, while reordering can happen
> > 	(even on a single-CPU system) from the viewpoint of
> > 	an I/O device.
> > 
> > 	(Note that this is not a complete list, but covers
> > 	the most commonly used primitives.)
> > 
> > The CPU-control directives imply compiler-control directives, 
> > since there is little point in forcing ordering at the CPU if 
> > one has not also forced ordering in the compiler.  However, 
> > it -is- useful to force ordering at the compiler only, for 
> > example, when writing code that interacts with an interrupt handler.
> > 
> > Most uses of both the compiler- and SMP-CPU-control 
> > directives are "buried" in higher-level primitives.  For example:
> > 
> > 	Primitive	Uses
> > 	---------	----
> > 	barrier():	 564
> > 	smp_mb():	 171
> > 	smp_rmb():	  87
> > 	smp_wmb():	 150
> > 
> > The unconditional CPU-control directives are used more 
> > heavily, due to the fact that the many hundreds of device 
> > drivers in the Linux kernel must make use of them:
> > 
> > 	Primitive	Uses
> > 	---------	----
> > 	mb():		2480
> > 	rmb():		 168
> > 	wmb():		 587
> > 
> > These numbers might seem large, but there are more than 
> > 40,000 instances of locking primitives in the Linux kernel, 
> > -not- counting cases where a locking primitive use is 
> > "wrappered" in a small function or macro.
> > 
> > Applications that don't interact with either interrupts or 
> > signal handlers might have less need for the raw CPU-control 
> > primitives.
> > 
> Thanks.
> 
> This has been addressed to some extent in some of the earlier papers,
> notably http://open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html .
> 
> I don't believe this is the right approach for us, for several reasons,
> listed below.  But I think it's important to get this issue resolved,
> since it's fundamental enough to both the atomics interface and the
> memory model that we cannot change our mind at the last minute.  My
> reasons for not going this route:
> 
> 1) It doesn't allow you to express the correct constraint for even
> something like a spin-lock release.  It would have to be written as
> something like
> 
> 	mb(); lock = 0;
> 
> where lock is an atomic and assignment corresponds to what we have been
> calling store_raw.  This in turn would have to result in an mfence on
> X86, an mf on Itanium, or a full sync on PowerPC, all of which are
> substantial overkill.  I think Linux avoids this with custom assembly
> code for things like spinlock releases.

Linux does indeed provide custom assembly for spinlocks.  However,
last I checked, glibc also provides custom assembly for the user-mode
equivalent, namely pthread_mutex_locks.  So I don't agree with use
of spinlocks as a driving use case, though I do agree that there are
other use cases that would benefit from more specific constraints.
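
For example, the x86 spinlock release is (roughly) a plain store,
relying on the fact that x86 does not reorder a store with earlier
loads or stores -- again a sketch, not the exact kernel source:

	static inline void spin_unlock_sketch(volatile int *lock)
	{
		/* A plain store suffices: x86 stores are not reordered
		 * with earlier memory references, giving release
		 * semantics without an mfence.  The "memory" clobber
		 * keeps the compiler from migrating critical-section
		 * references past the store. */
		__asm__ __volatile__("movl $1,%0"
				     : "=m" (*lock)
				     :
				     : "memory");
	}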

> In general when you set a variable in order to hand off some state, you
> need to make sure that prior memory operations are not reordered with
> respect to the setting of the variable; you do not care what happens to
> later memory operations.  A fence based formulation ends up specifying
> too strong a constraint, which you then have to respect in the
> implementation.

And one can come up with cases where even a semi-permeable fence is
overly constraining.  For example, full memory barriers unnecessarily
constrain the order of references to local variables such as compiler
temporaries.  That said, there may well be a tradeoff between complexity
and expressiveness.
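
For contrast, a per-operation formulation of the spinlock release
might look something like the following, where the name
"atomic_store_release" is purely illustrative rather than any
specific proposal's spelling:

	/* Prior critical-section references may not be reordered past
	 * this store; later references are unconstrained.  On x86 this
	 * can compile to a plain store, avoiding the mfence that the
	 * fence-based formulation would require. */
	atomic_store_release(&lock, 0);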

> 2) Linux subsequently had to extend this scheme with the various
> smp_mb__before_xxx()/smp_mb__after_xxx() and read_barrier_depends()
> primitives.  This
> strikes me as significantly more complicated than what we have been
> discussing, and it does a worse job at expressing the correct
> constraints than the acquire/release formulation.

Linux certainly could have just insisted that atomic_inc() and friends
include a memory barrier, but that was rejected due to the overhead.
And there are cases where the increment must be atomic, but need not
be ordered.
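
For example -- a sketch of the two usage patterns, with field names
invented for illustration:

	/* Statistical counter: the increment must be atomic, but no
	 * ordering is required, so atomic_inc() implies no barrier. */
	atomic_inc(&stats->nr_events);

	/* Handoff: here ordering -is- required, and is requested
	 * explicitly.  On x86 this extra primitive degenerates to a
	 * compiler barrier, since the locked instruction already
	 * orders; weaker architectures emit a real memory barrier. */
	smp_mb__before_atomic_dec();
	atomic_dec(&obj->refcnt);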

>                                                    (And I don't believe
> that the semantics of read_barrier_depends() are actually definable,
> which was the subject of the previous discussion here.)

By "previous discussion", you mean the emails on this list in May of 2005?

If by "definable" you mean "precisely defined in all contexts in which it
could be used without getting a compilation error", then I won't argue.
Otherwise, as you no doubt guessed, it means that a prior fetch of an
index or pointer must be ordered before any subsequent dereferencing
using that index or pointer.  For example, in:

	a = 1;
	read_barrier_depends();
	b = 2;

it might or might not force any ordering.  Currently it does on Alpha,
but not necessarily on any other platform (x86's store ordering could
be defeated by compiler optimizations, for example).

However, in:

	a = global_pointer;
	read_barrier_depends();
	b = a->data;

it must force the fetch of global_pointer to precede the fetch of a->data.
Similarly, in:

	a = global_index;
	read_barrier_depends();
	b = global_array[a];

it must force the fetch of global_index to precede the fetch of
global_array[a].
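
For completeness, the pointer example is intended to be paired with a
write-side barrier, along the following (sketched) lines:

	/* Writer: initialize, then publish. */
	p->data = 42;
	smp_wmb();		/* order initialization before publication */
	global_pointer = p;

	/* Reader: the data dependency orders the two fetches on all
	 * architectures except Alpha, where read_barrier_depends()
	 * emits a real barrier; elsewhere it generates no code. */
	a = global_pointer;
	read_barrier_depends();
	b = a->data;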

Is the distinction between the first example on the one hand and the
last two examples on the other what you were getting at?  Or am I missing
the point?

> 3) It's completely inconsistent with the Java "volatile" approach.  Thus
> it would force many programmers to understand two completely different
> approaches.

Hmmm...  There are a lot of other things that are completely different
between C/C++ and Java, but consistency is good -- if all else is equal.

> (I think the Linux approach has some other problems, which are less
> relevant here, and are probably not completely fixable until we do our
> job.  We do need to identify loads which can be involved in races, even
> if the fences are explicit.  Otherwise legitimate compiler
> transformations may break the code.  My impression is that Linux
> sometimes, but not always, uses volatile for this purpose, which is
> somewhat sufficient in practice but not guaranteed to work, and may
> involve other significant penalties.)

Linux uses the "volatile" qualifier and "barrier()".  The former has
been the subject of repeated spirited discussion on the Linux-kernel
mailing list, but is nonetheless used rather heavily.  The latter
invokes a gcc extension that prevents the compiler from reordering memory
references across it.
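
For example, in a busy-wait such as:

	while (!flag)
		barrier();	/* force flag to be refetched */

omitting the barrier() (or a volatile qualifier on flag) would permit
gcc to legitimately hoist the load of flag out of the loop, turning it
into an infinite loop.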

The explicit fences use the same gcc extension that barrier() does
to prevent compiler reordering.  So, when you say "not guaranteed to
work", you are saying that gcc does not always respect the "memory"
directive to its __asm__ extension?  Or is there some subtlety here
that I am missing?

						Thanx, Paul


