[cpp-threads] Yet another visibility question

Boehm, Hans hans.boehm at hp.com
Fri Dec 22 21:28:55 GMT 2006


> From:  Paul E. McKenney
> > I think currently that's violated only for implementations of foo()
> > that exhibit undefined semantics.  If we allow out of thin air values,
> > foo() could return &x, and legitimate programs could return nonzero
> > from f(), which argues that some tools might have to consider the
> > possibility.  Presumably, none of this could happen in a real
> > implementation, which argues that this would be a weird behavior to
> > standardize.
> 
> If foo() makes use of primitives similar to gcc's (admittedly nonstandard)
> __builtin_frame_address() primitive, then foo() might well be able to
> affect the value of x, and thus the value returned from f().  I would
> certainly hope that people would avoid doing this sort of thing except
> for debugging, but you did ask...
I agree that we can already get a hold of the address of the local using
platform extensions and the like.  But I still think there is currently
no standard way to do so, and I think it's fairly clear that compiler
optimization currently gets to ignore aliases created like this.  If we
allowed out-of-thin-air results, we'd at least be seriously muddying the
water.
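
For concreteness, here is roughly the kind of thing I take Paul to mean.
Everything below is a made-up sketch: foo(), f(), and the frame offset are
hypothetical, __builtin_frame_address() is a gcc extension, and the store
through the guessed pointer is undefined behavior however it lands.

    int *leaked;    // hypothetical channel for the smuggled address

    __attribute__((noinline)) void foo() {
        // Frame address of the caller; where x lives relative to it is
        // completely unspecified, so the offset below is pure guesswork.
        char *caller_frame =
            static_cast<char *>(__builtin_frame_address(1));
        leaked = reinterpret_cast<int *>(caller_frame - sizeof(int));
        *leaked = 1;    // may or may not hit x; undefined behavior regardless
    }

    int f() {
        int x = 0;      // no standard-conforming way for foo() to name x
        foo();
        return x;       // the optimizer may legitimately fold this to 0
    }

Today's optimizers are entitled to assume no such alias to x exists, which
is exactly the property I would like us to preserve.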

> 
> FWIW, here is a quick summary of the ordering approach taken 
> by the Linux kernel.
> 
> There are two types of directives:
> 
> o	Compiler control, done either through the "barrier()" directive
> 	or via the "memory" option to a gcc asm directive.
> 
> o	CPU control, via explicit memory-barrier directives:
> 
> 	o	smp_mb() and mb() are full memory barriers.
> 
> 	o	smp_rmb() and rmb() are memory barriers that
> 		are only guaranteed to segregate reads.
> 
> 	o	smp_wmb() and wmb() are memory barriers that
> 		are only guaranteed to segregate writes.
> 
> 	The "smp_" prefix causes code to be generated only in
> 	an SMP build.  So, for example, smp_wmb() would be used
> 	to order a pair of writes making up some sort of
> 	synchronization primitive, while wmb() would be used
> 	to order MMIO writes in device drivers.
> 
> 	This distinction is due to the fact that a single CPU
> 	will always see its own cached references to normal
> 	memory in program order, while reordering can happen
> 	(even on a single-CPU system) from the viewpoint of
> 	an I/O device.
> 
> 	(Note that this is not a complete list, but covers
> 	the most commonly used primitives.)
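
For readers who have not looked at the kernel sources, these are roughly
along the following lines (a sketch only; x86 shown, and the exact
definitions vary by architecture and kernel version):

    /* Compiler-only barrier: the "memory" clobber tells gcc not to cache
     * or reorder memory accesses across this point; no instruction is
     * emitted. */
    #define barrier() __asm__ __volatile__("" : : : "memory")

    /* Full CPU barrier on x86 (mfence assumed here; some configurations
     * use a locked add instead).  It is a compiler barrier as well. */
    #define mb() __asm__ __volatile__("mfence" : : : "memory")

    /* In a non-SMP build the smp_* variants degenerate to a compiler
     * barrier, since a single CPU sees its own accesses to normal memory
     * in program order. */
    #ifdef CONFIG_SMP
    #define smp_mb() mb()
    #else
    #define smp_mb() barrier()
    #endif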
> 
> The CPU-control directives imply compiler-control directives, 
> since there is little point in forcing ordering at the CPU if 
> one has not also forced ordering in the compiler.  However, 
> it -is- useful to force ordering at the compiler only, for 
> example, when writing code that interacts with an interrupt handler.
> 
> Most uses of both the compiler- and SMP-CPU-control 
> directives are "buried" in higher-level primitives.  For example:
> 
> 	Primitive	Uses
> 	---------	----
> 	barrier():	 564
> 	smp_mb():	 171
> 	smp_rmb():	  87
> 	smp_wmb():	 150
> 
> The unconditional CPU-control directives are used more 
> heavily, due to the fact that the many hundreds of device 
> drivers in the Linux kernel must make use of them:
> 
> 	Primitive	Uses
> 	---------	----
> 	mb():		2480
> 	rmb():		 168
> 	wmb():		 587
> 
> These numbers might seem large, but there are more than 
> 40,000 instances of locking primitives in the Linux kernel, 
> -not- counting cases where a locking primitive use is 
> "wrappered" in a small function or macro.
> 
> Applications that don't interact with either interrupts or 
> signal handlers might have less need for the raw CPU-control 
> primitives.
> 
Thanks.

This has been addressed to some extent in some of the earlier papers,
notably http://open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html .

I don't believe this is the right approach for us, for several reasons,
listed below.  But I think it's important to get this issue resolved,
since it's fundamental enough to both the atomics interface and the
memory model that we cannot change our mind at the last minute.  My
reasons for not going this route:

1) It doesn't allow you to express the correct constraint for even
something like a spin-lock release.  It would have to be written as
something like

	mb(); lock = 0;

where lock is an atomic and assignment corresponds to what we have been
calling store_raw.  This in turn would have to result in an mfence on
X86, an mf on Itanium, or a full sync on PowerPC, all of which are
substantial overkill.  I think Linux avoids this with custom assembly
code for things like spin-lock releases.

In general, when you set a variable in order to hand off some state, you
need to make sure that prior memory operations are not reordered with
respect to the setting of the variable; you do not care what happens to
later memory operations.  A fence-based formulation ends up specifying
too strong a constraint, which you then have to respect in the
implementation.
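
To make the contrast concrete, here is a sketch of the two formulations,
written with C++11-style std::atomic syntax purely for illustration: the
type, the fence call, and the ordering names are stand-ins for whatever
interface we adopt, and lock_word and its encoding are made up.

    #include <atomic>

    std::atomic<int> lock_word(1);   // 1 = held, 0 = free (hypothetical)

    void unlock_fence_style() {
        // Fence-based formulation: a full barrier, then a raw store.  This
        // costs an mfence on x86, an mf on Itanium, or a full sync on
        // PowerPC, even though nothing after the store needs to be held back.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        lock_word.store(0, std::memory_order_relaxed);
    }

    void unlock_release_style() {
        // Release formulation: only prior operations are constrained to stay
        // before the store.  This is an ordinary store on x86, an st.rel on
        // Itanium, and an lwsync plus a store on PowerPC.
        lock_word.store(0, std::memory_order_release);
    }

The release form states exactly the constraint the hand-off needs and
nothing more, which is what the fence-based formulation cannot express.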

2) Linux subsequently had to augment this scheme with the various
mb__before/mb__after and read_barrier_depends() primitives.  This
strikes me as significantly more complicated than what we have been
discussing, and it does a worse job of expressing the correct
constraints than the acquire/release formulation.  (And I don't believe
that the semantics of read_barrier_depends() are actually definable,
which was the subject of the previous discussion here.)
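
For reference, the pattern read_barrier_depends() exists to cover is
pointer publication followed by a dependent load, roughly as sketched
below.  The names and types are made up, and I am again using C++11-style
atomics purely for illustration.

    #include <atomic>

    struct item { int payload; };

    std::atomic<item *> published(nullptr);

    void producer(item *p) {
        p->payload = 42;
        // Kernel style: initialize the item, smp_wmb(), then publish it.
        published.store(p, std::memory_order_release);
    }

    int consumer() {
        // Kernel style: p = published; read_barrier_depends(); use
        // p->payload.  That barrier is a no-op on every architecture except
        // Alpha, and its meaning rests on a notion of "data dependency" that
        // is hard to pin down at the language level; an acquire load is
        // shown here instead.
        item *p = published.load(std::memory_order_acquire);
        return p ? p->payload : 0;
    }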

3) It's completely inconsistent with the Java "volatile" approach.  Thus
it would force many programmers to understand two completely different
approaches.

(I think the Linux approach has some other problems, which are less
relevant here, and are probably not completely fixable until we do our
job.  We do need to identify loads which can be involved in races, even
if the fences are explicit.  Otherwise legitimate compiler
transformations may break the code.  My impression is that Linux
sometimes, but not always, uses volatile for this purpose, which is
mostly sufficient in practice, but is not guaranteed to work, and may
involve other significant penalties.)
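
To illustrate that last point, consider a busy-wait on a plain variable.
If the racing load is not identified to the compiler, a perfectly legal
transformation breaks it; the volatile cast below is the kernel-style
workaround I have in mind.  (A sketch only; flag and the function names
are made up.)

    int flag;    // shared with an interrupt handler or another thread

    void wait_broken() {
        // With the racing load unidentified, the compiler may read flag
        // once, cache it in a register, and turn the busy-wait into an
        // infinite loop.
        while (!flag) {
            /* spin */
        }
    }

    void wait_volatile_cast() {
        // The volatile cast forces a fresh load on each iteration, but says
        // nothing about hardware ordering and can inhibit other
        // optimizations.
        while (!*(volatile int *)&flag) {
            /* spin */
        }
    }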

Hans


