[cpp-threads] Yet another visibility question
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Thu Dec 21 22:25:36 GMT 2006
On Wed, Dec 20, 2006 at 07:26:51PM -0600, Boehm, Hans wrote:
> > From: Doug Lea
> >
> > Boehm, Hans wrote:
> >
> > > Con (a showstopper, I think): We don't know how to express
> > > dependency-based ordering in a way that's not broken by
> > conventional
> > > optimizations on other compilation units that don't mention
> > atomics,
> > > and remains usable of the store takes place on the far side of an
> > > abstraction boundary.
> > >
> >
> >
> > Of the use cases for raw operations, it seems that the most
> > defensible one is hand-crafted speculation. As in (for an atomic x)
> >
> >     int rawx = x.load_raw();
> >     int nextx = f(rawx);
> >     if (x.cas(rawx, nextx)) return;
> >     // else maybe get a lock and do it the slow way
> >
> > Since speculation is at the mercy of whatever happens, you
> > don't expect very much. But you do expect that the load_raw
> > actually reads x, and isn't completely optimized away and
> > replaced by say, 0.
> >
> > So perhaps the place to start is to require that raw loads
> > and stores must at least actually occur (unlike ordinary
> > variables where dead ones can be killed).
> > Except that loads could be killed if it is provable that
> > across all executions of the program, the same value must be read.
> >
> > This sounds sorta like rules for old-style C volatiles?
> >
> > You can probably go a little further by somehow saying that
> > the loads/stores must occur between any surrounding full
> > barriers that might exist. As in, not moving across lock boundaries.
> >
> > How much more do you need?
> >
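Doug's speculation pattern maps naturally onto what later became C++11
atomics; a sketch, in which load_raw is approximated by a relaxed load
and cas by compare_exchange_strong (the lock-based slow path and the
transform f() are illustrative, not part of Doug's original):

```cpp
#include <atomic>
#include <mutex>

std::atomic<int> x{0};
std::mutex slow_path_lock;       // illustrative slow-path lock

int f(int v) { return v + 1; }   // hypothetical transform of x

void update() {
    // Speculative fast path: the "raw" (relaxed) load must actually
    // read x -- it cannot be optimized away and replaced by, say, 0.
    int rawx = x.load(std::memory_order_relaxed);
    int nextx = f(rawx);
    if (x.compare_exchange_strong(rawx, nextx))
        return;                  // speculation succeeded
    // Speculation failed: get a lock and do it the slow way.
    std::lock_guard<std::mutex> guard(slow_path_lock);
    x.store(f(x.load()), std::memory_order_relaxed);
}
```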
> The major concern is that the unordered atomic equivalent of
>
> Thread 1: r1 = x; y = 1
> Thread 2: r2 = y; x = 1
>
> should allow r1 = r2 = 1, while
>
> Thread 1: r1 = x; y = r1
> Thread 2: r2 = y; x = r2
>
> should not, since that would admit out-of-thin-air values, which we may
> not want to allow.
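For concreteness, Hans's first litmus test can be written out using
(what later became) C++11 relaxed atomics; a sketch, with the helper
run_once() added for illustration:

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hans's first example with unordered (relaxed) atomics:
//   Thread 1: r1 = x; y = 1
//   Thread 2: r2 = y; x = 1
// Relaxed ordering permits even the outcome r1 == r2 == 1.
std::atomic<int> x{0}, y{0};

std::pair<int, int> run_once() {
    int r1 = 0, r2 = 0;
    std::thread t1([&] {
        r1 = x.load(std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        r2 = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    // Any of (0,0), (0,1), (1,0), (1,1) is a legal outcome.
    return {r1, r2};
}
```

The second example differs only in that each thread stores the value it
just loaded, which is what raises the out-of-thin-air question.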
>
> As Peter points out, it's not 100% clear that we want to prohibit "out
> of thin air" values. It would still be nice to be able to prove that
> f() below returns 0, no matter what foo does.
>
> int f() {
>     int x = 0;
>     int *y = foo();
>
>     *y = 13;
>     return x;
> }
>
> I think currently that's violated only for implementations of foo() that
> exhibit undefined semantics. If we allow out of thin air values, foo()
> could return &x, and legitimate programs could return nonzero from f(),
> which argues that some tools might have to consider the possibility.
> Presumably, none of this could happen in a real implementation, which
> argues that this would be a weird behavior to standardize.
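Hans's f() compiles as written once some foo() is supplied; under any
foo() with defined behavior (the stand-in below, which returns a pointer
to a file-scope variable, is illustrative), f() must return 0:

```cpp
static int g;                  // a well-behaved foo() returns a pointer
int *foo() { return &g; }      // that does not alias f()'s locals

int f() {
    int x = 0;
    int *y = foo();
    *y = 13;                   // cannot reach x without undefined behavior
    return x;                  // so f() always returns 0
}
```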
If foo() makes use of primitives similar to gcc's (admittedly nonstandard)
__builtin_frame_address() primitive, then foo() might well be able to
affect the value of x, and thus the value returned from f(). I would
certainly hope that people would avoid doing this sort of thing except
for debugging, but you did ask...
FWIW, here is a quick summary of the ordering approach taken by the
Linux kernel.
There are two types of directives:
o   Compiler control, done either through the "barrier()" directive
    or via the "memory" option to a gcc asm directive.

o   CPU control, via explicit memory-barrier directives:

    o   smp_mb() and mb() are full memory barriers.

    o   smp_rmb() and rmb() are memory barriers that
        are only guaranteed to segregate reads.

    o   smp_wmb() and wmb() are memory barriers that
        are only guaranteed to segregate writes.

    The "smp_" prefix causes code to be generated only in
    an SMP build.  So, for example, smp_wmb() would be used
    to order a pair of writes making up some sort of
    synchronization primitive, while wmb() would be used
    to order MMIO writes in device drivers.

    This distinction is due to the fact that a single CPU
    will always see its own cached references to normal
    memory in program order, while reordering can happen
    (even on a single-CPU system) from the viewpoint of
    an I/O device.

(Note that this is not a complete list, but covers
the most commonly used primitives.)
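The primitives above can be sketched with gcc extensions; this is a
simplification, since the real kernel definitions are per-architecture
macros (CONFIG_SMP and the publish() example's variable names are
illustrative):

```cpp
// Simplified sketches of the kernel primitives; the real versions are
// per-architecture.  __sync_synchronize() is gcc's builtin full
// memory barrier.
#define barrier() __asm__ __volatile__("" ::: "memory")  /* compiler only */
#define mb()      __sync_synchronize()                   /* full CPU barrier */

#ifdef CONFIG_SMP
#  define smp_wmb() mb()      /* real kernels use a cheaper write barrier */
#else
#  define smp_wmb() barrier() /* UP build: compiler barrier suffices */
#endif

// The classic use of a write barrier: publish initialized data.
int payload;
int published;

void publish(void) {
    payload = 42;
    smp_wmb();        // order the payload store before the flag store
    published = 1;    // a reader pairing this with smp_rmb() then sees
                      // payload == 42 whenever it sees published != 0
}
```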
The CPU-control directives imply compiler-control directives, since
there is little point in forcing ordering at the CPU if one has not
also forced ordering in the compiler. However, it -is- useful to
force ordering at the compiler only, for example, when writing code
that interacts with an interrupt handler.
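The compiler-only case can be sketched as follows; the variable names
and the interrupt-handler scenario are illustrative:

```cpp
// On a single CPU, code racing with its own interrupt handler needs
// only barrier(): the CPU sees its own accesses in program order, so
// no CPU memory barrier is required, but the compiler must still be
// forbidden from reordering or deferring the stores.
#define barrier() __asm__ __volatile__("" ::: "memory")

volatile int irq_flag;   // examined by a (hypothetical) interrupt handler
int scratch;             // data the handler consumes

void prepare_for_irq(void) {
    scratch = 1;
    barrier();           // compiler may not sink the scratch store
                         // below this point
    irq_flag = 1;        // a handler seeing irq_flag == 1 may read scratch
}
```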
Most uses of both the compiler- and SMP-CPU-control directives are
"buried" in higher-level primitives. For example:
    Primitive    Uses
    ---------    ----
    barrier():    564
    smp_mb():     171
    smp_rmb():     87
    smp_wmb():    150
The unconditional CPU-control directives are used more heavily,
because the many hundreds of device drivers in the Linux kernel
must make use of them:
    Primitive    Uses
    ---------    ----
    mb():        2480
    rmb():        168
    wmb():        587
These numbers might seem large, but there are more than 40,000 instances
of locking primitives in the Linux kernel, -not- counting cases where
a locking primitive use is "wrappered" in a small function or macro.
Applications that don't interact with either interrupts or signal
handlers might have less need for the raw CPU-control primitives.
Thanx, Paul