[cpp-threads] Alternatives to SC

Tue Jan 16 17:46:41 GMT 2007

On Tue, Jan 16, 2007 at 05:53:32PM +0100, Alexander Terekhov wrote:
> On 1/16/07, Raul Silvera <rauls at ca.ibm.com> wrote:
> [...]
> >> > Well, lwarx-stwcx-on-P3 version aside for a moment, at least one
> >> > "cumulative" barrier is needed on P2/P3 (Book II shows two) I think.
> >
> >On this particular example, cumulativity is only needed on P2 (lwsync
> 
> P2: if (x == 1) lwsync(), y = 2;
> 
> Load x is in "A" and store y = 2 is in "B" (A performs before B...
> note that {lw}sync() is NOT needed for only that ordering because it
> is provided by control dependency).
> 
> >provides a cumulative ordering). Basically you need P1's x=1 store to be
> >ordered wrt P3's y=2.
> 
> You mean that we need P1's x=1 store with respect to P3 to be ordered
> before P3's load x after P3 sees P2's y=2 store. Right?
> 
> So that P1's x=1 store will be in "A" and P3's load x will be in "B"...
> 
> (quoting and editing the definition of "cumulative" barrier from Book II)
> 
> "The ordering done by a memory barrier is said to be "cumulative" if
> it also orders storage accesses that are performed by processors and
> mechanisms other than P2, as follows.
> 
> * A includes all applicable storage accesses by any such processor or
> mechanism that have been performed with respect to P2 before the
> memory barrier is created. (A includes P1's store x=1 with respect to
> any other processor... i.e. including P3)
> 
> * B includes all applicable storage accesses by any such processor or
> mechanism that are performed after a Load instruction executed by that
> processor or mechanism has returned the value stored by a store that
> is in B. (B includes P3's load x performed after P3's load y has
> returned value 2 stored by P2.)
> 
> Correct?
> 
> >Cumulativity is not needed on P3: the order of the
> 
> Okay, agreed.
> 
> But given that for "[s]tores to storage that is Memory Coherence
> Required and is neither Write Through Required nor Caching Inhibited"
> eieio is also cumulative...
> 
> P1: x = 1;
> P2: if (x == 1) eieio(), y = 2;
> P3: if (y == 2) isync(), assert(x == 1);
> 
> Would also do it, I suppose. ("Storage that is Write Through Required
> or Caching Inhibited is not intended to be used for general-purpose
> programming.")
> 
> Now, when are we going to have totally naked cumulative fence (a beast
> without extra features of eieio, {lw}sync, or whatever) for
> general-purpose programming on Power? ;-)

This discussion is giving me a much greater appreciation for the
Linux-kernel approach of defining a few lowest-common-denominator
memory-barrier primitives for use across all platforms.  ;-)

For POWER, the definitions are as follows:

#define mb()   __asm__ __volatile__ ("sync" : : : "memory")
#define rmb()  __asm__ __volatile__ (__stringify(LWSYNC) : : : "memory")
#define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
#define read_barrier_depends()  do { } while(0)
#define barrier() __asm__ __volatile__("": : :"memory")

The "__stringify(LWSYNC)" expands to either "sync" or "lwsync", depending
on whether the kernel is being compiled only for newer platforms or
whether it must also run on older platforms that do not have the "lwsync"
instruction.  The "memory" clobber disables compiler optimizations
that might move memory references across the memory-barrier primitive.
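
As a rough illustration of the stringification step, here is a user-space
sketch (not the kernel source).  STRINGIFY stands in for the kernel's
__stringify, and HYPOTHETICAL_OLD_CPUS is an invented configuration symbol,
since the real option name is not given here.  rmb_instruction() is likewise
a made-up helper that just exposes the resulting string.

```c
#include <assert.h>
#include <string.h>

/* Two-level expansion: the argument is macro-expanded first, and only
 * then turned into a string literal.  Stand-in for __stringify. */
#define STRINGIFY_1(x) #x
#define STRINGIFY(x)   STRINGIFY_1(x)

/* HYPOTHETICAL_OLD_CPUS is an assumption made for this sketch. */
#ifdef HYPOTHETICAL_OLD_CPUS
#define LWSYNC sync     /* must also run where lwsync does not exist */
#else
#define LWSYNC lwsync   /* newer platforms only */
#endif

/* Returns the instruction mnemonic that rmb() would place in its asm. */
static const char *rmb_instruction(void)
{
    const char *insn = STRINGIFY(LWSYNC);
    assert(strlen(insn) > 0);   /* LWSYNC must expand to something */
    return insn;
}
```

Without the second macro level, the result would be the literal string
"LWSYNC" rather than the instruction name.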

If the kernel is compiled with SMP support, we have the following:

#define smp_mb()	mb()
#define smp_rmb()	rmb()
#define smp_wmb()	__asm__ __volatile__ ("eieio" : : : "memory")
#define smp_read_barrier_depends()	read_barrier_depends()

If the kernel is instead compiled to run only on uniprocessors:

#define smp_mb()	barrier()
#define smp_rmb()	barrier()
#define smp_wmb()	barrier()
#define smp_read_barrier_depends()	do { } while(0)
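
To make the SMP/UP selection concrete, here is a portable sketch that
assumes a GCC-compatible compiler.  The C11 fences are only stand-ins for
the POWER instructions above, HYPOTHETICAL_SMP is an invented symbol, and
publish()/consume() are made-up helpers showing the usual smp_wmb()/smp_rmb()
pairing.

```c
#include <assert.h>
#include <stdatomic.h>

/* Compiler-only barrier, same form as barrier() above. */
#define barrier() __asm__ __volatile__("" : : : "memory")

#ifdef HYPOTHETICAL_SMP            /* assumption: stands in for the SMP config */
#define smp_mb()  atomic_thread_fence(memory_order_seq_cst)
#define smp_rmb() atomic_thread_fence(memory_order_acquire)
#define smp_wmb() atomic_thread_fence(memory_order_release)
#else                              /* uniprocessor: only the compiler is constrained */
#define smp_mb()  barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
#endif

static atomic_int payload, ready;  /* both start at 0 */

/* Writer: store the data, then the flag, with a write barrier between. */
static void publish(int value)
{
    assert(value >= 0);            /* -1 is reserved for "not ready" below */
    atomic_store_explicit(&payload, value, memory_order_relaxed);
    smp_wmb();
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: check the flag, then read the data, with a read barrier between. */
static int consume(void)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return -1;
    smp_rmb();
    return atomic_load_explicit(&payload, memory_order_relaxed);
}
```

Calling consume() before publish() returns -1; after publish(42) it
returns 42.  The barriers only matter when the two run on different CPUs.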

There is no architecture-independent provision for control dependencies
implying memory barriers.
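
So portable code spells out the barriers rather than leaning on the
control dependency.  A sketch of the three-processor example in that style,
with C11 fences as stand-ins for smp_mb() and smp_rmb() (my assumption, not
the kernel's implementation), and with p1()/p2()/p3() as made-up names for
the per-processor code:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int x, y;   /* both start at 0 */

/* P1: x = 1; */
static void p1(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
}

/* P2: if (x == 1) { smp_mb(); y = 2; } */
static void p2(void)
{
    if (atomic_load_explicit(&x, memory_order_relaxed) == 1) {
        atomic_thread_fence(memory_order_seq_cst);   /* stand-in for smp_mb() */
        atomic_store_explicit(&y, 2, memory_order_relaxed);
    }
}

/* P3: if (y == 2) { smp_rmb(); assert(x == 1); }
 * Returns -1 if y == 2 has not been observed yet. */
static int p3(void)
{
    if (atomic_load_explicit(&y, memory_order_relaxed) != 2)
        return -1;
    atomic_thread_fence(memory_order_acquire);       /* stand-in for smp_rmb() */
    int seen = atomic_load_explicit(&x, memory_order_relaxed);
    assert(seen == 1);
    return seen;
}
```

Run back to back on one thread (p1, p2, p3) this is trivially ordered.
The fences are there for the case where the three run concurrently.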

						Thanx, Paul