[cpp-threads] std::atomic<> in acquire-release mode and write atomicity

Paul E. McKenney paulmck at linux.vnet.ibm.com
Wed Dec 17 17:47:43 GMT 2008


On Wed, Dec 17, 2008 at 05:27:40PM +0100, Alexander Terekhov wrote:
> On Tue, Dec 16, 2008 at 7:05 PM, Paul E. McKenney
> <paulmck at linux.vnet.ibm.com> wrote:
> > On Tue, Dec 16, 2008 at 06:12:06PM +0100, Alexander Terekhov wrote:
> >> On Tue, Dec 16, 2008 at 5:56 PM, Paul E. McKenney
> >> <paulmck at linux.vnet.ibm.com> wrote:
> >> [...]
> >> >> P1: Y.store(1, release);
> >> >> P2: if( Y.load(acquire) == 1 ) { Z.store(1, release); }
> >> >> P3: if( Z.load(acquire) == 1 ) { assert( Y.load(acquire) == 1 ); }
> >> >
> >> > Well, that does put a different light on it.  ;-)
> >> >
> >> > OK, P1's release has no effect, as there is no prior operation.
> >> >
> >> > P2's store-release to Z ensures that P2's load from Y is performed
> >> > WRT all other threads before P2's store to Z.  A-cumulativity
> >> > ensures that if P2's load from Y sees P1's store, then P1's store
> >> > to Y is performed WRT all threads before P2's store to Z.  These
> >> > are both stores, and hence are "applicable" to a store-release,
> >> > which on PowerPC turns into an lwsync instruction.
> >> >
> >> > If P3's load from Z sees P2's store, then by B-cumulativity, P3's
> >> > load from Y is in P2's lwsync's B-set -except- that a prior store and
> >> > following load are not "applicable" in the case of lwsync.  So, let's
> >> > turn to P3's acquire operation, which becomes a conditional-branch/isync
> >> > combination.  This means that P3's load from Z is performed before
> >> > P3's load from Y WRT all threads.  Because P3's load from Z returned
> >> > 1, it must have been performed after P2's store to Z WRT P3.
> >> >
> >> > But P1's store to Y was performed WRT all threads before P2's
> >> > store to Z, as noted earlier.  Therefore, the assert is required to
> >> > see the new value of Y, and thus cannot fail.  I think, anyway.
> >> >
> >> > Hey, you asked!!!  ;-)
> >>
> >> Are you in disagreement with your own paper? ;-)
> >
> > What?  Haven't you heard of quantum computing???  ;-)
> >
> > Seriously, you found an error in n2745.html.  But Peter Dimov beat you
> > to it a few months ago, see below.
> >
> >> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
> >>
> >> "... permits the following counter-intuitive sequence of events, with
> >> all variables initially zero, and results of loads in square brackets
> >> following the load:
> >>
> >> 1. CPU 0: x=1
> >> 2. CPU 1: r1=x [1]
> >> 3. CPU 1: lwsync
> >> 4. CPU 1: y=1
> >> 5. CPU 2: r2=y [1]
> >> 6. CPU 2: bc;isync
> >> 7. CPU 2: r3=x [0]
> >
> > This was indeed one of the bugs that was fixed in the newer version,
> > which instead reads as follows:
> >
> >   1. CPU 0: x=1
> >   2. CPU 1: r1=x [1]
> >   3. CPU 1: bc;isync
> >   4. CPU 1: y=1
> >   5. CPU 2: r2=y [1]
> >   6. CPU 2: bc;isync
> >   7. CPU 2: r3=x [0]
> >
> > The "lwsync" on CPU 1 has been replaced by a "bc;isync" sequence.
> > This removes cumulativity from the picture, so that CPU 0's "x=1"
> > is no longer guaranteed to be performed WRT all threads before
> > CPU 1's "y=1".
> >
> > As noted in an earlier email, the new version may be found at:
> >
> > http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.09.20a.html
> 
> Okay, the newest version
> 
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.12.16a.html
> 
> states:
> 
>        This might seem at first glance to prohibit the above code
>        sequence, however, please keep in mind that CPU 1's isync is not
>        a cumulative memory barrier, and therefore does not guarantee
>        that CPUs that see CPU 1's store to y will necessarily see
>        CPU 0's store to x. Note that it is sufficient to replace
>        CPU 1's bc;isync with an lwsync in order to prevent CPU 2's
>        load from x from returning zero, but the proof cannot rely
>        on B-cumulativity. B-cumulativity will not come into play
>        here because prior stores (CPU 0's store to x) and subsequent
>        loads (CPU 2's load from x) are not applicable to the lwsync
>        instruction. Instead, we note that CPU 0's store to x will be
>        performed before CPU 1's store to y with respect to all CPUs,
>        due to CPU 1's lwsync instruction.  Furthermore, CPU 2's bc;isync
>        instruction ensures that CPU 2's load from y is performed before
>        CPU 2's load from x with respect to all CPUs.  Therefore, if
>        CPU 2's load from y returns 1 (the new value), then CPU 2's
>        subsequent load from x must also return 1 (the new value),
>        given that the store to x was performed before the store to y.
> 
>        However, if CPU 1's bc;isync were to be replaced with a hwsync,
>        the proof can rely on simple B-cumulativity, because prior stores
>        and subsequent loads are applicable in the case of hwsync.
> 
> Now let's consider the following example:
> 
> P1: Y.store(1, release);
> P2: Z.store(1, release); fence(); a = Y.load(acquire);
> P3: b = Y.load(acquire); c = Z.load(acquire);

Why make the two stores "release" rather than "relaxed", given that
there is nothing preceding them to be released?
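
In other words, something like the following relaxed variant should
be equivalent here (C++0x spellings, untested sketch; one function
per thread, with function names of my own choosing):

    #include <atomic>
    #include <thread>
    using namespace std;

    atomic<int> Y(0), Z(0);
    int a, b, c;

    // P1: nothing precedes the store, so relaxed should suffice.
    void p1() { Y.store(1, memory_order_relaxed); }

    // P2: the SC fence, not the store's release semantics, is what
    // orders the store to Z before the load from Y.
    void p2() {
        Z.store(1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        a = Y.load(memory_order_acquire);
    }

    // P3: the pair of acquire loads, exactly as in your example.
    void p3() {
        b = Y.load(memory_order_acquire);
        c = Z.load(memory_order_acquire);
    }

    int main() {
        thread t1(p1), t2(p2), t3(p3);
        t1.join(); t2.join(); t3.join();
        return 0;
    }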

> a=0, b=1, c=0 violates write atomicity.

Or does this outcome simply indicate that some reordering has occurred?
Such reordering would not be surprising, given that you are not
using SC operations throughout.

Anyway, assuming that "fence();" is an SC fence, since you do not state
otherwise:

P2's fence compiles on PowerPC to a hwsync, so that P2's store to Z
is performed before P2's load from Y WRT each processor.  The fact that
a==0 means that P2's load from Y was performed before P1's store to Y
WRT P2.

The fact that b==1 means that P3's load from Y was performed after
P1's store to Y WRT P3.  P3's acquire-load from Y implies a "bc;isync"
between its pair of loads, so that these two loads are performed in
order with respect to each processor.

But I don't see anything preventing the outcome you show.  If you wanted
that outcome to be prohibited, you would need an SC fence between P3's
two loads.

I don't see that the C++ memory model prohibits this outcome, either.
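
Concretely, the repaired P3 that I have in mind, reusing the
declarations from the sketch above (again untested, and the function
name is just for illustration):

    void p3_fenced() {
        b = Y.load(memory_order_acquire);
        // SC fence between the two loads; on PowerPC this compiles
        // to a hwsync.
        atomic_thread_fence(memory_order_seq_cst);
        c = Z.load(memory_order_acquire);
    }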

> We can also change it to
> 
> P1: Y.store(1, seq_cst);
> P2: Z.store(1, seq_cst); fence(); a = Y.load(acquire);
> P3: b = Y.load(acquire); c = Z.load(acquire);
> 
> and that would still violate write atomicity, I'm afraid.

Translating "violate write atomicity" to "permit reordering", I would
agree, given that you did not introduce SC into P3's operations.  ;-)

Introducing SC to P1 is a no-op, because P1 doesn't do any other
operations, so the ordering cannot have any effect.  Introducing SC to
P2's store is also a no-op, since the pre-existing fence already
enforced SC ordering.

> What do you think about that?

I think that if you want SC, you will need to convert all your atomic
operations to SC atomic operations.  I think further that if you choose
to use some non-SC atomic operations, you might well observe some non-SC
behavior.
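
For example, the all-SC version of your sequence would look as
follows (again reusing the earlier declarations; untested sketch
with illustrative names):

    void p1_sc() { Y.store(1, memory_order_seq_cst); }

    void p2_sc() {
        Z.store(1, memory_order_seq_cst);
        // Once everything is seq_cst, the explicit fence should be
        // redundant here, per the no-op argument above.
        a = Y.load(memory_order_seq_cst);
    }

    void p3_sc() {
        b = Y.load(memory_order_seq_cst);
        c = Z.load(memory_order_seq_cst);
    }

All of these operations then participate in the single total order
over SC operations, which rules out a==0 && b==1 && c==0.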

But perhaps I am misunderstanding your question.  Do you have a use
case for the above sequence of operations that I am failing to
appreciate?

> BTW, "Store Seq Cst" in the  mapping doesn't look strong enough to
> me... how about "lwsync; st; hwsync"?

I am not seeing this.  Could you provide an example where some combination
of the PowerPC SC operations fails to behave in an SC fashion?

						Thanx, Paul

> TIA.
> 
> regards,
> alexander.


