[cpp-threads] std::atomic<> in acquire-release mode and write atomicity

Wed Dec 17 16:27:40 GMT 2008

On Tue, Dec 16, 2008 at 7:05 PM, Paul E. McKenney
<paulmck at linux.vnet.ibm.com> wrote:
> On Tue, Dec 16, 2008 at 06:12:06PM +0100, Alexander Terekhov wrote:
>> On Tue, Dec 16, 2008 at 5:56 PM, Paul E. McKenney
>> <paulmck at linux.vnet.ibm.com> wrote:
>> [...]
>> >> P1: Y.store(1, release);
>> >> P2: if( Y.load(acquire) == 1 ) { Z.store(1, release); }
>> >> P3: if( Z.load(acquire) == 1 ) { assert( Y.load(acquire) == 1 ); }
>> >
>> > Well, that does put a different light on it.  ;-)
>> >
>> > OK, P1's release has no effect, as there is no prior operation.
>> >
>> > P2's store-release to Z ensures that P2's load from Y is performed
>> > WRT all other threads before P2's store to Z.  A-cumulativity
>> > ensures that if P2's load from Y sees P1's store, then P1's store
>> > to Y is performed WRT all threads before P2's store to Z.  These
>> > are both stores, and hence are "applicable" to a store-release,
>> > which on PowerPC turns into an lwsync instruction.
>> >
>> > If P3's load from Z sees P2's store, then by B-cumulativity, P3's
>> > load from Y is P2's lwsync's B-set -except- that a prior store and
>> > following load is not "applicable" in the case of lwsync.  So, let's
>> > turn to P3's acquire operation, which becomes a conditional-branch/isync
>> > combination.  This means that P3's load from Z is performed before
>> > P3's load from Y WRT all threads.  Because P3's load from Z returned
>> > 1, it must have been performed after P2's store to Z WRT P3.
>> >
>> > But P1's store to Y was performed WRT all threads before P2's
>> > store to Z, as noted earlier.  Therefore, the assert is required to
>> > see the new value of Y, and thus cannot fail.  I think, anyway.
>> >
>> > Hey, you asked!!!  ;-)
>>
>> Are you in disagreement with your own paper? ;-)
>
> What?  Haven't you heard of quantum computing???  ;-)
>
> Seriously, you found an error in n2745.html.  But Peter Dimov beat you
> to it a few months ago, see below.
>
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
>>
>> "... permits the following counter-intuitive sequence of events, with
>> all variables initially zero, and results of loads in square brackets
>> following the load:
>>
>> 1. CPU 0: x=1
>> 2. CPU 1: r1=x [1]
>> 3. CPU 1: lwsync
>> 4. CPU 1: y=1
>> 5. CPU 2: r2=y [1]
>> 6. CPU 2: bc;isync
>> 7. CPU 2: r3=x [0]
>
> This was indeed one of the bugs that was fixed in the newer version,
> which instead reads as follows:
>
>   1. CPU 0: x=1
>   2. CPU 1: r1=x [1]
>   3. CPU 1: bc;isync
>   4. CPU 1: y=1
>   5. CPU 2: r2=y [1]
>   6. CPU 2: bc;isync
>   7. CPU 2: r3=x [0]
>
> The "lwsync" on CPU 1 has been replaced by a "bc;isync" sequence.
> This removes cumulativity from the picture, so that CPU 0's "x=1"
> is no longer guaranteed to be performed WRT all threads before
> CPU 1's "y=1".
>
> As noted in an earlier email, the new version may be found at:
>
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.09.20a.html

Okay, the newest version

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.12.16a.html

states:

       This might seem at first glance to prohibit the above code
       sequence, however, please keep in mind that CPU 1's isync is not
       a cumulative memory barrier, and therefore does not guarantee
       that CPUs that see CPU 1's store to y will necessarily see
       CPU 0's store to x. Note that it is sufficient to replace
       CPU 1's bc;isync with an lwsync in order to prevent CPU 2's
       load from x from returning zero, but the proof cannot rely
       on B-cumulativity. B-cumulativity will not come into play
       here because prior stores (CPU 0's store to x) and subsequent
       loads (CPU 2's load from x) are not applicable to the lwsync
       instruction. Instead, we note that CPU 0's store to x will be
       performed before CPU 1's store to y with respect to all CPUs,
       due to CPU 1's lwsync instruction.  Furthermore, CPU 2's bc;isync
       instruction ensures that CPU 2's load from y is performed before
       CPU 2's load from x with respect to all CPUs.  Therefore, if
       CPU 2's load from y returns 1 (the new value), then CPU 2's
       subsequent load from x must also return 1 (the new value),
       given that the store to x was performed before the store to y.

       However, if CPU 1's bc;isync were to be replaced with a hwsync,
       the proof can rely on simple B-cumulativity, because prior stores
       and subsequent loads are applicable in the case of hwsync.

Now let's consider the following example:

P1: Y.store(1, release);
P2: Z.store(1, release); fence(); a = Y.load(acquire);
P3: b = Y.load(acquire); c = Z.load(acquire);

a=0, b=1, c=0 violates write atomicity.

We can also change it to

P1: Y.store(1, seq_cst);
P2: Z.store(1, seq_cst); fence(); a = Y.load(acquire);
P3: b = Y.load(acquire); c = Z.load(acquire);

and that would still violate write atomicity, I'm afraid.

What do you think about that?

BTW, "Store Seq Cst" in the  mapping doesn't look strong enough to
me... how about "lwsync; st; hwsync"?

TIA.

regards,
alexander.