[cpp-threads] std::atomic<> in acquire-release mode and write atomicity

Tue Dec 16 18:05:40 GMT 2008

On Tue, Dec 16, 2008 at 06:12:06PM +0100, Alexander Terekhov wrote:
> On Tue, Dec 16, 2008 at 5:56 PM, Paul E. McKenney
> <paulmck at linux.vnet.ibm.com> wrote:
> [...]
> >> P1: Y.store(1, release);
> >> P2: if( Y.load(acquire) == 1 ) { Z.store(1, release); }
> >> P3: if( Z.load(acquire) == 1 ) { assert( Y.load(acquire) == 1 ); }
> >
> > Well, that does put a different light on it.  ;-)
> >
> > OK, P1's release has no effect, as there is no prior operation.
> >
> > P2's store-release to Z ensures that P2's load from Y is performed
> > WRT all other threads before P2's store to Z.  A-cumulativity
> > ensures that if P2's load from Y sees P1's store, then P1's store
> > to Y is performed WRT all threads before P2's store to Z.  These
> > are both stores, and hence are "applicable" to a store-release,
> > which on PowerPC turns into an lwsync instruction.
> >
> > If P3's load from Z sees P2's store, then by B-cumulativity, P3's
> > load from Y is P2's lwsync's B-set -except- that a prior store and
> > following load is not "applicable" in the case of lwsync.  So, let's
> > turn to P3's acquire operation, which becomes a conditional-branch/isync
> > combination.  This means that P3's load from Z is performed before
> > P3's load from Y WRT all threads.  Because P3's load from Z returned
> > 1, it must have been performed after P2's store to Z WRT P3.
> >
> > But P1's store to Y was performed WRT all threads before P2's
> > store to Z, as noted earlier.  Therefore, the assert is required to
> > see the new value of Y, and thus cannot fail.  I think, anyway.
> >
> > Hey, you asked!!!  ;-)
> 
> Are you in disagreement with your own paper? ;-)

What?  Haven't you heard of quantum computing???  ;-)

Seriously, you found an error in n2745.html.  But Peter Dimov beat you
to it a few months ago, see below.

> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
> 
> "... permits the following counter-intuitive sequence of events, with
> all variables initially zero, and results of loads in square brackets
> following the load:
> 
> 1. CPU 0: x=1
> 2. CPU 1: r1=x [1]
> 3. CPU 1: lwsync
> 4. CPU 1: y=1
> 5. CPU 2: r2=y [1]
> 6. CPU 2: bc;isync
> 7. CPU 2: r3=x [0]

This was indeed one of the bugs that was fixed in the newer version,
which instead reads as follows:

   1. CPU 0: x=1
   2. CPU 1: r1=x [1]
   3. CPU 1: bc;isync
   4. CPU 1: y=1
   5. CPU 2: r2=y [1]
   6. CPU 2: bc;isync
   7. CPU 2: r3=x [0] 

The "lwsync" on CPU 1 has been replaced by a "bc;isync" sequence.
This removes cumulativity from the picture, so that CPU 0's "x=1"
is no longer guaranteed to be performed WRT all threads before
CPU 1's "y=1".

As noted in an earlier email, the new version may be found at:

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.09.20a.html

> This sequence of events is more likely to occur on systems where CPUs
> 0 and 1 are closely related, for example, when CPUs 0 and 1 are
> hardware threads in one core and CPU 2 is a hardware thread in another
> core.
> 
> [...]
> 
> Cumulativity does not come into play here because prior stores (CPU
> 0's store to x) and subsequent loads (CPU 2's load from x) are not
> applicable to the lwsync instruction. However, if the lwsync were to
> be replaced with a hwsync, the outcome shows above would be
> impossible. "

In any case, good eyes!!!  I clearly need to send you more stuff for
review!  ;-)

						Thanx, Paul