[cpp-threads] std::atomic<> in acquire-release mode and write atomicity

Wed Dec 17 19:05:09 GMT 2008

On Wed, Dec 17, 2008 at 6:47 PM, Paul E. McKenney
<paulmck at linux.vnet.ibm.com> wrote:
[...]
>> Now let's consider the following example:
>>
>> P1: Y.store(1, release);
>> P2: Z.store(1, release); fence(); a = Y.load(acquire);
>> P3: b = Y.load(acquire); c = Z.load(acquire);
>
> Why make the two stores "release" rather than "relaxed", given that
> there is nothing preceding them to be released?

Because we are in release-acquire mode for atomic<>. (See the thread's
subject. ;-) )

Under an RCtso model (e.g. Itanium), release stores provide write
atomicity, so the outcome a=0, b=1, c=0, for example, is not allowed.

>
>> a=0, b=1, c=0 violates write atomicity.
>
> Or does this outcome indicate that some reordering has occurred?

Well, it indicates that P1's store is first performed with respect to
P3, then P3 performs its two loads (resulting in b=1 and c=0), then P2
performs its store and load (resulting in a=0), and only afterwards is
P1's store performed with respect to P2 (IOW P1's store is non-atomic).
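
To make the ordering argument concrete, here is a minimal sketch that enumerates every interleaving of the five operations against a single shared memory, with each processor's program order preserved. This is only a toy model of "write-atomic, program-ordered" execution, not the C++ memory model and not Power or x86; all names in it (Kind, State, explore, sc_outcomes) are hypothetical and introduced just for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <tuple>
#include <vector>

// The five operations of the example (names are illustrative only).
enum class Kind { StoreY, StoreZ, LoadYtoA, LoadYtoB, LoadZtoC };

using Outcome = std::tuple<int, int, int>;  // (a, b, c)

struct State { int Y = 0, Z = 0, a = 0, b = 0, c = 0; };

// Execute one operation against the single shared memory.
static void apply(State& s, Kind k) {
    switch (k) {
        case Kind::StoreY:   s.Y = 1;   break;
        case Kind::StoreZ:   s.Z = 1;   break;
        case Kind::LoadYtoA: s.a = s.Y; break;
        case Kind::LoadYtoB: s.b = s.Y; break;
        case Kind::LoadZtoC: s.c = s.Z; break;
    }
}

using Program = std::array<std::vector<Kind>, 3>;

// Depth-first walk over all interleavings that respect program order.
static void explore(const Program& prog, std::array<std::size_t, 3> pc,
                    State s, std::set<Outcome>& out) {
    bool done = true;
    for (std::size_t t = 0; t < prog.size(); ++t) {
        if (pc[t] < prog[t].size()) {
            done = false;
            State next = s;
            apply(next, prog[t][pc[t]]);
            std::array<std::size_t, 3> npc = pc;
            ++npc[t];
            explore(prog, npc, next, out);
        }
    }
    if (done) out.insert(Outcome(s.a, s.b, s.c));
}

// All (a, b, c) results reachable in this toy model.
std::set<Outcome> sc_outcomes() {
    const Program prog = {{
        {Kind::StoreY},                    // P1
        {Kind::StoreZ, Kind::LoadYtoA},    // P2 (fence is implicit here)
        {Kind::LoadYtoB, Kind::LoadZtoC},  // P3
    }};
    std::array<std::size_t, 3> pc{{0, 0, 0}};
    std::set<Outcome> out;
    explore(prog, pc, State{}, out);
    assert(!out.empty());  // at least one interleaving always completes
    return out;
}
```

In this model sc_outcomes() never contains (0, 1, 0): a=0 puts P2's load before P1's store, b=1 puts P1's store before P3's first load, and c=0 puts P3's second load before P2's store, which together with program order forms a cycle. So the outcome needs either reordering within a processor or a store that becomes visible to different processors at different times.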

> Such reordering would not be surprising, given that you are not
> using SC operations throughout.
>
> Anyway, assuming that "fence();" is an SC fence, since you do not state
> otherwise:

Well, it's meant to be a store-load fence.

>
> P2's fence compiles on PowerPC to a hwsync, so that P2's store to Z
> is performed before P2's load from Y WRT each processor.  The fact that
> a==0 means that P2's load from Y was performed before P1's store to Y
> WRT P2.
>
> The fact that b=1 means that P3's load from Y was performed after P1's
> store to Y WRT P3.  P3's acquire-load from Y implies a "bc;isync"
> between its pair of loads, so that these two loads are performed in
> order with respect to each processor.
>
> But I don't see anything preventing the outcome you show.  If you wanted
> that outcome to be prohibited, you would need an SC fence between P3's
> two loads.

How is this reflected in the standard?

Note that this might interest x86 folks as well:

http://www.cl.cam.ac.uk/~pes20/weakmemory/popl09.pdf

"2.12 Fences

The x86 also includes fence instructions, or memory barriers,
LFENCE, SFENCE, and MFENCE. For the coherent write-back fragment
that we consider, without non-temporal operations, we understand
the first two to be (perhaps costly) no-ops. For MFENCE, the
documentation is less clear, and so we did not include it in our
HOL model. IWP [22] does not mention it, while the Intel SDM [5,
vol.2A,p3-624;vol.3A,§7.2.5] is ambiguous (the text could be read
as asserting that MFENCEs of different processors are globally
serialised). The most conservative (weakest) plausible semantics
is that an MFENCE simply ensures that program order is preserved
around it, preventing the reordering of a load before a store
that we saw in Test iwp2.3.a/amd4 of §2.4. Test amd5 (like
iwp2.3.a/amd4 but with an MFENCE after each store) confirms this
holds on AMD64, and informal discussion suggests it also does on
Intel 64/IA-32. It would be easy to add this to the model,
strengthening the preserved program order definition of §2.4.
AMD64 [4] includes one final test, amd10 (an analogue of
iwp2.3.1/amd4) which shows a strictly stronger semantics, but
just how much stronger is unclear (consider, for example,
analogues of Test amd6 in Fig. 2 with one or more MFENCEs)."

>
> I don't see that the C++ memory model prohibits this outcome, either.
>
>> We can also change it to
>>
>> P1: Y.store(1, seq_cst);
>> P2: Z.store(1, seq_cst); fence(); a = Y.load(acquire);
>> P3: b = Y.load(acquire); c = Z.load(acquire);
>>
>> and that would still violate write atomicity, I'm afraid.
>
> Translating "violate write atomicity" to "permit reordering", I would
> agree, given that you did not introduce SC into P3's operations.  ;-)
>
> Introducing SC to P1 is a no-op, because P1 doesn't do any other
> operations, so the ordering cannot have any effect.  Introducing SC to
> P2 is also a no-op, since the pre-existing fence already enforced SC.

This may be so on Power, but consider, for example, the x86 mapping:

http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html

XCHG is said to provide "write atomicity" in the latest Intel and AMD specs.

>
>> What do you think about that?
>
> I think that if you want SC, you will need to convert all your atomic
> operations to SC atomic operations.  I think further that if you choose
> to use some non-SC atomic operations, you might well observe some non-SC
> behavior.
>
> But perhaps I am misunderstanding your question.  Do you have a use
> case for the above sequence of operations that I am failing to
> appreciate?

I'm just trying to grok the degree of write atomicity provided by
atomic<> in release-acquire mode according to the standard.

>
>> BTW, "Store Seq Cst" in the mapping doesn't look strong enough to
>> me... how about "lwsync; st; hwsync"?
>
> I am not seeing this.  Could you provide an example where some combination
> of the PowerPC SC operations fails to behave in an SC fashion?

I meant something along the lines of

P1: Y.store(1, seq_cst);
P2: Z.store(1, seq_cst); a = Y.load(relaxed);
P3: b = Y.load(relaxed); fence(seq_cst); c = Z.load(relaxed);

To me, seq_cst stores should be fully fenced (providing a trailing fence as well).
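
For anyone who wants to try this on real hardware, the three-processor example just above could be written roughly as follows. This is a sketch under assumptions: it uses the std::atomic spelling, takes fence(seq_cst) to mean std::atomic_thread_fence(std::memory_order_seq_cst), and assumes the outcome of interest is again a=0, b=1, c=0. The names Outcome, run_once, is_non_sc and count_non_sc are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Outcome { int a; int b; int c; };

// One run of the three-thread example; Y and Z start at 0.
Outcome run_once() {
    std::atomic<int> Y(0), Z(0);
    Outcome r = {-1, -1, -1};

    std::thread p1([&] { Y.store(1, std::memory_order_seq_cst); });
    std::thread p2([&] {
        Z.store(1, std::memory_order_seq_cst);
        r.a = Y.load(std::memory_order_relaxed);
    });
    std::thread p3([&] {
        r.b = Y.load(std::memory_order_relaxed);
        // Assumed spelling of fence(seq_cst) from the example.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r.c = Z.load(std::memory_order_relaxed);
    });

    // join() orders the writes to r.a, r.b, r.c before the return.
    p1.join();
    p2.join();
    p3.join();
    return r;
}

// The outcome assumed to be the one in question: a=0, b=1, c=0.
bool is_non_sc(const Outcome& o) {
    return o.a == 0 && o.b == 1 && o.c == 0;
}

// Repeat the test and count how often that outcome shows up.
int count_non_sc(int runs) {
    assert(runs >= 0);
    int hits = 0;
    for (int i = 0; i < runs; ++i) {
        if (is_non_sc(run_once())) ++hits;
    }
    return hits;
}
```

A count of zero from count_non_sc proves nothing: thread start-up cost dwarfs the race window, so a serious test would keep the threads alive and synchronize each iteration. The harness only shows the shape of the test, and a nonzero count on a given machine would demonstrate that its mapping lets the outcome through.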

regards,
alexander.