[cpp-threads] seq_cst compare_exchange and store-load fencing

Fri Jan 2 20:07:55 GMT 2009

On Fri, Jan 02, 2009 at 08:38:23PM +0100, Alexander Terekhov wrote:
> < [] and BOLD annotations added to quoted text >
> 
> On Fri, Jan 2, 2009 at 6:55 PM, Paul E. McKenney
> <paulmck at linux.vnet.ibm.com> wrote:
> > On Fri, Jan 02, 2009 at 05:23:23PM +0100, Alexander Terekhov wrote:
> >> compare_exchange performs both load and (conditional) store. This
> >> leads to questions regarding store-load fencing for compare_exchange
> >> in seq_cst mode:
> >>
> >> Q1) Does it provide store-load fencing in the case of
> >>
> >>    A.store(relaxed|release) ... B.compare_exchange(..., seq_cst)
> >>
> >> regarding A's store and B's load (in either success or failure case of
> >> B's compare_exchange)?
> >
> > The proposed Power implementation provides this, but by accident.
> > I do not believe that this is required.  Now, if you do:
> >
> >    A.store(seq_cst) ... B.compare_exchange(..., seq_cst)
> >
> > Alternatively, place an atomic_thread_fence(seq_cst) between the
> > relaxed/release fence and the compare_exchange.
> 
> You probably meant:
> 
> "Alternatively, place an atomic_thread_fence(seq_cst) between the
> relaxed/release STORE [not fence] and the [relaxed] compare_exchange."

Indeed I did!  Good catch!!!
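
For concreteness, here is a minimal sketch of the corrected Q1 pattern
as a store-buffering litmus test.  The names q1_trial_both_zero and
q1_count_violations are hypothetical, and the second thread is my own
addition to give the first one something to race against.  As I read
the fence rules, the two seq_cst fences should forbid the outcome in
which both threads read zero, so the count should stay at zero.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One trial: returns true if the outcome the fences are meant to forbid
// (both threads read 0) was observed.
bool q1_trial_both_zero() {
    std::atomic<int> A{0}, B{0};
    int seen_b = -1, seen_a = -1;

    std::thread t1([&] {
        A.store(1, std::memory_order_release);
        // Fence between the relaxed/release STORE and the compare_exchange.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        int expected = 0;
        // On success expected stays 0; on failure it receives B's value.
        // Either way it ends up holding the value the operation loaded.
        B.compare_exchange_strong(expected, 2, std::memory_order_relaxed);
        seen_b = expected;
    });
    std::thread t2([&] {
        B.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        seen_a = A.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return seen_b == 0 && seen_a == 0;
}

// Repeats the trial and counts how often the forbidden outcome appears.
int q1_count_violations(int trials) {
    assert(trials > 0);
    int violations = 0;
    for (int i = 0; i < trials; ++i) {
        if (q1_trial_both_zero()) ++violations;
    }
    return violations;
}
```

A run such as q1_count_violations(500) cannot prove the guarantee, but
a nonzero result would indicate a broken implementation or a mistake
in my reading.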

> > Then the proposed standard would guarantee the ordering.
> >
> >> Q2) Does it provide store-load fencing in the case of
> >>
> >>    B.compare_exchange(..., seq_cst) ... C.load(relaxed|acquire)
> >>
> >> regarding B's store and C's load (in success case of B's compare_exchange)?
> >
> > The proposed Power implementation provides a weak form of ordering
> > in this case, but again, only by accident.
> 
> By "weak form of ordering in this case" you probably meant NON
> sequentially consistent ordering in this case as in:
> 
> (From Book II):
> 
> "A successful stwcx. to a given location may complete
> before its store has been performed with respect to
> other processors and mechanisms."
> 
> Right?

The "bc;isync" ensures that the "stwcx." completes, but not the
corresponding store.

But again, if you want sequential consistency, you need to use seq_cst
throughout.  Failing to use seq_cst throughout, you should not expect
sequentially consistent behavior.

> > To be guaranteed this [sequentially consistent] ordering
> >
> >    B.compare_exchange(..., seq_cst) ... C.load(seq_cst)
> >
> > As before, another approach is to place an atomic_thread_fence(seq_cst)
> > between the relaxed/release fence and the compare_exchange.
> 
> You probably meant:
> 
> "another approach is to place an atomic_thread_fence(seq_cst) between
> [relaxed] compare_exchange and the relaxed/acquire load."

Yep.  However you do it, if you want sequential consistency, you must
consistently use seq_cst operations.
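
The Q2 variant can be sketched the same way: a relaxed compare_exchange,
then atomic_thread_fence(seq_cst), then the relaxed load.  The names
q2_trial_both_zero and q2_count_violations are hypothetical, and the
second thread is again an assumption of mine, added only to make the
store-load ordering observable.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One trial: returns true if the compare_exchange succeeded and yet
// neither thread saw the other's store.
bool q2_trial_both_zero() {
    std::atomic<int> B{0}, C{0};
    int seen_c = -1, seen_b = -1;
    bool cas_ok = false;

    std::thread t1([&] {
        int expected = 0;
        // Nobody else writes B, so the strong form always succeeds here.
        cas_ok = B.compare_exchange_strong(expected, 1,
                                           std::memory_order_relaxed);
        // Fence between the compare_exchange and the relaxed/acquire load.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        seen_c = C.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        C.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        seen_b = B.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return cas_ok && seen_c == 0 && seen_b == 0;
}

// Repeats the trial and counts how often the forbidden outcome appears.
int q2_count_violations(int trials) {
    assert(trials > 0);
    int violations = 0;
    for (int i = 0; i < trials; ++i) {
        if (q2_trial_both_zero()) ++violations;
    }
    return violations;
}
```

Dropping either fence leaves only relaxed operations, and then, as I
read the model, both threads reading zero becomes a permitted outcome.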

> >> Under simple interpretation of "seq_cst" meaning "fully-fenced" the
> >> answer to both questions is "yes"...
> >
> > But that is not the definition of "seq_cst" in the proposed standard,
> > at least not as I read it.
> >
> >> Do you agree with the same outcome under the proposed C/C++ memory model?
> >>
> >> What is your reasoning in case you disagree?
> >
> > I appeal to the wording of section 29.1 of the proposed standard:
> >
> >        The enumeration memory_order specifies the detailed regular
> >        (non-atomic) memory synchronization order as defined in Clause
> >        1.10 and may provide for operation ordering.  Its enumerated
> >        values and their meanings are as follows:
> >
> >            — memory_order_relaxed: no operation orders memory.
> >            — memory_order_release, memory_order_acq_rel, and
> >              memory_order_seq_cst: a store operation performs a release
> >              operation on the affected memory location.
> >            — memory_order_consume: a load operation performs a consume
> >              operation on the affected memory location.
> >            — memory_order_acquire, memory_order_acq_rel, and
> >              memory_order_seq_cst: a load operation performs an acquire
> >              operation on the affected memory location.
> >
> >        There shall be a single total order S on all memory_order_seq_cst
> >        operations, consistent with the happens before order and
> >        modification orders for all affected locations, such that each
> >        memory_order_seq_cst operation that loads a value observes either
> >        the last preceding modification according to this order S, or
> >        the result of an operation that is not memory_order_seq_cst. [
> >        Note: Although it is not explicitly required that S include locks,
> >        it can always be extended to an order that does include lock and
> >        unlock operations, since the ordering between those is already
> >        included in the happens before ordering. — end note ]
> >
> > None of this requires that seq_cst operations be ordered with respect to
> > non-seq_cst operations except as required by acquire, consume, and
> > release semantics.
> 
> IOW, seq_cst means acq_rel (with further reduction to relaxed) except
> that it guarantees store-load fencing with respect to preceding and/or
> subsequent seq_cst, and only seq_cst... right?

I would argue that seq_cst means acq for loads, rel for stores, acq_rel
for RMW atomics, sequentially consistent interaction with seq_cst fences,
and a global total order for all seq_cst operations.
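
As an illustration of just the "acq for loads, rel for stores" part,
here is a minimal message-passing sketch.  The function name
publish_and_consume and the value 42 are hypothetical.  The seq_cst
store acts as a release and the seq_cst load as an acquire, so the
plain write to payload should be visible once the flag is seen.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Publishes a non-atomic payload through a seq_cst flag and returns the
// value the consumer thread observed.
int publish_and_consume() {
    int payload = 0;              // ordinary, non-atomic data
    std::atomic<int> ready{0};
    int observed = -1;

    std::thread producer([&] {
        payload = 42;
        // seq_cst store: performs a release operation on ready.
        ready.store(1, std::memory_order_seq_cst);
    });
    std::thread consumer([&] {
        // seq_cst load: performs an acquire operation on ready.
        while (ready.load(std::memory_order_seq_cst) == 0) {
            std::this_thread::yield();
        }
        // The release/acquire pair orders the payload write before this read.
        observed = payload;
    });
    producer.join();
    consumer.join();
    assert(observed != -1);       // the consumer always assigns observed
    return observed;
}
```

Nothing here relies on the total order S.  The same code would work
with memory_order_release and memory_order_acquire, which is the point:
outside of S and the seq_cst fences, seq_cst adds nothing beyond that.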

> Formalities regarding distinguishing
> 
> atomic_thread_fence(seq_cst), X.load(relaxed|acquire);
> 
> vs.
> 
> X.load(seq_cst);
> 
> why not state it in a more prominent place like 1.10 instead of 1K
> pages below it? ;-)

Because you have to go 1K pages below it even to find out about the
memory_order enumeration.  ;-)

> > Now I personally have no objection to making seq_cst operations more
> > expensive, but others might.  ;-)
> 
> I suspect that not making seq_cst operations more expensive (according
> to simple "fully-fenced" reasoning) will result in quite a lot of
> incorrect code.
> 
> We'll see.

Pretty simple rule: "If you want seq_cst operations to be guaranteed
ordered WRT non-seq_cst operations, use atomic_thread_fence(seq_cst)
to segregate the operations."

							Thanx, Paul