[cpp-threads] Slightly revised memory model proposal (D2300)

Sun Jun 24 21:29:28 BST 2007

Hans wrote on 06/19/2007 08:35:09 PM:

> Thinking about this yet again, I think there may be another way out
> of this, by outlawing the flickering in all cases, even with relaxed
> operations on both sides.  That also avoids the synchronization
> elimination issues.
>
> This violates my earlier argument that
>
> p = &x;
> q = &x;
> ...
> while (...) {
>    // Loop does not store to potentially shared variables.
>   r1 = load_relaxed(p); // loads *p, i.e. x
>   r2 = load_relaxed(q); // loads *q, i.e. x }
>
> should allow load_relaxed operations to be reordered.
>
> Effectively we would be prohibiting compiler reordering of atomic
> operations, if the affected objects potentially aliased, even if
> both operations were loads.  Since the reordering is prevented only
> if the same location is affected, and cache coherence guarantees
> that at the hardware level anyway, I think this is purely a compiler
> reordering restriction.  The only case in which I could think of
> this mattering are for some numerical methods that are oblivious to
> races on floating point data, and hence use lots of atomic
> operations.  But I suspect the compiler's alias analysis is
> typically quite good in those cases.
>
> Opinions?

Even before considering the compiler, I don't think that this can work. The
hardware is free to reorder relaxed loads, even to the same memory
location. Particularly on PPC, there is nothing that would prevent the
second load from being performed ahead of the first, and introduce some
"apparent" flickering. You need the first load to be an acquire to ensure
that they are done in program order. Cache coherence applies to the order
on which the loads are executed, which is not necessarily their program
order.

In any case, preventing reordering of atomic loads by the compiler is
definitely a big hammer that will disable many compiler optimizations. I
don't think this can be justified by saying that atomic operations will not
be frequent enough to matter for performance.

Going back to the original problem of removability of locks, I do not think
that it is an issue, because on your example:

> r1 = load_relaxed(&x); fetch_and_add_acq_rel(&dead1, 0); r2 =
load_relaxed(&x); fetch_and_add_acq_rel(&dead2, 0); r3 = load_relaxed(&x);

the three loads of x should not be ordered wrt each other.

This may seem unintuitive, but it can be derived from the following:

If the two fetch_and_add ops were acquire operations, the three loads of x
would be unordered.
If the two fetch_and_add ops were release operations, the three loads of x
would be unordered.

So, if fetch_and_add_acq_rel is the union of the semantics of acq and rel,
the three loads of x should be unordered.

In particular, on the PPC implementation, for a sequence:

      store_relaxed(&x,1) ; fetch_and_add_acq_rel(&y, val);
load_relaxed(&z);

the store to x and load from z are unordered.

>
> Hans
>
>
>
> From: cpp-threads-bounces at decadentplace.org.uk [mailto:cpp-threads-
> bounces at decadentplace.org.uk] On Behalf Of Boehm, Hans
> Sent: Tuesday, June 19, 2007 4:21 PM
> To: C++ threads standardisation
> Subject: RE: [cpp-threads] Slightly revised memory model proposal (D2300)

> From:  Raul Silvera
> Hans wrote on 06/15/2007 12:09:39 PM:
>
> > Unfortunately, I think I posted some misinformation here, with respect
> > to flickering.  I believe the version of the example that I posted:
> >
> > > > Thread 1:
> > > > store_relaxed(&x, 1);
> > > >
> > > > Thread 2:
> > > > store_relaxed(&x, 2);
> > > >
> > > > Thread 3:
> > > > r1 = load_acquire(&x); (1)
> > > > r2 = load_acquire(&x); (2)
> > > > r3 = load_acquire(&x); (1)
> > > >
> > is already allowed to flicker under the D2300 rules.  And looking back
> > at Sarita's example, weakening this doesn't seem to help.  (The example
> > that we should really have been discussing would have had release
stores.
> > That's the one that's currently constrained by the modification order
> > rule.  And having that flicker does seem dubious.)
>
> I find this very troubling. From T3's point of view, it is just doing
acquire
> operations, and it is not expecting any flickering, regardless of which
stores
> are going to satisfy its loads.
>
> I was almost going to agree with you, and try to change this.  But
> this again runs into synchronization elimination issues, which seem
> central here.  If the "acquire"s in thread 3 mean anything without a
> matching "release" then, by similar reasons,
>
> r1 = load_relaxed(&x); r2 = load_relaxed(&x); r3 = load_relaxed(&x);
>
> can allow different outcomes from
>
> r1 = load_relaxed(&x); fetch_and_add_acq_rel(&dead1, 0); r2 =
> load_relaxed(&x); fetch_and_add_acq_rel(&dead2, 0); r3 =
load_relaxed(&x);
>
> which means that the dead fetch_and_adds can't be eliminated, which
> is very unfortunate.  It also means that I can't ever eliminate
> locks after thread inlining without understanding the whole program.
>
> I'm more and more inclined to do what Sarita was advocating anyway,
> which is to switch to a more conventional formulation of the memory
> model in which happens-before is
> just the transitive closure of the union of sequenced-before and
> synchronizes-with.  That makes it clearer that acquire and release
> only provide any guarantees if they occur in pairs.
>
> (The last proposal has another similar synchronization elimination
> issue with the "precedes" relation, which includes happens-before,
> but not sequenced-before.  I think we can also get rid of by moving
> back to a more conventional, Java-like, happens-before model.)
>
> My general feeling is that if we have a trade-off between
> synchronization elimination and more expressive low-level atomics,
> synchronization elimination should win, since it effects lock-based
> user code, which is bound to make up a much larger body of code than
> low-level atomics clients.
>
> And although I also find this a bit troubling, I'm still having a
> lot of trouble constructing a case in which this matters.
>
> Hans--
> cpp-threads mailing list
> cpp-threads at decadentplace.org.uk
> http://www.decadentplace.org.uk/cgi-bin/mailman/listinfo/cpp-threads

--
Raúl E. Silvera         IBM Toronto Lab   Team Lead, Toronto Portable
Optimizer (TPO)
Tel: 905-413-4188 T/L: 969-4188           Fax: 905-413-4854
D2/KC9/8200/MKM