[cpp-threads] Slightly revised memory model proposal (D2300)

Boehm, Hans hans.boehm at hp.com
Mon Jun 25 20:57:55 BST 2007


Metacomment: We submitted N2300, the concurrency memory model paper, more or less as it was posted here, though with some more editorial changes suggested by Clark.  It should appear on the committee site (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/) shortly.  I'm still aiming for consensus on something similar to it by the Toronto meeting.  Thus it would be good to resolve issues like the one Raul brought up, ideally mostly before the meeting.

> -----Original Message-----
> From:  Raul Silvera
> 
> Hans wrote on 06/19/2007 08:35:09 PM:
> 
> > Thinking about this yet again, I think there may be another way out of
> > this, by outlawing the flickering in all cases, even with relaxed
> > operations on both sides.  That also avoids the synchronization 
> > elimination issues.
> >
> > This violates my earlier argument that
> >
> > p = &x;
> > q = &x;
> > ...
> > while (...) {
> >    // Loop does not store to potentially shared variables.
> >   r1 = load_relaxed(p); // loads *p, i.e. x
> >   r2 = load_relaxed(q); // loads *q, i.e. x
> > }
> >
> > should allow load_relaxed operations to be reordered.
> >
> > Effectively we would be prohibiting compiler reordering of atomic
> > operations, if the affected objects potentially aliased, even if both
> > operations were loads.  Since the reordering is prevented only if the
> > same location is affected, and cache coherence guarantees that at the
> > hardware level anyway, I think this is purely a compiler reordering
> > restriction.  The only cases in which I could think of this mattering
> > are some numerical methods that are oblivious to races on floating
> > point data, and hence use lots of atomic operations.  But I suspect
> > the compiler's alias analysis is typically quite good in those cases.
> >
> > Opinions?
> 
> Even before considering the compiler, I don't think that this 
> can work. The hardware is free to reorder relaxed loads, even 
> to the same memory location. Particularly on PPC, there is 
> nothing that would prevent the second load from being 
> performed ahead of the first, and introduce some "apparent" 
> flickering. You need the first load to be an acquire to 
> ensure that they are done in program order. Cache coherence 
> applies to the order on which the loads are executed, which 
> is not necessarily their program order.
I'm having trouble reconciling this with section 1.3.6 in PowerPC Book II, version 2.01.  Can you help?

It states, for example: "That is, a processor or mechanism can
never load a "newer" value first and then, later, load
an "older" value."

I just spent a bit of time staring at the Itanium memory model paper again, and I believe that it cannot happen there.  (As seems to be the usual case, the argument is a bit more subtle than I would like.)

My general mental model, which I think the Itanium paper makes fairly explicit, is that writes to a single location become visible in the order in which processors obtain exclusive ownership of the cache line.  A write may become locally visible earlier because it is locally buffered.  But in that case the local processor will consistently see the local write until it becomes globally visible to everyone.  In this way, even the local processors will see writes in an order that's consistent with the global one.

> 
> In any case, preventing reordering of atomic loads by the 
> compiler is definitely a big hammer that will disable many 
> compiler optimizations. I don't think this can be justified 
> by saying that atomic operations will not be frequent enough 
> to matter for performance.
Remember that we're talking about compiler reordering of atomic loads from (at least potentially) the same memory location.  Do you have an example where that might matter?

> 
> Going back to the original problem of removability of locks, 
> I do not think that it is an issue, because on your example:
> 
> > r1 = load_relaxed(&x); fetch_and_add_acq_rel(&dead1, 0);
> > r2 = load_relaxed(&x); fetch_and_add_acq_rel(&dead2, 0);
> > r3 = load_relaxed(&x);
> 
> the three loads of x should not be ordered wrt each other.
In the current formulation, they are indeed no longer ordered in any real sense.  Each happens before the next, because they are sequenced, and hence they appear ordered to that thread, but there is no way for another thread to observe the order.  And the behavior is no different from what it would be without the fetch_and_add operations.  All of which I think is correct.

The concern dealt with a version of the model that we've at least tentatively abandoned.  We should reconsider it if we have to go back to something closer to that model.

Hans

> 
> This may seem unintuitive, but it can be derived from the following:
> 
> If the two fetch_and_add ops were acquire operations, the 
> three loads of x would be unordered.
> If the two fetch_and_add ops were release operations, the 
> three loads of x would be unordered.
> 
> So, if fetch_and_add_acq_rel is the union of the semantics of 
> acq and rel, the three loads of x should be unordered.
> 
> In particular, on the PPC implementation, for a sequence:
> 
>       store_relaxed(&x, 1); fetch_and_add_acq_rel(&y, val); load_relaxed(&z);
> 
> the store to x and load from z are unordered.
> 
> >
> > Hans
> >
> >
> >
> > From: cpp-threads-bounces at decadentplace.org.uk [mailto:cpp-threads- 
> > bounces at decadentplace.org.uk] On Behalf Of Boehm, Hans
> > Sent: Tuesday, June 19, 2007 4:21 PM
> > To: C++ threads standardisation
> > Subject: RE: [cpp-threads] Slightly revised memory model proposal 
> > (D2300)
> 
> > From:  Raul Silvera
> > Hans wrote on 06/15/2007 12:09:39 PM:
> >
> > > Unfortunately, I think I posted some misinformation here, with
> > > respect to flickering.  I believe the version of the example that
> > > I posted:
> > >
> > > > > Thread 1:
> > > > > store_relaxed(&x, 1);
> > > > >
> > > > > Thread 2:
> > > > > store_relaxed(&x, 2);
> > > > >
> > > > > Thread 3:
> > > > > r1 = load_acquire(&x); (1)
> > > > > r2 = load_acquire(&x); (2)
> > > > > r3 = load_acquire(&x); (1)
> > > > >
> > > is already allowed to flicker under the D2300 rules.  And looking
> > > back at Sarita's example, weakening this doesn't seem to help.  (The
> > > example that we should really have been discussing would have had
> > > release stores.  That's the one that's currently constrained by the
> > > modification order rule.  And having that flicker does seem dubious.)
> >
> > I find this very troubling.  From T3's point of view, it is just doing
> > acquire operations, and it is not expecting any flickering, regardless
> > of which stores are going to satisfy its loads.
> >
> > I was almost going to agree with you, and try to change this.  But 
> > this again runs into synchronization elimination issues, which seem 
> > central here.  If the "acquire"s in thread 3 mean anything without a
> > matching "release" then, by similar reasons,
> >
> > r1 = load_relaxed(&x); r2 = load_relaxed(&x); r3 = load_relaxed(&x);
> >
> > can allow different outcomes from
> >
> > r1 = load_relaxed(&x); fetch_and_add_acq_rel(&dead1, 0);
> > r2 = load_relaxed(&x); fetch_and_add_acq_rel(&dead2, 0);
> > r3 = load_relaxed(&x);
> >
> > which means that the dead fetch_and_adds can't be eliminated, which is
> > very unfortunate.  It also means that I can't ever eliminate locks 
> > after thread inlining without understanding the whole program.
> >
> > I'm more and more inclined to do what Sarita was advocating anyway, 
> > which is to switch to a more conventional formulation of the memory 
> > model in which happens-before is just the transitive closure of the 
> > union of sequenced-before and synchronizes-with.  That makes it 
> > clearer that acquire and release only provide any guarantees if they
> > occur in pairs.
> >
> > (The last proposal has another similar synchronization elimination 
> > issue with the "precedes" relation, which includes happens-before, but
> > not sequenced-before.  I think we can also get rid of this by moving back
> > to a more conventional, Java-like, happens-before model.)
> >
> > My general feeling is that if we have a trade-off between 
> > synchronization elimination and more expressive low-level atomics, 
> > synchronization elimination should win, since it affects lock-based
> > user code, which is bound to make up a much larger body of code than
> > low-level atomics clients.
> >
> > And although I also find this a bit troubling, I'm still having a lot
> > of trouble constructing a case in which this matters.
> >
> > Hans
> >
> > --
> > cpp-threads mailing list
> > cpp-threads at decadentplace.org.uk
> > http://www.decadentplace.org.uk/cgi-bin/mailman/listinfo/cpp-threads
> 
> 
> 
> --
> Raúl E. Silvera         IBM Toronto Lab   Team Lead, Toronto Portable
> Optimizer (TPO)
> Tel: 905-413-4188 T/L: 969-4188           Fax: 905-413-4854
> D2/KC9/8200/MKM
> 
> 
> 


