[cpp-threads] Slightly revised memory model proposal (D2300)

Wed Jun 20 01:35:09 BST 2007

Thinking about this yet again, I think there may be another way out of
this, by outlawing the flickering in all cases, even with relaxed
operations on both sides.  That also avoids the synchronization
elimination issues.

This violates my earlier argument that

p = &x;
q = &x;
...
while (...) {
  // Loop does not store to potentially shared variables.
  r1 = load_relaxed(p); // loads *p, i.e. x
  r2 = load_relaxed(q); // loads *q, i.e. x
}

should allow load_relaxed operations to be reordered.

Effectively we would be prohibiting compiler reordering of atomic
operations, if the affected objects potentially aliased, even if both
operations were loads.  Since the reordering is prevented only if the
same location is affected, and cache coherence guarantees that at the
hardware level anyway, I think this is purely a compiler reordering
restriction.  The only cases in which I can think of this mattering are
some numerical methods that are oblivious to races on floating point
data, and hence use lots of atomic operations.  But I suspect the
compiler's alias analysis is typically quite good in those cases.
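
To make "flickering" concrete, here is a minimal C++ sketch.  It is not
part of the proposal: the function name and the representation of the
modification order as a vector are assumptions made purely for
illustration, and it assumes every store to x writes a distinct value.
It reports flickering when a later read in one thread observes a store
that comes earlier in the modification order of x than a store already
seen by a previous read in that thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical checker, for illustration only.
// modification_order: the stores to x, in the single order all threads agree on.
// observed: the values one thread read from x, in program order.
// Precondition (assumed): stored values are distinct and every observed
// value appears in modification_order.
bool flickers(const std::vector<int>& modification_order,
              const std::vector<int>& observed) {
    std::size_t last_pos = 0;
    for (int value : observed) {
        auto it = std::find(modification_order.begin(),
                            modification_order.end(), value);
        assert(it != modification_order.end());
        std::size_t pos =
            static_cast<std::size_t>(it - modification_order.begin());
        // A read that goes backwards in the modification order is a flicker.
        if (pos < last_pos) return true;
        last_pos = pos;
    }
    return false;
}
```

For the three-thread example quoted below, the observations (1, 2, 1)
flicker whichever of the two stores comes first in the modification
order, whereas (1, 1, 2) is consistent with the order 1, 2.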

Opinions?

Hans 

________________________________

	From: cpp-threads-bounces at decadentplace.org.uk
[mailto:cpp-threads-bounces at decadentplace.org.uk] On Behalf Of Boehm,
Hans
	Sent: Tuesday, June 19, 2007 4:21 PM
	To: C++ threads standardisation
	Subject: RE: [cpp-threads] Slightly revised memory model
proposal (D2300)

________________________________

		From:  Raul Silvera

		Hans wrote on 06/15/2007 12:09:39 PM:

		> Unfortunately, I think I posted some misinformation here, with
		> respect to flickering.  I believe the version of the example
		> that I posted:
		> 
		> > > Thread 1:
		> > > store_relaxed(&x, 1);
		> > >
		> > > Thread 2:
		> > > store_relaxed(&x, 2);
		> > >
		> > > Thread 3:
		> > > r1 = load_acquire(&x); (1)
		> > > r2 = load_acquire(&x); (2)
		> > > r3 = load_acquire(&x); (1)
		> > >
		> is already allowed to flicker under the D2300 rules.  And
		> looking back at Sarita's example, weakening this doesn't seem
		> to help.  (The example that we should really have been
		> discussing would have had release stores.  That's the one
		> that's currently constrained by the modification order rule.
		> And having that flicker does seem dubious.)

		I find this very troubling. From T3's point of view, it is
		just doing acquire operations, and it is not expecting any
		flickering, regardless of which stores are going to satisfy
		its loads.

	I was almost going to agree with you, and try to change this.
But this again runs into synchronization elimination issues, which seem
central here.  If the "acquire"s in thread 3 mean anything without a
matching "release" then, by similar reasoning,

	r1 = load_relaxed(&x);
	r2 = load_relaxed(&x);
	r3 = load_relaxed(&x);

	can allow different outcomes from

	r1 = load_relaxed(&x);
	fetch_and_add_acq_rel(&dead1, 0);
	r2 = load_relaxed(&x);
	fetch_and_add_acq_rel(&dead2, 0);
	r3 = load_relaxed(&x);

	which means that the dead fetch_and_adds can't be eliminated,
which is very unfortunate.  It also means that I can't ever eliminate
locks after thread inlining without understanding the whole program.
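
	The two sequences above can be sketched in C++ roughly as follows.
This is only an illustration: it assumes a std::atomic-style interface
standing in for load_relaxed and fetch_and_add_acq_rel, and the Reads
struct and function names are made up.  The optimisation at stake is
rewriting the second function into the first, which is only valid if
the memory model allows exactly the same outcomes for both.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> x{0};
std::atomic<int> dead1{0};  // assumed never accessed by any other thread
std::atomic<int> dead2{0};  // likewise

struct Reads { int r1, r2, r3; };  // hypothetical result holder

// Three relaxed loads of x with nothing in between.
Reads loads_only() {
    Reads r;
    r.r1 = x.load(std::memory_order_relaxed);
    r.r2 = x.load(std::memory_order_relaxed);
    r.r3 = x.load(std::memory_order_relaxed);
    return r;
}

// The same loads separated by "dead" acq_rel read-modify-writes that add 0.
Reads loads_with_dead_rmw() {
    Reads r;
    r.r1 = x.load(std::memory_order_relaxed);
    dead1.fetch_add(0, std::memory_order_acq_rel);
    r.r2 = x.load(std::memory_order_relaxed);
    dead2.fetch_add(0, std::memory_order_acq_rel);
    r.r3 = x.load(std::memory_order_relaxed);
    return r;
}

// Single-threaded sanity check: with no concurrent writer, both versions
// must return the same values.
void check_single_threaded_agreement() {
    x.store(7, std::memory_order_relaxed);
    Reads a = loads_only();
    Reads b = loads_with_dead_rmw();
    assert(a.r1 == b.r1 && a.r2 == b.r2 && a.r3 == b.r3);
}
```

The single-threaded check cannot tell the two versions apart; the
question in this thread is whether concurrent stores to x could.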

	I'm more and more inclined to do what Sarita was advocating
anyway, which is to switch to a more conventional formulation of the
memory model in which happens-before is just the transitive closure of
the union of sequenced-before and synchronizes-with.  That makes it
clearer that acquire and release only provide any guarantees if they
occur in pairs.
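
	As a rough sketch of that formulation (an illustration only, not
proposed wording), the C++ below computes happens-before over a small
finite set of events as the transitive closure of the union of the two
relations.  The boolean-matrix representation and the names Relation,
empty_relation and happens_before are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// rel[a][b] is true when event a is related to event b.
using Relation = std::vector<std::vector<bool>>;

// Hypothetical helper: an n-by-n relation with no edges.
Relation empty_relation(std::size_t n) {
    return Relation(n, std::vector<bool>(n, false));
}

// happens-before = transitive closure of
// (sequenced-before union synchronizes-with), via Warshall's algorithm.
Relation happens_before(const Relation& sequenced_before,
                        const Relation& synchronizes_with) {
    const std::size_t n = sequenced_before.size();
    assert(synchronizes_with.size() == n);
    Relation hb = empty_relation(n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            hb[i][j] = sequenced_before[i][j] || synchronizes_with[i][j];
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (hb[i][k] && hb[k][j]) hb[i][j] = true;
    return hb;
}
```

Take events 0 and 1 in one thread (a data store, then a release store
to a flag) and events 2 and 3 in another (an acquire load of the flag,
then a data load).  With sequenced-before edges 0->1 and 2->3 and a
synchronizes-with edge 1->2, event 0 happens before event 3.  Drop the
synchronizes-with edge, as when the acquire has no matching release,
and no cross-thread ordering remains.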

	(The last proposal has another similar synchronization
elimination issue with the "precedes" relation, which includes
happens-before, but not sequenced-before.  I think we can also get rid
of that by moving back to a more conventional, Java-like, happens-before
model.)

	My general feeling is that if we have a trade-off between
synchronization elimination and more expressive low-level atomics,
synchronization elimination should win, since it affects lock-based user
code, which is bound to make up a much larger body of code than
low-level atomics clients.

	And although I also find this a bit troubling, I'm still having
a lot of trouble constructing a case in which this matters.

	Hans
