[cpp-threads] Slightly revised memory model proposal (D2300)

Paul E. McKenney paulmck at linux.vnet.ibm.com
Wed Jun 20 03:30:16 BST 2007


On Tue, Jun 19, 2007 at 11:20:36PM -0000, Boehm, Hans wrote:
> ________________________________
> 
> 	From:  Raul Silvera
> 	
> 	Hans wrote on 06/15/2007 12:09:39 PM:
> 	
> 	> Unfortunately, I think I posted some misinformation here, with
> 	> respect to flickering.  I believe the version of the example
> 	> that I posted:
> 	> 
> 	> > > Thread 1:
> 	> > > store_relaxed(&x, 1);
> 	> > >
> 	> > > Thread 2:
> 	> > > store_relaxed(&x, 2);
> 	> > >
> 	> > > Thread 3:
> 	> > > r1 = load_acquire(&x); (1)
> 	> > > r2 = load_acquire(&x); (2)
> 	> > > r3 = load_acquire(&x); (1)
> 	> > >
> 	> is already allowed to flicker under the D2300 rules.  And
> 	> looking back at Sarita's example, weakening this doesn't seem
> 	> to help.  (The example that we should really have been
> 	> discussing would have had release stores.  That's the one
> 	> that's currently constrained by the modification order rule.
> 	> And having that flicker does seem dubious.)
> 	
> 	I find this very troubling.  From T3's point of view, it is just
> 	doing acquire operations, and it is not expecting any flickering,
> 	regardless of which stores are going to satisfy its loads.
> 	 
> 
> I was almost going to agree with you, and try to change this.  But this
> again runs into synchronization elimination issues, which seem central
> here.  If the "acquire"s in thread 3 mean anything without a matching
> "release" then, by similar reasons,
>  
> r1 = load_relaxed(&x);
> r2 = load_relaxed(&x);
> r3 = load_relaxed(&x);
>  
> can allow different outcomes from
>  
> r1 = load_relaxed(&x);
> fetch_and_add_acq_rel(&dead1, 0);
> r2 = load_relaxed(&x);
> fetch_and_add_acq_rel(&dead2, 0);
> r3 = load_relaxed(&x);
>  
> which means that the dead fetch_and_adds can't be eliminated, which is
> very unfortunate.  It also means that I can't ever eliminate locks after
> thread inlining without understanding the whole program.

However, the dead fetch_and_add_acq_rel() statements could be downgraded
to the acq_rel_memory_fence() primitive proposed in N2262.  As for thread
inlining, if the compiler could see the full scope of the relevant
variables, then at the very least, sequences of lock-unlock pairs could
also be downgraded to a single memory fence, correct?  Similar visibility
is required in the first example in order to determine that the two
variables are in fact dead.
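
To make the intended downgrade concrete, here is a rough sketch, written
in the spellings already used in this thread, so the exact names should
be taken as placeholders rather than final syntax:

	/* Before: dead read-modify-writes, retained only for their ordering. */
	r1 = load_relaxed(&x);
	fetch_and_add_acq_rel(&dead1, 0);
	r2 = load_relaxed(&x);
	fetch_and_add_acq_rel(&dead2, 0);
	r3 = load_relaxed(&x);

	/* After: the dead updates are gone, but their local ordering effect
	   is preserved by the standalone fences proposed in N2262. */
	r1 = load_relaxed(&x);
	acq_rel_memory_fence();
	r2 = load_relaxed(&x);
	acq_rel_memory_fence();
	r3 = load_relaxed(&x);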

But is the concern here that someone might be leveraging the ordering
implied by a pair of lock-based critical sections?  If so, another
approach would be to drop that guarantee.  For example:

	store_relaxed(&x, 1);
	spin_lock(&lock1);
	...
	spin_unlock(&lock1);
	spin_lock(&lock2);
	...
	spin_unlock(&lock2);
	store_relaxed(&y, 1);

If the ordering of the stores to x and y is not guaranteed, then
there would be much more freedom to eliminate the locks.
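
As a rough sketch of that added freedom, assuming that after thread
inlining the elided critical sections turn out to touch only data that
no other thread can observe, the whole sequence could then legally be
reduced to:

	store_relaxed(&x, 1);
	/* Both critical sections eliminated; with no ordering guarantee
	   between the two stores, nothing observable is lost. */
	store_relaxed(&y, 1);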

(Not yet sure that the cure is better than the disease, but seems
worth considering.)

> I'm more and more inclined to do what Sarita was advocating anyway,
> which is to switch to a more conventional formulation of the memory
> model in which happens-before is
> just the transitive closure of the union of sequenced-before and
> synchronizes-with.  That makes it clearer that acquire and release only
> provide any guarantees if they occur in pairs.

I would have to see the new proposal.  Certainly eliminating the
current happens-before properties of _acquire and _release makes N2262
fences even more critically needed for existing software.  And we
definitely need something similar to the previous D2300's semantics for
acquire_memory_fence(), release_memory_fence(), acq_rel_memory_fence(),
and ordered_memory_fence().
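
As one concrete example of the sort of usage that depends on those fence
semantics, here is the usual message-passing idiom written with raw
fences.  This is only a sketch in the spellings used above, assuming the
D2300-style guarantee that a release fence before a relaxed store pairs
with an acquire fence following the corresponding relaxed load; the
variables data and flag are purely for illustration:

	Thread 1:
	store_relaxed(&data, 42);
	release_memory_fence();
	store_relaxed(&flag, 1);

	Thread 2:
	while (load_relaxed(&flag) == 0)
		continue;
	acquire_memory_fence();
	r1 = load_relaxed(&data);  /* guaranteed to see 42 */

A fair amount of existing code written against raw hardware fences
depends on exactly this sort of guarantee.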

> (The last proposal has another similar synchronization elimination issue
> with the "precedes" relation, which includes happens-before, but not
> sequenced-before.  I think we can also get rid of this by moving back to a
> more conventional, Java-like, happens-before model.)
>  
> My general feeling is that if we have a trade-off between
> synchronization elimination and more expressive low-level atomics,
> synchronization elimination should win, since it affects lock-based user
> code, which is bound to make up a much larger body of code than
> low-level atomics clients.

First, we do need to look at the situations themselves: some sort of
low-level atomics is genuinely needed, and it is easy to imagine them
being effectively eliminated by a series of examples, each leveraging
the above blanket statement.  Second, the comparison should not be
between the number of uses of low-level atomics (which is not
insignificant in existing parallel C/C++ code) and all lock-based code,
but rather between low-level atomics and the lock-based code for which
inlining could reasonably be expected to happen.

> And although I also find this a bit troubling, I'm still having a lot of
> trouble constructing a case in which this matters.

But please keep in mind that weakening the low-level atomics will drive
people to continue using raw fences.

						Thanx, Paul

> Hans




