[cpp-threads] Yet another visibility question

Boehm, Hans hans.boehm at hp.com
Tue Nov 21 22:04:54 GMT 2006


I still don't fully agree with Lawrence's arguments, but the more I look
at this, the more I'm inclined to agree with Lawrence's and Peter's
conclusion.  Details below:

> -----Original Message-----
> From: Lawrence Crowl
> [Hans:]
> > I don't really agree with Lawrence that this is unusable.  The
> > rules are fairly clear, and we all agree that non-wizards should
> > use fetchadd_full here anyway.
> 
> I fail to see the use of fetch_add_release if it does not 
> release its writes to all subsequent readers.  If the 
> behavior of a third thread can prevent two threads from 
> communicating, we have a problem.  In that case, one must 
> rely on other synchronization, and we could as well have used 
> a totally unsynchronized fetchadd.
Since fetch_add both reads and writes a variable, the normal uses of the
different variants are:

fetch_add_full: The default that should normally be used in examples
such as the one we're discussing here.

fetch_add_acquire, fetch_add_release: Makes sense if fetch_add is used
to implement lock/unlock operations of some flavor.  Each
fetch_add_release value is read by a fetch_add_acquire, and thus each
lock holder's memory operations become visible to the next lock holder.
Everything behaves as expected.  You can get away without the "full"
variant because we're in a special case in which the operations occur in
pairs.

fetch_add_raw: Useful for atomically updating counters that will only be
read after the threads are joined.

I think these uses are all fairly clear and make sense.  We're really
discussing an (admittedly important) borderline case in which you might
be able to get away with fetch_add_release, but in my mind,
fetch_add_full is the natural construct.

In a multithreaded process, a third thread can always mess up
communication between the other two if it has access to one of their
shared variables.  So that doesn't particularly bother me either.
> 
> > It seems to be slightly harder to specify the version in which all
> > the edges are introduced.  For fetch_add, there is clearly a total
> > order among all the operations on z.  In general, we currently do
> > not assume a total order on stores at this level, so the notion of
> > "all previous" is not well-defined.  But I think that's fixable,
> > especially if someone can come up with a strong reason for doing so.
> 
> In this case, no total order is required because the 
> (non-atomic) writes are to different variables.  (That is, 
> there is no race.)
There is no race, but you are assuming an order on the updates to the
atomic z.  And for fetch_add, that seems very reasonable.  For ordinary
stores it may well be more reasonable than the current proposal, but it
is not what it currently says.  The following example may shed light on
both the difficulty, and why we might want to address it anyway:

Consider (v atomic, x and y ordinary, all assignments to/from v are
really load_acquire or store_release operations, everything initially
zero):

[These are really two different examples wrapped into one.]

Thread 1:
x = 1;
v = 1;

Thread 2:
y = 1;
v = 2;

Thread 3:
r1 = v; (sees 1)
r2 = v; (sees 2)

Thread 4:
if (v == 2) assert (x == 1);

Thread 5:
r4 = v; (sees 2)
r5 = v; (sees 1)
r6 = v; (sees 2)

If we look at threads 1 through 3, it appears that the assignment of 1
to v (with a release store) occurs first.  Thus we might be tempted to
believe that a subsequent load_acquire of v as in thread 4 would ensure
visibility of both the assignments to x and y.

But in the current memory model proposal, this is actually nonsense.
There is no order between the two assignments to v, and the assert is
clearly allowed to fail.  In fact, the outcomes in thread 5 are allowed,
which makes it fairly clear that we can conclude nothing from thread 3.
(The only happens-before edges are within each thread and from the
stores in threads 1 and 2 to the matching reads.  Since there are no
loads in threads 1 and 2, there is no opportunity for a precedes
cycle.  And clearly no store "happens between" the corresponding stores
and loads.)

Having said that, I am not at all convinced that we want to allow the
thread 5 outcomes.  In fact, I think I'm less comfortable with the
thread 5 results here than I am with Peter's original example.  As far
as I know, all current hardware precludes the thread 5 outcomes in the
absence of compiler reordering, though that would be good to verify.

We could disallow the thread 5 behavior by explicitly requiring that
stores to each single atomic object x occur in a single total order
"store-order-for-x", and that all such orders be included in the
"happens-before" relation.  This would guarantee that the assert in
threads 1 to 4 from the above example always succeed, assuming the
thread 3 outcomes.

This would make it possible to define "previous store-releases", so that
we could make the modification that Peter and Lawrence want.

I'd be inclined to do both, even though it does make the spec slightly
longer and superficially more complicated.

I think there are no compatibility issues with Prism, since that only
defines much stronger atomics anyway.  This effectively makes our weak
atomics slightly stronger.

Other opinions?

Hans

> 
> > (I'm a little uncomfortable with the absence of a total order on 
> > stores to a single atomic variable, but I haven't convinced myself 
> > that it really matters.  We do allow, for raw atomics:
> >
> > Thread 1: x = 1; x = 2;
> > Thread 2: r1 = x; r2 = x; r3 = x;
> >
> > r1 = r3 = 1 and r2 = 2
> >
> > but since the compiler can reorder the loads in thread 2, I don't 
> > think that's astonishing.  And weakly ordered atomics are weird and 
> > dangerous.)
> >
> > Currently I think it's a tradeoff between simplicity of the 
> > specification and reference counting performance.  I think current 
> > hardware gives you the stronger semantics anyway, though a software 
> > DSM might benefit from the weaker version.
> >
> > Java does give you the stronger semantics, but Java volatiles have 
> > stronger ordering properties anyway.
> >
> > Hans
> >
> > On Mon, 20 Nov 2006, Lawrence Crowl wrote:
> >
> > > On 11/17/06, Peter Dimov <pdimov at mmltd.net> wrote:
> > > > Let me re-ask the same question again using slightly different
> > > > wording. I'm really not sure of the correct answer.  "Correct"
> > > > as in both "follows from the memory model", and "what we want".
> > > >
> > > > // x y z initially zero
> > > >
> > > > // thread 1
> > > >
> > > > x = 1;
> > > > fetchadd_release( &z, +1 );
> > > >
> > > > // thread 2
> > > >
> > > > y = 1;
> > > > fetchadd_release( &z, +1 );
> > > >
> > > > // thread 3
> > > >
> > > > if( load_acquire( &z ) == 2 )
> > > > {
> > > >     assert( x == 1 && y == 1 );
> > > > }
> > > >
> > > > Will the assert pass?
> > > >
> > > > In other words, does the load-acquire from z in thread 3
> > > > introduce sync-with edges to all previous store-releases to z,
> > > > or just to the last one?
> > >
> > > I'd need to look more carefully to be sure, but I think all edges.
> > >
> > > In any case, I think introducing only the last edge would
> > > produce a programming model that is effectively unusable.
> > >
> > > --
> > > Lawrence Crowl
