[cpp-threads] Yet another visibility question
Lawrence Crowl
Lawrence at Crowl.org
Tue Nov 21 23:33:49 GMT 2006
I'm not convinced we are talking about the same problem.
On 11/21/06, Boehm, Hans <hans.boehm at hp.com> wrote:
> I still don't fully agree with Lawrence's arguments, but the more I look
> at this, the more I'm inclined to agree with Lawrence's and Peter's
> conclusion. Details below:
>
> > -----Original Message-----
> > From: Lawrence Crowl
> > [Hans:]
> > > I don't really agree with Lawrence that this is unusable.
> > > The rules are fairly clear, and we all agree that non-wizards
> > > should use fetch_add_full here anyway.
> >
> > I fail to see the use of fetch_add_release if it does not
> > release its writes to all subsequent readers. If the
> > behavior of a third thread can prevent two threads from
> > communicating, we have a problem. In that case, one must
> > rely on other synchronization, and we could as well have used
> > a totally unsynchronized fetch_add.
> Since fetch_add both reads and writes a variable, the normal use of the
> different variants is:
>
> fetch_add_full: The default that should normally be used in examples
> such as the one we're discussing here.
>
> fetch_add_acquire, fetch_add_release: Makes sense if fetch_add is used
> to implement lock/unlock operations of some flavor. Each
> fetch_add_release value is read by a fetch_add_acquire, and thus each
> lock holder's memory operations become visible to the next lock holder.
> Everything behaves as expected. You can get away without the "full"
> variant because we're in a special case in which the operations occur in
> pairs.
Perhaps the issue is what the use cases are for non-fully-ordered
fetch_add operations. The following seems plausible, but I think it
depends on all release edges being live.
A reference-counted object uses fetch_add_acquire both to log the
new referrer and to obtain the object contents.
I'll have to ponder use-cases for fetch_add_release.
> fetch_add_raw: Useful for atomically updating counters that will only be
> read after the threads are joined.
>
> I think these uses are all fairly clear and make sense. We're really
> discussing a (admittedly important) borderline case in which you might
> be able to get away with fetch_add_release, but in my mind,
> fetch_add_full is the natural construct.
If the case is borderline, why should we burden the standard with
the complexity?
>
> In a multithreaded process, a third thread can always mess up
> communication between the other two if it has access to one of their
> shared variables. So that doesn't particularly bother me either.
> >
> > > It seems to be slightly harder to specify the version in which
> > > all the edges are introduced. For fetch_add, there is clearly a
> > > total order among all the operations on z. In general, we
> > > currently do not assume a total order on stores at this level,
> > > so the notion of "all previous" is not well-defined. But I
> > > think that's fixable, especially if someone can come up with a
> > > strong reason for doing so.
> >
> > In this case, no total order is required because the
> > (non-atomic) writes are to different variables. (That is,
> > there is no race.)
> There is no race, but you are assuming an order on the updates to the
> atomic z. And for fetch_add, that seems very reasonable. For ordinary
> stores it may well be more reasonable than the current proposal, but it
> is not what it currently says.
I don't think I was depending on the order. To recap, the example is
// x y z initially zero
// thread 1
x = 1; fetch_add_release( &z, +1 );
// thread 2
y = 1; fetch_add_release( &z, +1 );
// thread 3
if( load_acquire( &z ) == 2 ) assert( x == 1 && y == 1 );
For the if condition to be true, both releases must have occurred
and the new values of x and y should be visible. There is no
dependence on their order, only that they both occurred. Am I
missing something?
> The following example may shed light on
> both the difficulty, and why we might want to address it anyway:
>
> Consider (v atomic, x and y ordinary, all assignments to/from v are
> really load_acquire or store_release operations, everything initially
> zero):
>
> [These are really two different examples wrapped into one.]
>
> Thread 1:
> x = 1;
> v = 1;
>
> Thread 2:
> y = 1;
> v = 2;
>
> Thread 3:
> r1 = v; (sees 1)
> r2 = v; (sees 2)
>
> Thread 4:
> if (v == 2) assert (x == 1);
>
> If we look at threads 1 through 3, it appears that the assignment of 1
> to v (with a release store) occurs first. Thus we might be tempted to
> believe that a subsequent load_acquire of v as in thread 4 would ensure
> visibility of both the assignments to x and y.
>
> But in the current memory model proposal, this is actually nonsense.
> There is no order between the two assignments to v, and the assert is
> clearly allowed to fail.
I agree that reading v==2 does not imply that x==1 in this case. At
present, we do not have a total store order. The question is what is
different between the initial case and your alternate case.
>
> Thread 5:
> r4 = v; (sees 2)
> r5 = v; (sees 1)
> r6 = v; (sees 2)
> In fact, the outcomes in thread 5 are allowed,
> which makes it fairly clear that we can conclude nothing from thread 3.
> (The only happens-before edges are within each thread and from the
> stores in threads 1 and 2 to the matching reads. Since there are no
> loads in threads one and two, there is no opportunity for a precedes
> cycle. And clearly no store "happens between" the corresponding stores
> and loads.)
>
> Having said that, I am not at all convinced that we want to allow the
> thread 5 outcomes. In fact, I think I'm less comfortable with the
> thread 5 results here than I am with Peter's original example. As far
> as I know, all current hardware precludes the thread 5 outcomes in the
> absence of compiler reordering, though that would be good to verify.
>
> We could disallow the thread 5 behavior by explicitly requiring that
> stores to each single atomic object x occur in a single total order
> "store-order-for-x", and that all such orders be included in the
> "happens-before" relation. This would guarantee that the assert in
> threads 1 to 4 from the above example always succeeds, assuming the
> thread 3 outcomes.
I also am uncomfortable with thread 5 outcomes.
Would outcome 5 be prevented by a weaker constraint that each thread
have a sequentially consistent view of the values, even though those
views may differ from thread to thread?
> This would make it possible to define "previous store-releases", so that
> we could make the modification that Peter and Lawrence want.
I'd still like to hear what the hardware folks have to say about total
store order before committing to that, but I think there is something
intermediate between that and what you are saying about the current
document.
>
> I'd be inclined to do both, even though it does make the spec slightly
> longer and superficially more complicated.
>
> I think there are no compatibility issues with Prism, since that only
> defines much stronger atomics anyway. This effectively makes our weak
> atomics slightly stronger.
>
> Other opinions?
--
Lawrence Crowl