[cpp-threads] Yet another visibility question

Boehm, Hans hans.boehm at hp.com
Sat Dec 9 01:53:11 GMT 2006


I finally managed to write up a possible revision to Clark's and my
proposal.  Apologies for the length.  But the
actual text changes are rather small; it's just the discussion and
examples that are long.  I'm not sure that this
is completely right, or even what we really want.  And I'm sure that
Clark would want to rephrase at least parts of it.  But in any case, I'd
appreciate feedback.  (Everyone other than Clark can probably start
reading at the 1.10p5 change.  The stuff before that is minor.)

1.10p1:
{On rereading, talking about interleaved execution here may be confusing, in
that it initially really makes too strong a statement, and then backs off in
later paragraphs.}
"The execution of the entire program consists of an interleaved execution of
all of its threads." ==>
"The execution of the entire program consists of an execution of all of its
threads. [Note: Usually the execution can be viewed as an interleaving of all
its threads.  However, some kinds of atomic operations, for example, allow
executions inconsistent with a simple interleaving, as described below.]"

1.10p4:
{Restate slightly:}
"The library defines a number of operations, such as operations on locks,
that are specially identified as synchronization operations." ==>
"The library defines a number of operations, such as operations on locks AND
ATOMIC OBJECTS, that are specially identified as synchronization operations."
{This is just a clarification.}

"A synchronization operation is either an acquire operation or a release
operation, or both, on one or more memory locations." ==>
"A synchronization operation MAY BE either an acquire operation or a release
operation, or both."
{Since even raw atomics introduce "synchronizes with" relationships, it makes
sense to also call them synchronization operations.  In this formulation, I
don't think it helps to have acquire and release operations formally
associated with a location, though effectively they still are, as seen
immediately below.}

1.10p5:
{Generalize the second condition for inter-thread-ordered-before
to ordinary writes, but not reads:}
"A is an unordered atomic read and B is an unordered atomic write, and either
the value written by B varies depending on the value read by A, or the
execution of B is conditioned on the value read by A." ==>
"A is an unordered atomic read and B writes a value which depends on the
value read by A, or the execution of B is conditioned on the value read by A."
{I believe this is generally true anyway, so long as the compiler treats an
unordered atomic read as completely opaque, since it is not allowed to
perform the write B speculatively.  I don't know of any hardware that might
perform a dependent store out of order.}

1.10p7:
{Add synchronizes-with relationships back in for all writer-reader pairs of
atomic memory operations.  Whether or not something is an acquire or release
operation no longer affects "synchronizes-with", but it still affects
"inter-thread-ordered-before" and hence "happens-before".  This means that
"raw" operations can be used to synchronize threads if the
"inter-thread-ordered-before" relationships are introduced by other ordered
operations, which now act more like fences, as they generally do on hardware.
We also provide per-variable TSO, which is what triggered all these changes.}
"An evaluation A that performs a release operation on a location L
synchronizes-with an evaluation B that performs an acquire operation on L and
reads the value written by A." ==>
"All modifications to a particular memory location L occur in some particular
total order, referred to as the modification order for L.  An evaluation A
that performs a synchronization operation that modifies location L
synchronizes-with an evaluation B that reads the value written by A, or reads
a value that was subsequently (in modification order) written to L."

1.10p8:
{Add another clause to the happens-before definition.  A happens before B
if ... or:}
- A occurs before B in the modification order for any location L.
{At first glance this seems too strong.  But I think it's again OK, since,
unlike Java, we limit inter-thread-ordered-before, and hence the
happens-before ordering within a thread.  See examples below.}

Some examples, attempting to illustrate borderline behavior:

(everything initially zero, as usual; v, w atomic (with acquire/release
operations except where indicated); x an ordinary variable)

Example 1:

Thread 1: v =raw 1;
Thread 2: v =raw 2;
Thread 3: r1 = v; (1)
	  r2 = v; (2)
	  r3 = v; (1)

This is now disallowed.  If thread 1 occurs before thread 2 in v's
modification order, then thread 2's action "occurs between" the write of 1 to
v and the final read.  If thread 2 occurs before thread 1 in the modification
order, then thread 1's action occurs between the write of 2 and the read of 2.

The only part of this that might be controversial and adjustable is that we
apply this even to a raw (though atomic) store, and not just to release
operations.

The readers do need to use acquire operations to get the right
inter-thread-ordered-before relationships within thread 3.

Note that this relies critically on the addition of per-variable TSO.  In its
current form, it also relies on the strengthened version of synchronizes-with.

Example 2:

Thread 1: x = 1;  // not atomic
	  v = 1;  // atomic

Thread 2: v = 2;
	  r1 = x; (0)

Thread 3: r2 = v; (1)
	  r3 = v; (2)

By the same reasoning as in the preceding example, the store of 1 to v must
occur before the store of 2 in v's modification order.  Thus the v = 1
assignment happens-before the v = 2 assignment.  However, v = 2 is NOT
inter-thread-ordered-before r1 = x, and hence does not happen-before r1 = x.
As a result, it is not guaranteed to see x = 1, and this execution is legal.


Example 3 (slight modification of 2):

Thread 1: x = 1;  // not atomic
	  v = 1;  // atomic, release store

Thread 2: v = 2;
	  w++;	  // atomic fully ordered increment
	  r1 = x; (0) // acquire load

Thread 3: r2 = v; (1)
	  r3 = v; (2)

In this case, the otherwise unrelated increment operation does ensure that
v = 2 is inter-thread-ordered-before r1 = x.  Hence x = 1 happens-before
r1 = x, the initial value of x is no longer visible, and this outcome is NOT
allowed.

I believe this is

a) more constraining on the implementation than what we had previously
decided.
b) still consistent with the behavior I would expect from implementations,
since the increment normally acts as a fence.

Certainly standard compiler transformations cannot result in this outcome.

Note that it removes any hope of optimizing out thread-local synchronization.
The variable w in the above may be local to thread 2, but "w++" cannot be
replaced by an ordinary nonatomic increment; the ordering semantics are still
visible.  (This is different from Java, but also unlike Java, I think this is
unlikely to be a very profitable optimization for C++.)

It does mean that we cannot always inline threads and eliminate the
synchronization only involving those threads, since that synchronization can
impose visible ordering constraints on the remaining threads.  I think there
is no way to change that without restricting cross-thread happens-before to
involve only acquire and release operations, as before.  Both the 1.10p7 and
1.10p8 changes violate that.  (In some sense they are very similar.  With the
1.10p7 change, I think 1.10p8 amounts to treating atomic writes as always
reading the old value.)

I do think that thread inlining works if all atomic reads have acquire
semantics and all writes have release semantics.  It's unclear whether
you would perform this sort of transformation without having full access
to all the code involved, so Herb Sutter and David Callahan may still be
happy:  If you don't use those ugly low level atomics, nothing much
changes.

Example 4: (Peter Dimov)

(N copies:)

  // threads 1 to N:
 
  assert( x == 0 );
 
  if( fetchadd_release( &v, 1 ) == N - 1 ) {
     x = 1;
  }
 
The fetchadd_release operations are happens-before ordered both because of
the modification order on v and because they read each other's writes.  The
"x = 1" write is inter-thread-ordered after the fetchadd in the same thread.
Hence "assert( x == 0 )" now happens-before "x = 1" and the assertion must
succeed.

Peter - With these changes, are you convinced that reference counting can be
efficiently implemented?  I haven't thought it through.



