[cpp-threads] N2052 sequencing revisions & examples (moderately long)

Hans Boehm Hans.Boehm at hp.com
Fri Mar 9 06:00:51 GMT 2007


Here's another stab at further revisions to N2052.

As explained below, there is a clear trade-off between the ability to have
the compiler eliminate synchronization on thread local objects (and hence
productively combine threads), and getting maximum ordering benefit out of
the normal fence-based implementations in connection with raw atomics.
I've heard strong requests in favor of both.

In this version, I favor thread-local synchronization elimination.
My intuition is that that's the right thing to do, but I don't
expect this to be adopted without debate.

1.10p1:
{On rereading, talking about interleaved execution here may be confusing, in
that it initially makes too strong a statement and then backs off
in later paragraphs.}
"The execution of the entire program consists of an interleaved execution of
all of its threads." ==>
"The execution of the entire program consists of an execution of all of its
threads. [Note: Usually the execution can be viewed as an interleaving of
all its threads.  However, some kinds of atomic operations, for example,
allow executions inconsistent with a simple interleaving, as described below.]"

1.10p4:
{Restate slightly:}
"The library defines a number of operations, such as operations on locks,
that are specially identified as synchronization operations."
==>
"The library defines a number of operations, such as operations on locks AND
ATOMIC OBJECTS, that are specially identified as synchronization operations."
{This is just a clarification.}

1.10p5:
{Delete paragraph.  Part of this is moved to 1.10p8.  The current plan
is to put any discussion of dependencies or out-of-thin-air values
in the atomics section of the library.}

{The note following 1.10p5 should also be deleted.}

1.10p7:
{Explicitly require "cache coherence" or "per-variable TSO".  The second part
of this is in 1.10p10. Add:}
"All modifications to a particular memory location L occur in some
particular total order, referred to as the modification order for L."

{before the sentence (modifications in capitals):}

"An evaluation A that performs a release operation on a location L
synchronizes-with an evaluation B that performs an acquire operation on L and
reads EITHER the value written by A OR A LATER VALUE (IN MODIFICATION
ORDER)."

1.10p8:
{Replace the "inter-thread-ordered-before" condition with: }

"A is sequenced before B and either A performs an acquire operation, or
B performs a release operation"

{from 1.10p5}


1.10p10:
Change

"*  A is sequenced before or happens-before X, and"

to

"*  A is sequenced before or happens-before X, or A and X are synchronization
operations and A precedes X in the modification order, and"

What has been in flux recently:

- Whether all corresponding loads and stores or just acquire/release
ones contribute to synchronizes-with.  (This affects thread-local
synchronization elimination vs. ordering strength.)

- The treatment of the modification order, and how it interacts with
happens-before.


Open questions:

[Superficial] Should we get rid of "synchronizes with" and just
directly integrate it into happens-before?

[Slightly deeper]  Can "sequenced before" contribute directly to
happens-before instead of using a subset for acquire and release operations?
This would make the model simpler, and bring it closer to the Java model.
Some of the need for this disappeared.  But I think we still need the
distinction, since we don't want (all raw operations):

Thread 1:
r1 = x; (1)
y = 1;

Thread 2:
r2 = y; (1)
x = 1;

to result in a precedes cycle.  But there may be a better way to phrase
this.  (This is still only a question of presentation, I think.)

[Profound] Can we accommodate any of the IBM proposals without significantly
changing this?  My guess is that we can model a suitable notion of fences.
But if we decide to weaken the rules for "high level" atomics, then none of
the current models for that look much like this.
(In my view, we would need something
significantly simpler than those models for this to really become feasible.
If we provide both sequential consistency and a weaker option to
accommodate existing hardware, we may be able to get away with less
precision for the latter.)

General observations:
(You may want to read the examples below first to understand
some of this.)
This model tends to err on the weak side, by not having raw loads
and stores imply any ordering, even in the presence of other operations
with both acquire and release semantics.  I believe this ensures
that even fully ordered RMW operations can be optimized to the fence-less
nonatomic version if the affected location is thread-local.  This
seems to be important if we want the compiler to be able to combine
excessively fine-grained threads.  Some of us suspect that this is
likely to become important in the future, but we have little to go on.
It is consistent with Java's approach.

The down side is that we sometimes get surprisingly weak ordering
properties in obscure cases, including some of Peter Dimov's
test cases, where the stronger formulation might avoid a fence.  None
of these problem cases occur if loads always have acquire semantics,
stores always have at least release semantics, and RMW operations
always have both.  (For those of you participating in some of the other
atomics discussions, I think this is equivalent to assuming at least
"weak C4" rules.)

I believe this can be fairly easily extended to cover fences.  But there
are details that need to be explored if we do include fences.  In particular,
different architectures have subtly different fence semantics, and the
challenge is to make our semantics weak enough to cover them all.

This version assumes that relying on dependence to imply ordering
is hopeless at this level.  Peter Dimov and I have tried, as did
the JSR133 team before us, and we have not succeeded.
It appears technically infeasible to generate such a definition
that doesn't have more severe adverse consequences than the problems
we are trying to solve.  I am not arguing that it is fundamentally
a bad idea to add dependence-based ordering to the definition.
Anyone is welcome to take this as a challenge to come up with a better
technical solution.

(I will admit to not being favorably disposed towards solutions that have
compilation impact on other compilation units and on existing dynamic
libraries.  I am not convinced that in the big scheme of things,
low level atomics would be perceived as important enough to warrant
such a change.  If the committee disagrees, I'd be happy to reconsider.
I know that at least one person disagrees.)


Some examples, attempting to illustrate borderline behavior:

(everything initially zero, as usual; v, w atomic (with
acquire/release semantics except where indicated), x an
ordinary variable; values read given in parentheses)

Note that unless otherwise stated, "=" on atomic variables here denotes
store_release or load_acquire, NOT the default assignment from the atomics
proposal.

Example 0: (all operations raw/relaxed/unordered for this example)
Thread 1: r1 = v (42); w = r1;
Thread 2: r2 = w (42); v = r2;

(Out of thin air result.)  Each read sees the write by the other
thread.  Since we deleted the dependency clause from 1.10p5, this
is now allowed by the core memory model.  The intent is that
we would disallow this behavior in the atomics library section,
using phrasing vaguely similar to what I proposed earlier.
This is based on the problems we had with dependency-based ordering
in earlier proposals.  Hence the best option appears to be a potentially
slightly vague statement in an obscure corner of the library definition,
rather than some serious hand-waving in the core memory model.
This only affects raw atomics in any case.

Example 1a:

Thread 1: a: v = 1;
Thread 2: b: v = 2;
Thread 3: c: r1 = v; (1)
	  d: r2 = v; (2)
	  e: r3 = v; (1)

This is now disallowed.

Assume b occurs after a in v's modification order.  Then

c happens-before d happens-before e  ; acquire semantics
a happens-before c, b happens-before d  ; synchronizes-with

Thus b happens-before e, and a occurs before b in the modification
order.  As a result of the 1.10p10 change, e cannot see a.

If b occurs before a in v's modification order, we similarly
get a happens-before d, and b occurs before a in the modification
order, hence d cannot see b as described.

Example 1b:

Assume all of the stores in the above example are raw/relaxed.

Now the example is allowed.  This is unsurprising, since intuitively
the loads in thread 3 can be arbitrarily reordered.

Example 1c:

Assume only the Thread 1 and 2 stores from example 1a are raw/relaxed.

The example outcome is still allowed, since the synchronizes-with
relations are missing.  This is consistent with the general rule
that a release operation only has visible implications if there is a matching
acquire operation in another thread.  It is also consistent with the
outcome if threads 1 and 2 were combined into a single thread.
It is inconsistent with what one would expect from a normal hardware
implementation.

Although I'm not enthusiastic about the 1c outcome, I guess it
doesn't bother me either.

Example 2:

Thread 1: x = 1;  // not atomic
	  v = 1;  // atomic

Thread 2: v = 2;
	  r1 = x; (0)

Thread 3: r2 = v; (1)
	  r3 = v; (2)

By the same reasoning as in the preceding example, the store of 1 to v
must occur before the store of 2 in v's modification order.  But there
is no happens-before relationship between either v = 1 and v = 2
or the two actions of thread 2.  As a result Thread 2 may see a 0
value for x.

As always, cross-thread visibility is only ensured by matching
acquire/release operations.  There are none between thread 1 and 2.
And indeed, although it is not shown in this execution, this code
inherently has a race on x.

Example 3 (slight modification of 2):

Thread 1: x = 1;  // not atomic
	  v = 1;  // atomic, release store

Thread 2: v = 2;
	  w++;	  // atomic fully ordered increment
	  r1 = x; (0) // acquire load

Thread 3: r2 = v; (1)
	  r3 = v; (2)

Although this forces a happens-before ordering on the two actions of thread 2,
there is still no happens-before relationship between threads 1 and 2.
Hence the result doesn't change.

This was desired by at least two of us, since it ensures that if w is
accessed by only one thread, it can be replaced by an ordinary increment.

Example 4: (Peter Dimov)

(N copies:)

  // threads 1 to N:

  assert( x == 0 );

  if( fetchadd_release( &v, 1 ) == N - 1 ) {
     x = 1;
  }

All fetchadd_release operations occur in a total modification order.
But there are no synchronizes-with or happens-before relationships
across threads.  Hence the assertion can fail.

Indeed, fetchadd_release operations should generally be used only if
no visibility implications are desired, or if the code previously executes an
acquire operation on the same object.

Hans




