[cpp-threads] Memory ordering and N2171

Paul E. McKenney paulmck at linux.vnet.ibm.com
Sun Jun 3 03:10:00 BST 2007


Hello, Hans,

Reviewed the new section 1.10 in N2171, and I believe the issue is
reading too much into the "precedes" relation.  Given value-speculation
compiler optimizations and hardware store buffers (on all but sequentially
consistent machines), it is possible for A to "precede" B, but for B to
be executed before A in real time.  Thus, some of the properties called
out for "precedes" do not hold, and absent those properties, we really
should not be calling this relation "precedes"...

So, going paragraph by paragraph, no doubt horribly redundantly with
earlier discussion, apologies in advance, but doing it this way helps
people spot any misinterpretations that I might be falling prey to:

> 1.10p1:
> 
> 	Under a hosted implementation, a C++ program can have
> 	more than one thread of execution (a.k.a. thread) running
> 	concurrently. Each thread executes a single function according
> 	to the rules expressed in this standard. The execution of
> 	the entire program consists of an execution of all of its
> 	threads. [Note: Usually the execution can be viewed as an
> 	interleaving of all its threads. However some kinds of atomic
> 	operations, for example, allow executions inconsistent with a
> 	simple interleaving, as described below. —end note ] Under
> 	a freestanding implementation, it is implementation-defined
> 	whether a program can have more than one thread of execution.

So far, so good.

> 1.10p2:
> 
> 	The execution of each thread proceeds as defined by the remainder
> 	of this standard. The value of an object visible to a thread T
> 	at a particular point might be the initial value of the object,
> 	a value assigned to the object by T, or a value assigned to the
> 	object by another thread, according to the rules below.

Very good -- prohibits a load from an atomic variable from seeing a
"mashup" of several preceding values.  A surprisingly useful property.
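A minimal sketch of this property, using the later C++11 std::atomic
spelling rather than the proposal's notation: every load of the atomic
returns either the initial value or a value some store actually wrote,
never a byte-level mashup of the two.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// One thread repeatedly stores a recognizable 64-bit pattern while
// another reads; each load must see either the initial value or the
// stored pattern in its entirety, never half of each.
bool no_torn_reads() {
    std::atomic<uint64_t> x(0);
    bool ok = true;

    std::thread writer([&] {
        for (int i = 0; i < 1000; ++i)
            x.store(0xAAAAAAAA55555555ULL, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        for (int i = 0; i < 1000; ++i) {
            uint64_t v = x.load(std::memory_order_relaxed);
            if (v != 0 && v != 0xAAAAAAAA55555555ULL)
                ok = false;  // a "mashup" value -- forbidden by 1.10p2
        }
    });
    writer.join();
    reader.join();
    return ok;
}
```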

> 1.10p3:
> 
> 	Two expression evaluations conflict if one of them modifies a
> 	memory location and the other one accesses or modifies the same
> 	memory location.

Probably need to add something about concurrency.  The intent is clear to
me, but the above could potentially be misinterpreted as defining the
following to be conflicting:

	t = a;
	t++;
	a = t;

Of course, unless multiple threads can be executing this code
concurrently, the accesses to "a" do not conflict.  One possible fix
would be something like the following:

	Two expression evaluations conflict if (1) they can be executed
	concurrently by different threads and (2) one of them modifies
	a memory location and the other one concurrently accesses or
	modifies the same memory location.

#1 is needed to handle the possibility of super-scalar machines providing
inter-thread parallelism.
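To illustrate the intent (again in later C++11 notation): when two
threads really can execute the read-modify-write concurrently, the plain
t = a; t++; a = t; sequence is a conflicting-access data race, while
folding it into a single atomic read-modify-write removes the conflict
and loses no increments.

```cpp
#include <atomic>
#include <thread>

// Two threads increment the same counter concurrently.  Because each
// increment is one indivisible atomic operation rather than a separate
// load, increment, and store, the accesses no longer race and the
// final count is exact.
int counter_with_atomics() {
    std::atomic<int> a(0);
    auto body = [&] {
        for (int i = 0; i < 10000; ++i)
            a.fetch_add(1);  // atomic load-increment-store
    };
    std::thread t1(body), t2(body);
    t1.join();
    t2.join();
    return a.load();
}
```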

> 1.10p4:
> 
> 	The library defines a number of operations, such as operations
> 	on locks and atomic objects, that are specially identified
> 	as synchronization operations. These operations play a
> 	special role in making assignments in one thread visible
> 	to another. A synchronization operation is either an acquire
> 	operation or a release operation, or both, on one or more memory
> 	locations. [Note: For example, a call that acquires a lock will
> 	perform an acquire operation on the locations comprising the
> 	lock. Correspondingly, a call that releases the same lock will
> 	perform a release operation on those same locations. Informally,
> 	performing a release operation on A forces prior side effects
> 	on other memory locations to become visible to other threads
> 	that later perform an acquire operation on A. —end note ]

OK, presuming that "both" maps to _seq_cst in Lawrence's recent email.
However, the possibility of _relaxed needs to be mentioned here.

> 1.10p5-6, previously containing the definition of "inter-thread ordered
> before", have been deleted from this revision. Subsequent paragraphs
> will be renumbered eventually.
> 
> This was rewritten in terms of "synchronizes with", which is restricted
> to synchronization operations, instead of explicitly including store-load
> dependencies in a "communicates with" relation as in N1944. This version
> is intended to be equivalent, since we insist that "happens before"
> together with store-load dependencies remains acyclic. We need that
> for the race free implies sequential consistency proof, and for one of
> the examples.

More on this later...

> 1.10p7:
> 
> 	All modifications to a particular atomic object M occur in
> 	some particular total order, called the modification order of
> 	M.

Presumably this excludes _relaxed operations.  Consider a program where
each thread store-relaxes a unique value into a single atomic variable.
All ordered observations of that variable will see a sequence of values
consistent with at least one global order, but I have seen cases where a
single 16-CPU machine might hold up to five different simultaneous opinions
about the value of a given variable, due to store buffering.  And this
is with the restriction that a reading CPU awaiting a cache miss has
no opinion -- otherwise, more than 10 simultaneous conflicting opinions
would not be out of the ordinary.

Nevertheless, ignoring wall-clock time, the sequence of values seen
by each CPU is in fact consistent with some global ordering.  Usually
consistent with some very large set of potential global orderings.

So, yes, there is in some sense a total order of modification for
_relaxed operations on atomic variables, but this order does not
have much in common with "happens before".  It is still possible to
trace ordering through a series of such accesses in some cases, but
it is possible to get in trouble here.

This should not be surprising -- a big part of the motivation for
_relaxed is to avoid heavyweight fence operations, and the ordering
properties that they imply.
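A typical use case, sketched in later C++11 notation: per-thread
statistics counters incremented with _relaxed operations, paying no fence
overhead on the fast path, with ordering supplied only at the point where
the counts are gathered (here, by the joins).

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread increments its own relaxed counter; no fences are needed
// on the increment path.  The joins provide the ordering required for
// the final summation to be exact.
long relaxed_sum(int nthreads, int per_thread) {
    std::vector<std::atomic<long>> counts(nthreads);
    for (auto &c : counts)
        c.store(0, std::memory_order_relaxed);

    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&, t] {
            for (int i = 0; i < per_thread; ++i)
                counts[t].fetch_add(1, std::memory_order_relaxed);
        });
    for (auto &t : ts)
        t.join();  // join orders the increments before the reads below

    long total = 0;
    for (auto &c : counts)
        total += c.load(std::memory_order_relaxed);
    return total;
}
```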

> 	   An evaluation A that performs a release operation on an
> 	object M synchronizes with an evaluation B that performs an
> 	acquire operation on M and reads either the value written by
> 	A or a later value in the modification order of M. [Note: The
> 	specifications of the synchronization operations define when
> 	one reads the value written by another. For atomic variables,
> 	the definition is clear. All operations on a given lock occur
> 	in a single total order. Each lock acquisition "reads the value
> 	written" by the last lock release. —end note ]

OK, releases match acquires.
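The canonical message-passing pattern shows what this buys the
programmer (sketched in later C++11 notation): the release store to the
flag synchronizes with the acquire load that reads it, so the earlier
plain store to the data becomes visible.

```cpp
#include <atomic>
#include <thread>

// The producer's release store to `flag` synchronizes with the
// consumer's acquire load that observes it, so the prior plain store
// to `data` is guaranteed visible to the consumer.
int message_passing() {
    int data = 0;
    std::atomic<bool> flag(false);

    std::thread producer([&] {
        data = 42;                                    // plain store
        flag.store(true, std::memory_order_release);  // release on flag
    });
    std::thread consumer([&] {
        while (!flag.load(std::memory_order_acquire))
            ;  // spin until the release store is observed
    });
    producer.join();
    consumer.join();
    return data;
}
```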

> 1.10p8:
> 
> 	An evaluation A happens before an evaluation B if:
> 
>         * A is sequenced before B and either A performs an acquire
> 	    operation or B performs a release operation; or
>         * A synchronizes with B; or
>         * for some evaluation X, A happens before X and X happens before B.
> 
> 1.10p9:
> 
> 	An evaluation A precedes an evaluation B if:
> 
>         1. A happens before B; or
>         2. A is an assignment, and B observes the value stored by A; or
>         3. for some evaluation X, A precedes X and X precedes B.

I believe that the presence of store buffers can result in loops in
your "precedes" order.   In particular, different CPUs can see different
interleavings of value sequences for a given pair of atomic variables.

One issue is that B observing the value stored by A does not guarantee
that B's results will be observed to follow those of A.  One reason is
that the CPUs executing B and A might share a store buffer, so that B
sees the results of A earlier than other CPUs do.  The cache line
containing A's results might be delayed for any number of reasons, while
that for B's results might already be in place and modifiable.

Therefore, #2 does not hold on real hardware, even on relatively strong
TSO machines.  It may seem strange that someone could see a new value
of a variable before the corresponding assignment officially happened,
but that really is the way it works on all but strict sequentially
consistent machines.
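The classic store-buffering litmus test makes this concrete (sketched in
later C++11 notation).  With _relaxed, or even acquire/release,
operations, each thread can read its own store out of the store buffer
before the other thread's store becomes globally visible, so both loads
can return zero.  Only _seq_cst operations forbid that outcome:

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test with seq_cst operations: the outcome
// r1 == 0 && r2 == 0 is forbidden, because all seq_cst operations
// occur in a single total order in which one store comes first.
bool both_zero_seq_cst() {
    std::atomic<int> x(0), y(0);
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;  // guaranteed false under seq_cst
}
```

With relaxed operations substituted for seq_cst, the function may return
true on real hardware, which is exactly the store-buffer behavior
described above.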

> 1.10p10:
> 
> 	A multi-threaded execution is consistent if each thread observes
> 	values of objects that obey the following constraints:
> 
>         * No evaluation precedes itself.

Not good, see above.

A weaker condition that -would- hold would be that there is no cycle
in the "precedes" graph containing either a sequentially consistent
operation or a "synchronized-with" acquire-release pair.  At least
I -think- this would hold.  Assuming that it does, it would be easy
to validate.  It also has intuitive appeal, if looked at correctly --
after all, if you really want your operations to be ordered, you bloody
well need to use some ordering operations!!!  ;-)

Note that in the presence of some of the more aggressive value-speculation
compiler optimizations that have been discussed on this list, the cycle
could even consist of a circular chain of assignments, each depending
on the value from the previous assignment in the chain.  (This does not
necessarily constitute an out-of-thin-air situation, because there would
presumably be checks that the speculated values were in fact correct, and
these checks would presumably not depend on the speculated computation.)

>         * Each read access B to a scalar object M observes the value
> 	    assigned to M by a side effect A only if there is no other
> 	    side effect X to M such that
>               o A is sequenced before or happens before X, or A and X
> 		  are synchronization operations and A precedes X in the
> 		  modification order of M, and
>               o X is sequenced before or happens before B.

Given the possibility of cycles in the "precedes" relation, I am not
sure this works well.  Besides, if A and X execute concurrently,
this relation is of almost no help to the programmer.  You cannot
tell ahead of time which will win the global-store-order sweepstakes,
so you cannot conclude anything from it.  If the "sequenced before"
parts are discarded, then it has definite meaning.

But then there is the possibility of "out of thin air" values.

> 	[Note: The first condition implies that a read operation B cannot
> 	"see" an assignment A if B happens before A. It also prevents
> 	cyclic situation in which, for example x and y are initially zero,
> 	one thread evaluates x = y; while another evaluates y = x;, each
> 	sees the result of the other thread, and both x and y obtain a
> 	value of 42. The second condition effectively asserts that later
> 	assignments hide earlier ones if there is a well-defined order
> 	between them. —end note ]

OK.  We are stuck with cycles in the "precedes" graph.  We really want
to say that any value loaded from an atomic variable came from somewhere
real.  So, what conditions do compiler-writers apply to value speculation
to prevent out-of-thin-air values in practice? 

Suppose we instead build the data-flow graph and require that -it- have
no cycles?  For example, take IRIW on POWER:

	T0: X=1;
	T1: Y=1;
	T2: r1=X; twi; isync; r2=Y;
	T3: r3=Y; twi; isync; r4=X;

The "precedes" graph has cycles:

	X=1 => r1=X[1] -> r2=Y[0] => Y=1 => r3=Y[1] -> r4=X[0] => X=1

However, the "->" links are not present in the dataflow graph, so there
is no cycle in the dataflow graph, and thus no out-of-thin-air values.
Note that such a data-flow graph need not be actually constructed by the
compiler -- it simply excludes out-of-thin-air values in the standard.
This approach also allows full freedom to the compiler writer and
hardware implementer to use whatever combination of strategies achieve
the desired result.
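For comparison, here is IRIW rendered with _seq_cst operations (in later
C++11 notation).  The outcome in which the two readers disagree about
the order of the two stores is forbidden, because all seq_cst operations
fall into one total order:

```cpp
#include <atomic>
#include <thread>

// Independent Reads of Independent Writes (IRIW) with seq_cst
// operations: the outcome r1=1, r2=0, r3=1, r4=0, in which the two
// reader threads observe the stores to X and Y in opposite orders,
// is forbidden.
bool iriw_forbidden_outcome() {
    std::atomic<int> X(0), Y(0);
    int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

    std::thread t0([&] { X.store(1); });              // seq_cst default
    std::thread t1([&] { Y.store(1); });
    std::thread t2([&] { r1 = X.load(); r2 = Y.load(); });
    std::thread t3([&] { r3 = Y.load(); r4 = X.load(); });
    t0.join(); t1.join(); t2.join(); t3.join();

    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}
```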

Thoughts?

> 1.10p11:
> 
> 	An execution contains an inter-thread data race if it contains two
> 	conflicting actions in different threads, at least one of which is
> 	not atomic, and neither happens before the other. Any inter-thread
> 	data race results in undefined behavior. A multi-threaded program
> 	that does not contain a data race exhibits the behavior of a
> 	consistent execution.

Not sure what "consistent execution" means in this context.  It cannot
mean "sequentially consistent", since a program consisting only of
_relaxed accesses to atomics would not be guaranteed to be sequentially
consistent.

I guess the thing that makes the most sense is that "consistent execution"
means that each load of a given variable sees either the initial value
or a value stored by some preceding operation, and not some mashup
of multiple values.  Is there something else entailed by "consistent
execution"?

> 	                      [Note: It can be shown that programs that
> 	correctly use simple locks to prevent all inter-thread data
> 	races, and use no other synchronization operations, behave as
> 	though the executions of their constituent threads were simply
> 	interleaved, with each observed value of an object being the last
> 	value assigned in that interleaving. This is normally referred
> 	to as "sequential consistency". However, this applies only to
> 	race-free programs, and race-free programs cannot observe most
> 	program transformations that do not change single-threaded program
> 	semantics. In fact, most single-threaded program transformations
> 	continue to be allowed, since any program that behaves differently
> 	as a result must perform an undefined operation. —end note ]

Programs with data races cannot even expect loads to return sensible
values (at least in the absence of common but non-standard extensions).
I believe that this issue is much more important than the relation to
sequential consistency.

> 1.10p12:
> 
> 	[Note: Compiler transformations that introduce assignments to a
> 	potentially shared memory location that would not be modified by
> 	the abstract machine are generally precluded by this standard,
> 	since such an assignment might overwrite another assignment by a
> 	different thread in cases in which an abstract machine execution
> 	would not have encountered a data race. —end note ]

Here "potentially shared" means any variable that cannot be proven
to be thread-local?

Final point -- the name "precedes" is quite misleading.  An operation A
that "precedes" an operation B might in fact happen -after- operation
B by wallclock time, courtesy of value-speculation optimizations and
store buffers.  The "precedes" relation is at best an after-the-fact
artificial ordering that is not necessarily visible to the programmer.
How about something like "pseudo precedes" to make the artificiality of
this relation painfully clear?

						Thanx, Paul

PS.  Please note that a string of _relaxed atomic operations to a
     single variable does not necessarily generate a happens-before
     relationship.  Hardware speculative execution is one way to get
     parallel atomicity.  Architectures that populate memory with
     ALUs are another -- in this case, the atomic operation might be
     shipped to memory rather than the memory to the atomic operation.
     Sufficiently weak memory machines can also have this property.
