[cpp-threads] RE: Memory ordering and N2171

Boehm, Hans hans.boehm at hp.com
Mon Jun 4 23:30:47 BST 2007


Thanks for going through this.

> From: Paul E. McKenney [mailto:paulmck at linux.vnet.ibm.com] 
> 
> Hello, Hans,
> 
> Reviewed the new section 1.10 in N2171, and I believe the 
> issue is reading too much into the "precedes" relation.  
> Given value-speculation compiler optimizations and hardware 
> store buffers (in all but sequentially consistent machines), 
> it is possible for A to "precede" B, but B to be executed 
> before A in real time.  Thus, some of the properties called 
> out for "precedes" do not hold.  Given these properties, we 
> really should not be calling this "precedes"...
> 
> So, going paragraph by paragraph, no doubt horribly 
> redundantly with earlier discussion, apologies in advance, 
> but doing it this way helps people spot any 
> misinterpretations that I might be falling prey to:
> 
> > 1.10p1: [omitted]
> > 1.10p2: [omitted]
> 
> > 1.10p3:
> > 
> > 	Two expression evaluations conflict if one of them modifies a
> > 	memory location and the other one accesses or modifies the same
> > 	memory location.
> 
> Probably need to add something about concurrency.  The intent 
> is clear to me, but the above could potentially be 
> misinterpreted as defining the following to be conflicting:
> 
> 	t = a;
> 	t++;
> 	a = t;
> 
> Of course, unless multiple threads can be executing this code 
> concurrently, the accesses to "a" do not conflict.  One 
> possible fix would be something like the following:
> 
> 	Two expression evaluations conflict if (1) they can be executed
> 	concurrently by different threads and (2) one of them modifies
> 	a memory location and the other one concurrently accesses or
> 	modifies the same memory location.
> 
> #1 is needed to handle the possibility of super-scalar 
> machines providing inter-thread parallelism.
This has been brought up before.  I'm actually inclined to leave it as
is, since the current definition is exactly what's used in the April
1988 TOPLAS paper by Shasha and Snir, and as far as I can tell has
become more or less standard.  I don't think this is a substantive
issue, since the definition is used only in p11, where it is applied to
"two conflicting actions in different threads".

(Some much earlier versions of this proposal actually did define a
notion of intra-thread data race, corresponding to two conflicting
operations in the same thread that are not sequenced.  That was dropped
in favor of something closer to what was there before.)

> 
> > 1.10p4:
> > 
> > 	The library defines a number of operations, such as operations
> > 	on locks and atomic objects, that are specially identified
> > 	as synchronization operations. These operations play a
> > 	special role in making assignments in one thread visible
> > 	to another. A synchronization operation is either an acquire
> > 	operation or a release operation, or both, on one or more memory
> > 	locations. [Note: For example, a call that acquires a lock will
> > 	perform an acquire operation on the locations comprising the
> > 	lock. Correspondingly, a call that releases the same lock will
> > 	perform a release operation on those same locations. Informally,
> > 	performing a release operation on A forces prior side effects
> > 	on other memory locations to become visible to other threads
> > 	that later perform an acquire operation on A. -end note ]
> 
> OK, presuming that "both" maps to _seq_cst in Lawrence's recent email.
> However, the possibility of _relaxed needs to be mentioned here.
In this formulation, relaxed operations are viewed as "atomic", but not
"synchronization operations".  I'm torn about whether this is the right
way to present things, but I think the current formulation is internally
consistent.  I expanded the note to make this clearer.

> 
> > 1.10p5-6, previously containing the definition of "inter-thread 
> > ordered before", have been deleted from this revision. Subsequent 
> > paragraphs will be renumbered eventually.
> > 
> > This was rewritten in terms of "synchronizes with", which is
> > restricted to synchronization operations, instead of explicitly
> > including store-load dependencies in a "communicates with" relation
> > as in N1944. This version is intended to be equivalent, since we
> > insist that "happens before" together with store-load dependencies
> > remains acyclic. We need that for the race-free implies sequential
> > consistency proof, and for one of the examples.
> 
> More on this later...
> 
> > 1.10p7:
> > 
> > 	All modifications to a particular atomic object M occur in
> > 	some particular total order, called the modification order of
> > 	M.
> 
> Presumably this excludes _relaxed operations.  Consider a 
> program where each thread store-relaxes a unique value into a 
> single atomic variable.
> All ordered observations of that variable will see a sequence 
> of values consistent with at least one global order, but I 
> have seen cases where a single 16-CPU machine might have up to
> five different simultaneous opinions about the value of a
> given variable, due to store buffering.  And this is with the 
> restriction that a reading CPU awaiting a cache miss has no 
> opinion -- otherwise, more than 10 simultaneous conflicting 
> opinions would not be out of the ordinary.
> 
> Nevertheless, ignoring wall-clock time, the sequence of 
> values seen by each CPU is in fact consistent with some 
> global ordering.  Usually consistent with some very large set 
> of potential global orderings.
> 
> So, yes, there is in some sense a total order of modification 
> for _relaxed operations on atomic variables, but this order 
> does not have much in common with "happens before".  It is 
> still possible to trace ordering through a series of such 
> accesses in some cases, but it is possible to get in trouble here.
> 
> This should not be surprising -- a big part of the motivation 
> for _relaxed is to avoid heavyweight fence operations, and 
> the ordering properties that they imply.
I think I agree with all of the observations, but I'm not convinced that
a substantive change is needed.
I added:

[Note: The modification orders reflect the order in which updates to a
single variable become globally visible. In the case of relaxed atomic
operations, both assignments to and reads from the object may
effectively be locally reordered, so the modification order may only be
very indirectly observable. Repeated load_acquire operations on the same
atomic object must observe some subset of the updates in modification
order. -end note ]
> 
> > 	   An evaluation A that performs a release operation on an
> > 	object M synchronizes with an evaluation B that performs an
> > 	acquire operation on M and reads either the value written by
> > 	A or a later value in the modification order of M. [Note: The
> > 	specifications of the synchronization operations define when
> > 	one reads the value written by another. For atomic variables,
> > 	the definition is clear. All operations on a given lock occur
> > 	in a single total order. Each lock acquisition "reads the value
> > 	written" by the last lock release. -end note ]
> 
> OK, releases match acquires.
I tentatively changed this to eliminate the "later value" part, based on
prior discussion.  I'm open to better ideas.  That change is clearly in
the "lesser evil" category.

> 
> > 1.10p8:
> > 
> > 	An evaluation A happens before an evaluation B if:
> > 
> >         * A is sequenced before B and either A performs an acquire
> > 	    operation or B performs a release operation; or
> >         * A synchronizes with B; or
> >         * for some evaluation X, A happens before X and X
> > 	    happens before B.
> > 
> > 1.10p9:
> > 
> > 	An evaluation A precedes an evaluation B if:
> > 
> >         1. A happens before B; or
> >         2. A is an assignment, and B observes the value stored by A; or
> >         3. for some evaluation X, A precedes X and X precedes B.
> 
> I believe that the presence of store buffers can result in loops in
> your "precedes" order.  In particular, different CPUs can see
> different interleavings of value sequences for a given pair of
> atomic variables.
> 
> One issue is that B observing the value stored by A does not 
> guarantee that B's results will be observed to follow those 
> of A.  One reason is that B and A might share a store buffer, 
> so that B sees the results of A earlier than do other CPUs.  
> The cache line containing A's results might be delayed for 
> any number of reasons, while that for B's results might be 
> already in place and modifiable.
I agree.  And I certainly understand the issue.  But I don't see how
that's a problem with the current statement.

The current phrasing is a more precise way to state that the "visibility
ordering" (A is observed by B) is consistent with happens-before.
That's essentially the point of the happens-before relation.

Note that in this formulation happens-before does not order operations in
the same thread, unless their visibility order is actually guaranteed.
(Sequenced-before is NOT a subset of happens-before.)

Do you have a test case in mind for which we actually get an unexpected
precedes cycle?

> 
> Therefore, #2 does not hold on real hardware, even on 
> relatively strong TSO machines.  It may seem strange that 
> someone could see a new value of a variable before the 
> corresponding assignment officially happened, but that really 
> is the way it works on all but strict sequentially consistent 
> machines.
I don't see how that causes a problem here.  Happens-before has a
technical definition here, which includes neither the modification
order, nor the program order.
> 
> > 1.10p10:
> > 
> > 	A multi-threaded execution is consistent if each thread observes
> > 	values of objects that obey the following constraints:
> > 
> >         * No evaluation precedes itself.
> 
> Not good, see above.
> 
> A weaker condition that -would- hold would be that there is 
> no cycle in the "precedes" graph containing either a 
> sequentially consistent operation or a "synchronized-with" 
> acquire-release pair.
There is no way to get a cycle without an acquire-release pair.  That's
the only way to get cross-thread happens-before relationships.

> At least I -think- this would hold.  
> Assuming that it does, it would be easy to validate.  It also 
> has intuitive appeal, if looked at correctly -- after all, if 
> you really want your operations to be ordered, you bloody 
> well need to use some ordering operations!!!  ;-)
> 
> Note that in the presence of some of the more aggressive 
> value-speculation compiler optimizations that have been 
> discussed on this list, the cycle could even consist of a 
> circular chain of assignments, each depending on the value 
> from the previous assignment in the chain.  (This does not 
> necessarily constitute an out-of-thin-air situation, because 
> there would presumably be checks that the speculated values 
> were in fact correct, and these checks would presumably not 
> depend on the speculated computation.)
> 
> >         * Each read access B to a scalar object M observes the value
> > 	    assigned to M by a side effect A only if there is no other
> > 	    side effect X to M such that
> >               o A is sequenced before or happens before X, or A and X
> > 		  are synchronization operations and A precedes X in the
> > 		  modification order of M, and
> >               o X is sequenced before or happens before B.
> 
> Given the possibility of cycles in the sequenced-before 
> relation, not sure this works well.
That is not a possibility.  "sequenced before" is the evaluation order
within a thread.

> Besides, if A and X 
> execute concurrently, this relation is of almost no help to 
> the programmer.  You can't tell ahead of time which will win 
> the global-store-order sweepstakes, so you cannot conclude 
> anything from it.  If the "sequenced before"
> parts are discarded, then it has definite meaning.
> 
> But then there is the possibility of "out of thin air" values.
> 
> > 	[Note: The first condition implies that a read operation B
> > 	cannot "see" an assignment A if B happens before A. It also
> > 	prevents cyclic situations in which, for example, x and y are
> > 	initially zero, one thread evaluates x = y; while another
> > 	evaluates y = x;, each sees the result of the other thread, and
> > 	both x and y obtain a value of 42. The second condition
> > 	effectively asserts that later assignments hide earlier ones if
> > 	there is a well-defined order between them. -end note ]
The middle statement about out-of-thin-air results was inadvertently
left in here.  My current plan is to go with a treatment of "out of thin
air" results that corresponds to the final section of N2176.
> 
> OK.  We are stuck with cycles in the "precedes" graph.  We 
> really want to say that any value loaded from an atomic 
> variable came from somewhere real.  So, what conditions do 
> compiler-writers apply to value speculation to prevent 
> out-of-thin-air values in practice? 
> 
> Suppose we instead build the data-flow graph and require that 
> -it- have no cycles?  For example, take IRIW on POWER:
> 
> 	T0: X=1;
> 	T1: Y=1;
> 	T2: r1=X; twi; isync; r2=Y;
> 	T3: r3=Y; twi; isync; r4=X;
> 
> The "precedes" graph has cycles:
> 
> 	X=1 => r1=X[1] -> r2=Y[0] => Y=1 => r3=Y[1] -> r4=X[0] => X=1
> 
> However, the "->" links are not present in the dataflow 
> graph, so there is no cycle in the dataflow graph, and thus 
> no out-of-thin-air values.
The r2=Y[0] => Y=1 and r4=X[0] => X=1 edges are not there in the
precedes relation either.  There are no load->store edges in either
happens-before or the visibility relation.

> Note that such a data-flow graph need not be actually 
> constructed by the compiler -- it simply excludes 
> out-of-thin-air values in the standard.
> This approach also allows full freedom to the compiler writer 
> and hardware implementer to use whatever combination of 
> strategies achieve the desired result.
> 
> Thoughts?
I think this is basically equivalent to the N2176 approach, with the
same advantages and problems.

> 
> > 1.10p11:
> > 
> > 	An execution contains an inter-thread data race if it contains
> > 	two conflicting actions in different threads, at least one of
> > 	which is not atomic, and neither happens before the other. Any
> > 	inter-thread data race results in undefined behavior. A
> > 	multi-threaded program that does not contain a data race
> > 	exhibits the behavior of a consistent execution.
> 
> Not sure what "consistent execution" means in this context.  
> It cannot mean "sequentially consistent", since a program 
> consisting only of _relaxed access to atomics would not be 
> guaranteed to be sequentially consistent.
> 
> I guess the thing that makes the most sense is that 
> "consistent execution"
> means that each load of a given variable sees either the 
> initial value or a value stored by some preceding operation, 
> and not some mashup of multiple values.  Is there something 
> else entailed by "consistent execution"?
See 1.10p10.

> 
> > 	                      [Note: It can be shown that programs that
> > 	correctly use simple locks to prevent all inter-thread data
> > 	races, and use no other synchronization operations, behave as
> > 	though the executions of their constituent threads were simply
> > 	interleaved, with each observed value of an object being the
> > 	last value assigned in that interleaving. This is normally
> > 	referred to as "sequential consistency". However, this applies
> > 	only to race-free programs, and race-free programs cannot
> > 	observe most program transformations that do not change
> > 	single-threaded program semantics. In fact, most single-threaded
> > 	program transformations continue to be allowed, since any
> > 	program that behaves differently as a result must perform an
> > 	undefined operation. -end note ]
> 
> Programs with data races cannot even expect loads to return 
> sensible values (at least in absence of common but 
> non-standard extensions).
> I believe that this issue is much more important than the 
> relation to sequential consistency.
I think "undefined behavior" has that implication for people who are
accustomed to reading the standard.

> 
> > 1.10p12:
> > 
> > 	[Note: Compiler transformations that introduce assignments to a
> > 	potentially shared memory location that would not be modified by
> > 	the abstract machine are generally precluded by this standard,
> > 	since such an assignment might overwrite another assignment by a
> > 	different thread in cases in which an abstract machine execution
> > 	would not have encountered a data race. -end note ]
> 
> Here "potentially shared" means any variable that cannot be 
> proven to be thread-local?
Right.  We could be more precise, but this would be in a non-binding
piece of the standard anyway.
> 
> Final point -- the name "precedes" is quite misleading.  An 
> operation A that "precedes" an operation B might in fact 
> happen -after- operation B by wallclock time, courtesy of 
> value-speculation optimizations and store buffers.  The 
> "precedes" relation is at best an after-the-fact artificial 
> ordering that is not necessarily visible to the programmer.
> How about something like "pseudo precedes" to make the 
> artificiality of this relation painfully clear?
I'm not sure this is any worse than "sequenced before" and "happens
before".  All of the logical names seem to have slightly misleading
temporal connotations.

 
> 
> 						Thanx, Paul
> 
> PS.  Please note that a string of _relaxed atomic operations to a
>      single variable does not necessarily generate a happens-before
>      relationship.  Hardware speculative execution is one way to get
>      parallel atomicity.  Architectures that populate memory with
>      ALUs are another -- in this case, the atomic operation might be
>      shipped to memory rather than the memory to the atomic operation.
>      Sufficiently weak memory machines can also have this property.

And they don't in this proposal, except between a store and a load that
sees the stored value.  Later stores in modification order do hide earlier
ones by 1.10p10, but modification order is intentionally not part of
happens-before.

I'll follow up with a draft of a shorter paper that contains only the
1.10 additions, with the changes mentioned above.

Thanks again for going through this.

Hans


