[cpp-threads] Slightly revised memory model proposal (D2300)

Paul E. McKenney paulmck at linux.vnet.ibm.com
Fri Jun 22 18:08:50 BST 2007


On Fri, Jun 22, 2007 at 02:22:30AM -0000, Boehm, Hans wrote:
> Here are the latest revisions of:
> 
> The D2300 memory model.  I sort of followed Lawrence's semi-suggestion
> and included sc stores (not loads) with RMW operations in defining
> synchronizes with.  I also cleaned up the discussion of race-free
> executions in a way that I think is necessary, but may require Clark to
> clean up after my standardese again.
> 
> Some text (mm_rules) to be included in the atomics proposal to explain
> how it meshes with the memory model.
> 
> The sequential consistency proof updated to reflect the memory model
> revisions, and hopefully fixing some other issues.
> 
> Further bug reports would be appreciated.  The C++ paper deadline is
> officially tomorrow.
> 
> Hans

I will need to do more analysis, but based on my first perusal, this does
look like a great improvement over the earlier version!!!  For example,
it seems clear how to integrate the semantics of N2262 fences into this draft.
In contrast, I had no idea how to proceed given the earlier draft.

More careful analysis will take some time, which unfortunately cannot
happen before your deadline.  But in the meantime, some initial comments
(which might be missing subtle changes in your most recent revision):

> Changes to the corresponding section of N2171 include:
> 
>     * Various changes to notes, including fixing an editing mistake in
>       1.10p10, and the addition of some explanatory notes.
>     * Switched to a weaker "synchronizes with" formulation in which a
>       load must see either the value of the store itself, or a derivative
>       obtained by a sequence of RMW operations.
>     * Rephrased the interaction of modification order and visibility
>       in 1.10p10. This version imposes stronger restrictions, and
>       generally disallows "flickering" of values.
>     * Switched to a more conventional happens-before definition, which
>       includes sequenced-before.
>     * Switched to a more conventional formulation in which the "precedes"
>       relation was replaced by a statement that no evaluation can see
>       a store that happens after it, or is sequenced after it.
>     * Explicitly mentioned the input when referring to a data race.
>       A program may exhibit a data race on some inputs and have
>       well-defined semantics on others.

The addition of this last item is very helpful!

>       This avoids some concerns about synchronization elimination. In
>       particular, the old formulation allowed (everything initially
>       zero, atomic):
> 
>       Thread1                                 Thread2
>       r1 = x.load_relaxed(); // yields 1      r1 = y.load_relaxed(); // yields 1
>       y.store_relaxed(1);                     x.store_relaxed(1);
> 
>       but disallowed the corresponding test case if the two statements in
>       each thread were separated by acq_rel read-modify-write operations
>       to dead variables, or a locked region. That interferes with
>       the elimination of locks if the compiler decides to statically
>       combine threads, which is likely to become important for certain
>       programming styles.

But doesn't the above commentary go with the previous point ("Switched
to a more conventional...")?
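
For concreteness, the disallowed variant would be something like the
following sketch (member-function spellings follow the example above;
the fetch_add_acq_rel() spelling is purely illustrative and may not
match the actual atomics proposal), with d1 and d2 being atomic
variables that are dead apart from these updates:

        Thread1                                 Thread2
        r1 = x.load_relaxed(); // yields 1?     r1 = y.load_relaxed(); // yields 1?
        d1.fetch_add_acq_rel(1);                d2.fetch_add_acq_rel(1);
        y.store_relaxed(1);                     x.store_relaxed(1);

The old formulation forbade the r1 == 1 outcome in both threads here,
so a compiler that statically combined the two threads could not simply
delete the dead read-modify-writes (or an analogous lock/unlock pair).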

[ . . . ]

> 1.10p7:

[ . . . ]

>     An evaluation A that performs a release operation on an object
>     M synchronizes with an evaluation B that performs an acquire
>     operation on M and reads either the value written by A or, if
>     the following (in modification order) sequence of updates to M
>     are atomic read-modify-write operations or ordered atomic stores,

Again, I believe that this should instead read "non-relaxed
read-modify-write operations" in order to allow some common data-element
initialization optimizations.  (A sketch of what is at stake follows the
quoted passage below.)

>     a value written by one of these read-modify-write operations or
>     ordered stores. [Note: In some cases, particularly for sequentially
>     consistent atomic operations, the library specification strengthens
>     this to acquire operations that read any later (in modification
>     order) value. Except in those cases, reading a later value does not
>     necessarily ensure visibility as described below. Such a requirement
>     would sometimes interfere with efficient implementation. —end note ]
>     [Note: The specifications of the synchronization operations define
>     when one reads the value written by another. For atomic variables,
>     the definition is clear. All operations on a given lock occur in a
>     single total order. Each lock acquisition "reads the value written"
>     by the last lock release. —end note ]
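
To make my "non-relaxed" concern above concrete, consider the following
sketch (member-function spellings are again illustrative and may not
match the actual atomics proposal); data is an ordinary int and flag is
atomic, both initially zero:

        Thread1:
                data = 42;
                flag.store_release(1);
        Thread2:
                flag.fetch_add_relaxed(1);
        Thread3:
                r1 = flag.load_acquire();  // suppose this yields 2
                r2 = data;                 // guaranteed to be 42?

With the wording above, the relaxed read-modify-write in Thread2 is part
of the update sequence, so Thread1's release store synchronizes with
Thread3's acquire load and r2 is guaranteed to be 42.  With the
"non-relaxed read-modify-write operations" wording, Thread2's relaxed
read-modify-write would break the chain, so Thread3 could no longer rely
on seeing data == 42 unless Thread2 used a stronger operation.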

And explicit memory fences now fit naturally into the "synchronizes with"
definition, perhaps something like the following:

	An evaluation A synchronizes with an evaluation D in the presence
	of acquire and release memory fences if (1) the evaluation
	A is sequenced before a fence with acquire semantics which
	is sequenced before an evaluation B, (2) an evaluation C
	is sequenced before a fence with release semantics which
	is sequenced before the evaluation D, and (3) evaluation
	C reads from an object M either the value written by B or
	some following value (in modification order), but only in
	the case that all intervening modifications to M have been
	carried out by non-relaxed read-modify-write operations.
	[Note: using N2262 nomenclature, acquire_memory_fence(),
	acq_rel_memory_fence(), and ordered_memory_fence() have acquire
	semantics, while release_memory_fence(), acq_rel_memory_fence(),
	and ordered_memory_fence() have release semantics. ---end note]

In other words, manual insertion of acquire and release fences has
the same semantics as the equivalent acquire and release operations.
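
Spelled out in the two-thread shape that the wording above describes
(A, B, C, and D as in the wording; m is an atomic object, and the
spellings other than the N2262 fence names are purely illustrative):

        Thread X                        Thread Y
        A;                              C;  // reads from m the value B wrote
        acquire_memory_fence();         release_memory_fence();
        B;  // writes m                 D;

Per the wording, A then synchronizes with D, and hence happens before it.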

However, this does not capture all of the much stronger semantics of
ordered fences (e.g., Dekker and IRIW).  One way to capture this might
be as follows, though this does need more thought:

	An evaluation A synchronizes with an evaluation D in the presence of
	ordered memory fences if (1) the evaluation A is sequenced before
	an ordered fence which is sequenced before an evaluation B, (2)
	an evaluation C is sequenced before an ordered fence which is
	sequenced before the evaluation D, and (3) evaluation B precedes
	evaluation C with respect to some memory object M's modification
	order.	Evaluation B can precede C in any of the following
	ways, all with respect to M's modification order: (a) C reads a
	value from M that is either written by B or some later value,
	(b) C writes a value to M that follows that read from M by B,
	(c) C writes a value to M that follows that written to M by B,
	or (d) C reads a value from M that follows that read from M by B.
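
For example, Dekker-style mutual exclusion appears to be covered by
case (b).  A sketch (spellings as in the earlier example, fence name
from N2262; x and y atomic and initially zero):

        Thread1                             Thread2
        x.store_relaxed(1);     /* A */     y.store_relaxed(1);     /* C */
        ordered_memory_fence();             ordered_memory_fence();
        r1 = y.load_relaxed();  /* B */     r2 = x.load_relaxed();  /* D */

If r1 == 0, then Thread2's store to y (C) follows, in y's modification
order, the value that Thread1's load (B) read, which is case (b).  Then
A synchronizes with D, so A happens before D, and D must therefore see
x == 1.  Hence r1 and r2 cannot both be zero, which is the guarantee one
wants from the ordered fences in this pattern.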

Peter, Raul, thoughts?

> 1.10p10:

[ . . . ]

>     A multi-threaded execution is consistent if each thread observes
>     values of objects that obey the following constraints:
> 
>         * No evaluation happens before itself.
>         * No read access B to a scalar object observes a side effect A,
> 	  such that B happens before A.
>         * Each read access B to a scalar object M observes the value
> 	  assigned to M by a side effect A only if there is no other
> 	  side effect X to M such that
>               o A happens before X, and
>               o X happens before B. 
>         * If a side effect A to scalar object M happens before another
> 	    side effect B to the same scalar object M, then A must precede
> 	    B in Ms modification order.
>         * If read access A to scalar B object M observes a value a, and
> 	    A happens before read access B to the same scalar object M which
> 	    observes b, then the corresponding assignment of b to M may
> 	    not precede the assignment of a to M in M's modification order.

My guess is that "scalar B object M" should instead read "scalar object M".
Also, this last bullet assumes b != a, so shouldn't that be made explicit?
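
For reference, the sort of "flickering" this last bullet rules out
(presumably what the change list above refers to) would be something
like the following, with m atomic and initially zero, spellings
illustrative:

        Thread1                         Thread2
        m.store_relaxed(1);             r1 = m.load_relaxed();
        m.store_relaxed(2);             r2 = m.load_relaxed();

Because the first load is sequenced before, and hence happens before,
the second, the outcome r1 == 2 && r2 == 1 is forbidden: the assignment
observed by the second load may not precede the one observed by the
first in m's modification order.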

On the "scalar" modifier used here, this means that we are not proposing
atomic access to arbitrary structures, for example, atomic<struct foo>? 
(Fine by me, but need to know.)  However, we -are- requiring that
implementations provide atomic access to large scalars, correct?  So that
a conforming 8-bit system would need to provide proper semantics for
"atomic<long long>"?  Not that there are likely to be many parallel 8-bit
systems, to be sure, but same question for 32-bit systems and both long
long and double.  (Again, fine by me either way, but need to know.)
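
The concern with large scalars is of course the usual torn-access one,
for example (spellings illustrative; c is an atomic<long long>,
initially zero, on a 32-bit machine):

        Thread1                                 Thread2
        c.store_relaxed(0x100000000LL);         r1 = c.load_relaxed();
        c.store_relaxed(0x0FFFFFFFFLL);

If accesses to c are not required to be atomic, the load could observe
half of each store and yield a value such as 0x1FFFFFFFF that was never
stored.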

Although the general thrust of 1.10p8 and 1.10p10 looks OK to me, more
careful analysis will be required -- which won't happen by your deadline,
sorry to say!

						Thanx, Paul


