[cpp-threads] OpenMP Memory Model

Boehm, Hans hans.boehm at hp.com
Sat Apr 15 01:10:17 BST 2006


In response to Bronis' question, a preliminary proposal for a C++ memory
model and related material can be found at

http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/

The current proposal is somewhat informal, and will probably stay that
way, since the final product is intended to be part of a revision to
the C++ standard.  However, if you see any technical issues with the
approach, we would love to know about them.

My impression is that we currently have strong support among the C++
committee.  There is a bit of apprehension about the rather substantial
compiler impact this will have, but I think there is general agreement
that it is unavoidable.

I have had some limited conversations about OpenMP memory models with
Barbara Chapman in the past, though I am clearly not an OpenMP expert.

Details below:

> -----Original Message-----
> From: Greg Bronevetsky [mailto:greg at bronevetsky.com] 
> Sent: Friday, April 14, 2006 1:47 PM
> To: Boehm, Hans
> Cc: Lawrence.Crowl at Sun.com; C++ threads standardisation; 
> bronis at llnl.gov
> Subject: Re: [cpp-threads] OpenMP Memory Model
> 
> 
> > 1) The OpenMP spec ignores things like bit-fields, and the related 
> > issues of when one write may require reading and rewriting other 
> > objects.  That may not matter much in the OpenMP context.
> >   
>     I'm unclear on how C++ deals with bitfields but OpenMP 
> doesn't worry about them explicitly. The general rule is that 
> racing writes cause the variable in question to become 
> undefined. When the writes are both to a bitfield, then it's 
> unclear how many bits become undefined. The same applies to 
> large objects. If there are two racing writes to a given 
> field of a class, then does the whole class become garbage? 
> The latter is addressed in OpenMP by the fact that memory 
> locations are referred to as "variables". Bitfield treatment, 
> however, is unclear and should be clarified in the future.
The current proposal for C++ is to allow bit-field writes to read and
rewrite adjacent bit-fields in the same contiguous sequence of
bit-fields, but to never allow a bit-field write to overwrite an
adjacent non-bit-field.  This is a fairly strong constraint on
compilers, and I am not aware of any current compilers that satisfy it.
But I think there is a strong argument that this is both the right thing
to do, and not terribly expensive.
> 
> > 2) It assumes that compilers cannot introduce new writes 
> (e.g. as part 
> > of speculative register promotion).  We agree that's the 
> proper rule, 
> > but I suspect it's no more true for current implementations 
> than it is 
> > for C++.
> >   
>     The memory model in the paper does not make this 
> assumption. It only provides execution rules for the writes 
> that actually appear in the application's source code. This 
> can be considered to be the "API treatment" of OpenMP. The 
> programmer wrote a program in C++/Fortran and OpenMP. They 
> use a number of functions of OpenMP's API, including reads, 
> writes, locks, barriers, etc. The goal of the formal model is 
> to specify the rules for how those API calls will operate. The 
> underlying implementation (the compiler, for example) may do 
> fancy things, including making additional calls to the API 
> (such as adding writes) but as long as the API calls 
> performed by the application are performed correctly, 
> everything is fine.
>     As such, the formal spec in the paper only addresses how 
> OpenMP responds to the application's source code and nothing 
> else. Note that the spec in the paper does place some 
> limitations on the compiler in that it forces the compiler 
> not to violate read-write dependencies between accesses to 
> different variables (ex: in A=5; B=A; the write to B depends 
> on the read of A) but that is the only restriction and it may 
> not survive to the final spec.
I think this is the right approach.  However, I again believe that there
are relatively few compilers that actually satisfy it.  If you try
compiling

typedef struct s { int data; struct s * next; } s_t;

int count;

void f(s_t * p) {
    s_t *x;
    for (x = p; x != 0; x = x->next) {
        if (x->data > 0) ++count;
    }
}

in many cases (including gcc 4.0.2 on x86) you will discover that count is
promoted to a register during the loop, meaning that it is written
unconditionally, even if the list has only negative elements.  Hence an
invocation of f() on a list with only negative elements conflicts with
another update to count, even though there are no writes to count in the
source.

I think you are effectively prohibiting this, which is good, and exactly
what we are doing.

> > 3) The paper doesn't mention volatile.  I looked at the OpenMP 2.5 
> > spec again.  That does give semantics to volatile, though the 
> > semantics don't look plausible to me.  (The implicit flush 
> operation 
> > seems to be on the wrong side, at least when comparing it to Java 
> > volatile semantics.  And putting it at the previous/next sequence 
> > point seems to me to allow some really weird code, at the 
> expense of 
> > optimization.)  The C++ memory model may or may not mention 
> volatile, 
> > but I'm pretty sure it will not use the OpenMP semantics.
> >   
>     I haven't seen Java's volatile semantics but I don't 
> understand why the flushes are on the wrong side. OpenMP's 
> model for volatile borrows its intuition from release 
> consistency. In RC you write something and then perform a 
> release operation, which blocks until all prior operations 
> have completed. If you read something you precede the read 
> with an acquire operation, which completes before any 
> subsequent operations may begin. Acquires and releases are 
> totally ordered relative to each other. OpenMP's flush is a 
> combination of both acquire and release and so the pattern 
> for volatile variables is the same as above but with a flush 
> operation being used in both cases.
>     Note that if a variable has to be flushed immediately 
> before a read and immediately after a write, this implies 
> that the read cannot come from a register and the write 
> cannot be to a register. Thus, OpenMP's semantics seem to 
> essentially capture the original single-threaded meaning of volatile.
>     Can you be specific about what optimizations become 
> invalid given these semantics? Also, what kinds of alternate 
> semantics for volatile have you been thinking of?
Consider what's probably the canonical use case: one thread sets up a
data structure, and then sets a volatile flag indicating that it's done
(a release operation in the RC sense).  The second thread reads the data
structure after reading the flag (an acquire operation in the RC sense):

Thread 1:
x = ...
flag = true;

Thread 2:
if (flag)
  read x

You presumably also need a #pragma omp atomic before the assignment to
flag.  I'll omit that, but assume it's there.

The OpenMP spec seems to say that since flag is volatile, this is
effectively transformed to

Thread 1:
x = ...
atomic flag = true;
flush;

Thread 2:
flush;
if (flag)
  read x;

This seems to do nothing useful to enforce ordering at all.  The writes
to flag and x can become visible out of order, as can the reads.  I
think you want:

Thread 1:
x = ...
flush;
flag = true;

Thread 2:
if (flag) {
  flush;
  read x;
}

That way if flag reads as true, the second flush follows the first, and
I'm guaranteed to see the correct value of x.

This latter version would be vaguely similar to the Java approach (which
includes the equivalent of "atomic" in "volatile"), but I think still
has some other secondary problems:

1) You are requiring a full memory fence, where weaker and cheaper
primitives like Itanium's ld.acq and st.rel, or regular SPARC TSO loads
and stores are sufficient to give you the acquire/release semantics I
think you really want.  (Java actually gives you stronger and more
expensive semantics, by requiring that a volatile store followed by a
volatile load cannot be reordered.  That's almost always more than you
need, but makes things simpler.  And it's still much cheaper to
implement than what OpenMP requires, since on most architectures it
requires the fence only on the write side.)

2) I think that the placement at the previous and next sequence point
has weird implications, e.g. it means that if v is volatile, and we take
the current definition of volatile as written, then the reads of x and y
in

r1 = x; f(y, v);

are guaranteed to be ordered.  That seems highly counterintuitive, and
I'd be amazed if any implementations actually put a fence there.
Similarly, it's not obvious that you can eliminate common subexpressions
in:

r1 = x/17; f(x/17, v);

since there is an implied intervening flush.

And I'm not sure the notion of "previous sequence point" is well defined
here anyway.  Consider

r1 = x; f((a, b), v)

> ...
> > 5) Is the notion of eclipsing reads designed to prevent 
> variables from 
> > "flip flopping", as in
> >
> > Thread 1:
> > x = 1;
> > flush1;
> > x = 2;
> >
> > Thread 2:
> > flush2;
> > r1 = x;
> > r2 = x;
> >
> > outcome: flush1 Oflsh flush2 && r1 = 2 && r2 = 1?
> >   
>     I don't understand the details of your example. If flush1 precedes
> flush2 in the Flush Order then thread 1's write to x races 
> with thread 2's reads of x. I think that x=2 on thread 1 
> needs to be atomic in order for this to work as above.
I agree.  I meant to label the stores as atomic.
> 
> > If so, I think that would get us back to the "reads kill" 
> problem in 
> > the original Java memory model, which wasn't a good idea, mostly 
> > because it seemed to be too expensive to get anyone to bother to 
> > implement correctly.  The issue arises because in real cases, the 
> > reads in thread
> > 2 may look more like:
> >
> > r0 = *y/17;
> > r1 = *x;
> > r2 = *y/17;
> >
> > With this rule, you can no longer eliminate the common 
> subexpressions 
> > unless you know that *x and *y don't alias.
> >   
>     You are correct. If there are a number of racing atomic 
> writes then it is possible for the first read of *y to get 
> one value while subsequent reads get another. This is 
> actually a good example of why atomic reads should be 
> identified as such since if we knew that *y may only be 
> written to by regular writes, common subexpression 
> elimination would be legal because the only way to break it 
> would be via a race between the write and one of the reads, 
> which would produce garbage data anyhow.
Yes.  And that would get you much closer to Java volatiles, or the
likely semantics of C++ atomics.

> 
> > 6) If a read is involved in a race, the value associated 
> with the read 
> > is undefined, but it does not render the meaning of the 
> entire program 
> > undefined.  I think that's problematic for C++, in that a race can 
> > result in a read of an uninitialized vtable pointer, which can 
> > normally result in a wild branch, and hence any behavior whatsoever.
> > Implementing anything else often requires some sort of memory fence 
> > after the vtable initialization.
> >   
>     I don't understand whether you're commenting on reads in 
> general or reads to key internal C++ data structures. If the 
> latter, then it is outside the scope of OpenMP since the 
> vtable is not directly visible to the application (caveat: 
> it's been years since I last touched C++ so I may be wrong 
> on this) and is thus not explicitly part of the contract 
> between the application and C++ or OpenMP. You are correct 
> that if an implementation of C++ were written on top of 
> OpenMP then it would have this issue and you correctly 
> identify how this issue would be resolved. 
> In particular, a barrier immediately following the 
> initialization of basic C++ data structures would work nicely 
> (barriers include flushes).
>     In general I think that it's key to separate the 
> specification from the application's point of view and any 
> kind of underlying details that have to exist in reality but 
> are invisible to the application itself. 
> Only the former needs to be explicitly specified while the 
> latter needs to be considered from the performance point of 
> view but not actually specified.
The case I'm worried about is (p initially zero, T has "virtual", i.e.
overridable, member function foo):

Thread 1:
	p = new T();  // Implicitly writes vtable, containing pointer to foo.

Thread 2:
	r1 = p;
	if (r1 != 0) p -> foo();

What does it mean for the read of p to return an undefined value?  So
long as that allows p -> foo() to exhibit completely undefined behavior,
I think we're in agreement.  If it doesn't, then I think this becomes
difficult to implement.

What happens with this example if the write to p in thread 1 is atomic?
If I then need to guarantee that p -> foo() doesn't take a wild branch,
I may need a memory fence in thread 2.  But without atomic reads, I have
no hint that might be happening.
> 
> -- 
>                              Greg Bronevetsky
> 
> 


