[cpp-threads] OpenMP Memory Model
Greg Bronevetsky
greg at bronevetsky.com
Fri Apr 14 21:46:58 BST 2006
> 1) The OpenMP spec ignores things like bit-fields, and the related
> issues of when one write may require reading and rewriting other
> objects. That may not matter much in the OpenMP context.
>
I'm unclear on how C++ deals with bit-fields, but OpenMP doesn't
address them explicitly. The general rule is that racing writes cause the
variable in question to become undefined. When the writes are both to a
bit-field, it's unclear how many bits become undefined. The same
applies to large objects: if there are two racing writes to a given
field of a class, does the whole class become garbage? The latter
is addressed in OpenMP by the fact that memory locations are referred to
as "variables". Bit-field treatment, however, is unclear and should be
clarified in the future.
> 2) It assumes that compilers cannot introduce new writes (e.g. as part
> of speculative register promotion). We agree that's the proper rule,
> but I suspect it's no more true for current implementations than it is
> for C++.
>
The memory model in the paper does not make this assumption. It only
provides execution rules for the writes that actually appear in the
application's source code. This can be considered the "API
treatment" of OpenMP: the programmer wrote a program in C++/Fortran and
OpenMP, using a number of functions of OpenMP's API, including reads,
writes, locks, barriers, etc. The goal of the formal model is to specify
the rules for how those API calls will operate. The underlying
implementation (the compiler, for example) may do fancy things,
including making additional calls to the API (such as adding writes),
but as long as the API calls performed by the application are performed
correctly, everything is fine.
As such, the formal spec in the paper only addresses how OpenMP
responds to the application's source code and nothing else. Note that
the spec in the paper does place one limitation on the compiler, in
that it forces the compiler not to violate read-write dependencies
between accesses to different variables (e.g., in A=5; B=A; the write to
B depends on the read of A), but that is the only restriction and it may
not survive to the final spec.
> 3) The paper doesn't mention volatile. I looked at the OpenMP 2.5 spec
> again. That does give semantics to volatile, though the semantics don't
> look plausible to me. (The implicit flush operation seems to be on the
> wrong side, at least when comparing it to Java volatile semantics. And
> putting it at the previous/next sequence point seems to me to allow some
> really weird code, at the expense of optimization.) The C++ memory
> model may or may not mention volatile, but I'm pretty sure it will not
> use the OpenMP semantics.
>
I haven't seen Java's volatile semantics, but I don't understand why
the flushes are on the wrong side. OpenMP's model for volatile borrows
its intuition from release consistency (RC). In RC you write something and
then perform a release operation, which blocks until all prior
operations have completed. If you read something, you precede the read
with an acquire operation, which completes before any subsequent
operations may begin. Acquires and releases are totally ordered relative
to each other. OpenMP's flush is a combination of both acquire and
release, so the pattern for volatile variables is the same as above
but with a flush operation used in both cases.
Note that if a variable has to be flushed immediately before a read
and immediately after a write, this implies that the read cannot come
from a register and the write cannot be to a register. Thus, OpenMP's
semantics seem to essentially capture the original single-threaded
meaning of volatile.
Can you be specific about what optimizations become invalid given
these semantics? Also, what kinds of alternate semantics for volatile
have you been thinking of?
> 4) The notion of atomic operations seems quite different, in that
>
> (a) There is no need to label reads of atomic variables specially. An
> ordinary read that races with an atomic write is OK. I don't think this
> is a good design decision, but it seems to be the way OpenMP works.
>
> (b) Atomic operations are what we would call "unordered". Given that
> the corresponding reads are not special and hence unordered, this
> probably doesn't lose anything.
>
> I expect the resulting notion of atomics usually requires lots of
> explicit "flush"es, which I think are expensive to implement on
> architectures like X86 or IA64, since I think they require StoreLoad
> ordering.
>
You are correct. The idea behind atomic operations is twofold. There
is the obvious part: performing a read and a write as one atomic
operation. The other is that regular writes are not allowed to race
with each other or with reads, which would make any kind of
synchronization via variables impossible; atomic writes are allowed to
race with each other and with all reads.
However, I can see how this can limit optimizations, as shown in
your common subexpression elimination example below.
> 5) Is the notion of eclipsing reads designed to prevent variables from
> "flip flopping", as in
>
> Thread 1:
> x = 1;
> flush1;
> x = 2;
>
> Thread 2:
> flush2;
> r1 = x;
> r2 = x;
>
> outcome: flush1 Oflsh flush2 && r1 = 2 && r2 = 1?
>
I don't understand the details of your example. If flush1 precedes
flush2 in the Flush Order then thread 1's write to x races with thread
2's reads of x. I think that x=2 on thread 1 needs to be atomic in order
for this to work as above.
> If so, I think that would get us back to the "reads kill" problem in the
> original Java memory model, which wasn't a good idea, mostly because it
> seemed to be too expensive to get anyone to bother to implement
> correctly. The issue arises because in real cases, the reads in thread
> 2 may look more like:
>
> r0 = *y/17;
> r1 = *x;
> r2 = *y/17;
>
> With this rule, you can no longer eliminate the common subexpressions
> unless you know that *x and *y don't alias.
>
You are correct. If there are a number of racing atomic writes, then
it is possible for the first read of *y to get one value while
subsequent reads get another. This is actually a good example of why
atomic reads should be identified as such: if we knew that *y may
only be written by regular writes, common subexpression elimination
would be legal, because the only way to break it would be a race
between the write and one of the reads, which would produce garbage data
anyhow.
> 6) If a read is involved in a race, the value associated with the read
> is undefined, but it does not render the meaning of the entire program
> undefined. I think that's problematic for C++, in that a race can
> result in a read of an uninitialized vtable pointer, which can normally
> result in a wild branch, and hence any behavior whatsoever.
> Implementing anything else often requires some sort of memory fence
> after the vtable initialization.
>
I'm not sure whether you're commenting on reads in general or
on reads of key internal C++ data structures. If the latter, then it is
outside the scope of OpenMP, since the vtable is not directly visible to
the application (caveat: it's been years since I last touched C++, so I
may be wrong on this) and is thus not explicitly part of the contract
between the application and C++ or OpenMP. You are correct that if an
implementation of C++ were written on top of OpenMP, it would have
this issue, and you correctly identify how it would be resolved.
In particular, a barrier immediately following the initialization of
basic C++ data structures would work nicely (barriers include flushes).
In general, I think it's key to separate the specification as seen
from the application's point of view from any underlying details that
have to exist in reality but are invisible to the application itself.
Only the former needs to be explicitly specified; the latter needs
to be considered from a performance point of view but not actually
specified.
--
Greg Bronevetsky