[cpp-threads] [Javamemorymodel-discussion] there's a happens-before order here. right?

Boehm, Hans hans.boehm at hp.com
Mon Dec 15 20:38:34 GMT 2008


[I think all of this is relevant to Java only in that vaguely similar extensions have been considered there at times.  Most of it doesn't currently translate, in spite of the fact that the basic memory models are similar.]

> From: Alexander Terekhov [mailto:alexander.terekhov at gmail.com]
> On Sat, Dec 13, 2008 at 2:04 AM, Boehm, Hans
> <hans.boehm at hp.com> wrote:
> [...]
> >> http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html#examples
> >>
> >> would you label the following C++0x program
> >>
> >>     int data; //= 0
> >>     atomic<int> x; //= 0
> >>
> >>     thread 1:
> >>     ------------
> >>     data = 1;
> >>     x.store(1, release);
> >>
> >>     thread 2:
> >>     ------------
> >>     if (x.load(relaxed))
> >>       data = 2;
> >>
> >> data-race-free or not? Why? TIA.
> >>
> > This is well in the "escapes from sequential consistency" category,
> > and it doesn't currently have a Java analog.
>
> Yes. I'm actually driving at
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
>
> > As I said, I was really trying to steer most people well away from
> > this kind of code.
>
> The problem is that std::atomic<> in SC mode is utterly
> expensive on POWER. Even in acquire-release mode it will
> inject way too much totally redundant (lw/i)sync(hronization).
I agree that this can be an issue on some current architectures, as it is with Java volatiles.  But the above example can be fixed by either:

1) Using an acquire load, or

2) Using a memory_order_consume load, and making the value stored into data data-dependent on the loaded value.

The former costs an isync.  The latter is very ugly, but only costs an add or so.  There is unfortunately a fairly high cost to sticking to SC in this case (which I would hope will decrease over time, at least for single-socket implementations, but we don't know that for sure), but the low-level approaches do get rid of most of the cost.
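
Concretely, the two fixes might look like this (a minimal sketch in C++0x as currently drafted; note that implementations are allowed to, and in practice may, treat consume as acquire, so the cheapness of the second fix isn't guaranteed):

  #include <atomic>

  int data;            // = 0
  std::atomic<int> x;  // = 0

  // Fix 1: acquire load.  On POWER this costs an isync after the load.
  void thread2_acquire() {
      if (x.load(std::memory_order_acquire))
          data = 2;
  }

  // Fix 2: consume load, with the stored value made data-dependent on the
  // loaded value, so the dependency (not the branch) carries the ordering.
  void thread2_consume() {
      int r = x.load(std::memory_order_consume);
      if (r)
          data = 1 + r;  // == 2 when r == 1; data-dependent ("an add or so")
  }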

>
> > I'll admit it sometimes makes sense in really performance-critical code
> > that has to run on machines on which SC operations are expensive to
> > implement.
>
> If performance isn't critical then there isn't much need for
> std::atomic<>... I'd simply use locks.
In C++ at least, I wouldn't go that far.  Lock-free atomics tend to be useful for communication with signal handlers or with other processes, where locks respectively risk deadlock or harm fault tolerance.
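
For instance, about the only safe way to communicate with a signal handler is a lock-free atomic flag.  A minimal sketch (assuming std::atomic<int> is lock-free on the target; the names are just for illustration):

  #include <atomic>
  #include <csignal>

  std::atomic<int> pending{0};  // lock-free, so safe to touch in a handler

  extern "C" void on_sigint(int) {
      // A mutex here could deadlock if the signal interrupted the holder;
      // a lock-free store cannot.
      pending.store(1, std::memory_order_relaxed);
  }

  int main() {
      std::signal(SIGINT, on_sigint);
      while (!pending.load(std::memory_order_relaxed)) {
          // do work
      }
  }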

>
> >
> > This has a data race.
>
> This is quite surprising to me given your first two examples.
> I just can't conceive of a platform that would produce a race on "data"
> in the example above and NOT break your first (and second)
> example as well by not respecting the control dependency, which
> imposes ordering on the load-store sequence in thread 1 (and
> thread 2 in your second example).
>
> Let's take your first example and slightly change it by
> renaming y to "data" and using std::atomic<> for x:
>
>     int data; //= 0
>     atomic<int> x; //= 0
>
>     thread 1:
>     ------------
>     if (x.load(relaxed) > 0)
>       data = 1;
>
>     thread 2:
>     ------------
>     data = 2;
>
> This is still data-race-free, right? I hope so.
>
> Now let's introduce a store to std::atomic<> x in thread 2.
>
>     int data; //= 0
>     atomic<int> x; //= 0
>
>     thread 1:
>     ------------
>     if (x.load(relaxed) > 0)
>       data = 1;
>
>     thread 2:
>     ------------
>     data = 2;
>     x.store(-1, release);
>
> I just can't conceive how that could possibly introduce a
> race on "data"...
>
> The next step is to change x.store(-1, release) to x.store(+1,
> release) :
>
>     int data; //= 0
>     atomic<int> x; //= 0
>
>     thread 1:
>     ------------
>     if (x.load(relaxed) > 0)
>       data = 1;
>
>     thread 2:
>     ------------
>     data = 2;
>     x.store(+1, release);
>
> This just ought to be data-race-free as well!!!
I agree with Anthony's answers.  Only the last one has a data race.
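
Spelling out the happens-before argument for that last example:

  // thread 2:  data = 2;              (A)
  //            x.store(+1, release);  (B)
  // thread 1:  if (x.load(relaxed))   (C)  -- may read the value stored by B
  //                data = 1;          (D)
  //
  // A relaxed load does not synchronize with the release store it reads
  // from, so there is no happens-before edge from B to C, and hence none
  // from A to D.  In an execution in which C reads 1, A and D are
  // conflicting accesses to data unordered by happens-before: a data race,
  // the control dependence notwithstanding.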

The most convincing reasons for leaving it that way are:

1. Not all hardware enforces ordering as a result of control dependences.  For the hardware that does, the definitions of "control dependence" vary.  I'm not sure that any hardware would actually reorder the load and the store in thread 1, but I've been at this too long to bet large sums of money that they don't.

2. Compilers don't always respect this sort of control dependence.  They have to, when violating it would introduce a data race, as seems to be the case here.  But there are other cases, like the one below, where they generally wouldn't.
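
The distinction is roughly this (a sketch; the transformation shown is hypothetical):

  if (x.load(memory_order_relaxed))
      data = 2;
  // A compiler may NOT rewrite this as the unconditional
  //     data = 2;
  // since that stores even in executions where x stays 0, potentially
  // introducing a race with another thread's accesses to data.  When both
  // branches store the same thing, as in the example below, eliminating
  // the branch introduces no such race, so the compiler is free to do it.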

>
> > To see that you really have to go back to reasoning based on
> > "happens before".
>
> Well, then "happens before" needs some fixing, I believe.
>
> >
> > It's hard to conceive of a case in which this particular code could
> > actually cause problems in practice.  And one could conceive of
> > extensions to memory_order_consume in the standard to handle this.
> > But trying to generally guarantee ordering for the two memory
> > operations in the second thread gets tricky fairly quickly.  You
> > probably don't want the two memory operations in
> >
> > if (r1 = x.load(memory_order_relaxed)) {
> >    data.store(2, memory_order_relaxed);
> > } else {
> >    data.store(2, memory_order_relaxed);
> > }
> >
> > to be ordered, for example.
>
> Why not?
Any compiler that cares about code size is likely to transform this to

r1 = x.load(memory_order_relaxed);
data.store(2, memory_order_relaxed);

and go from there, reordering at will, and allowing the hardware to reorder.

Preventing this is hard, since the actual code may look like

inside f():

  r1 = x.load(memory_order_relaxed);
  g(r1);

inside g(), in a separate compilation unit:

  if (r1) {
    data.store(2, memory_order_relaxed);
  } else {
    data.store(2, memory_order_relaxed);
  }

Thus when the conditional inside g() is being compiled, the compiler has no idea that eliminating the branch may break the ordering.
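
That is, a size-conscious compiler may quite reasonably compile g() down to something like (a hypothetical but plausible result):

  void g(int r1) {
      // branch folded away: both arms were identical
      data.store(2, memory_order_relaxed);
  }

at which point nothing in g() ties the store to the load that produced r1.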

If this is important enough to support, the right way is almost certainly to strengthen the semantics of memory_order_consume (which currently enforces ordering based on data dependencies) to cover this case, rather than changing the meaning of memory_order_relaxed.  I wouldn't be opposed to such a proposal if it included some important use cases.  But we're running out of time for this iteration of the standard, and I don't think there's anything to prevent this kind of strengthening in a later round.
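
To illustrate the distinction memory_order_consume currently draws (a sketch; the pointer p and its surroundings are assumed for illustration):

  int *q = p.load(memory_order_consume);
  int v = *q;   // data dependency through q: ordered after the load

  int r = x.load(memory_order_consume);
  if (r)
      data = 2; // control dependency only: NOT ordered today; this is the
                // case a strengthened memory_order_consume would cover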

> The above is simply a "high-level" expression of "Safe
> Fetch" from Book II:
>
> http://www.power.org/resources/reading/PowerISA_V2.05.pdf
> (see B.2.3 Safe Fetch)
>
> -----
> If a load must be performed before a subsequent store [...]
>
> In this example it is assumed that the address of the storage
> operand to be loaded is in GPR 3, the contents of the storage
> operand are returned in GPR 4, and the address of the storage
> operand to be stored is in GPR5.
>
>   lwz  r4,0(r3) #load shared data
>   cmpw r4,r4    #set CR0 to "equal"
>   bne- $-8      #branch never taken
>   stw  r7,0(r5) #store other shared data
> -----
>
> If I did not care about the implied ordering due to the control
> dependency, I'd simply write:
>
>   r1 = x.load(memory_order_relaxed);
>   data.store(2, memory_order_relaxed);
>
> Note that
>
>   r1 = x.load(memory_order_acquire);
>   data.store(2, memory_order_relaxed);
>
> is more expensive than "Safe Fetch"...
Right.  But the above is a PowerPC-specific hack, which doesn't work on other architectures.  For example, Itanium does generally respect dependencies, but with a more restrictive definition, which certainly doesn't include not-taken branches.  I'm not sure that ARM promises to enforce even the ordering in the original example.  Past experience suggests that we don't want to rely on properties that "all hardware should obviously enforce", unless they're explicitly stated.

Hans
>
> regards,
> alexander.
>


