[cpp-threads] Slightly revised memory model proposal (D2300)

Thu Jun 28 18:58:01 BST 2007

Hans wrote on 06/25/2007 03:57:55 PM:

> > Even before considering the compiler, I don't think that this
> > can work. The hardware is free to reorder relaxed loads, even
> > to the same memory location. Particularly on PPC, there is
> > nothing that would prevent the second load from being
> > performed ahead of the first, and introduce some "apparent"
> > flickering. You need the first load to be an acquire to
> > ensure that they are done in program order. Cache coherence
> > applies to the order in which the loads are executed, which
> > is not necessarily their program order.
> I'm having trouble reconciling this with section 1.3.6 in PowerPC
> Book II, version 2.01.  Can you help?
>
> It states, for example: "That is, a processor or mechanism can
> never load a "newer" value first and then, later, load
> an "older" value."

Actually, that's section 1.6.3. And it seems that I've been interpreting it
incorrectly. After consulting other folks at IBM, the consensus is that
PPC will order two loads to the same memory location without the need for
any intervening fences.

However, I still think that this places unnecessary restrictions on the
compiler. Plus, there could be future architectures that do not provide
this guarantee.

> > In any case, preventing reordering of atomic loads by the
> > compiler is definitely a big hammer that will disable many
> > compiler optimizations. I don't think this can be justified
> > by saying that atomic operations will not be frequent enough
> > to matter for performance.
> Remember that we're talking about compiler reordering of atomic
> loads from (at least potentially) the same memory location.  Do you
> have an example where that might matter?

Well, if at least one of the two loads happens through a pointer, the
compiler frequently has to assume that they may be aliased, and thus will
be prevented from reordering them. I believe a frequent situation will be
that the variables in fact do not overlap, but the compiler is unable to
prove it statically; in that case, we're preventing, for no real gain, a
transformation which might be beneficial.
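
To make the aliasing concern concrete, here is a minimal sketch. It is
written with C++11-style std::atomic, which postdates this discussion and
is used here only as a stand-in for the load_relaxed notation; the function
names are hypothetical. Inside sum_relaxed the compiler generally cannot
tell whether p and q refer to the same object, so a rule that orders
relaxed loads from the same location would apply to both loads unless it
can prove that p != q.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: p and q may or may not point to the same atomic.
// From inside this function alone, that usually cannot be decided.
int sum_relaxed(const std::atomic<int>* p, const std::atomic<int>* q) {
    int r1 = p->load(std::memory_order_relaxed);
    int r2 = q->load(std::memory_order_relaxed);
    return r1 + r2;
}

// Usage sketch: both kinds of call are legal, so the function has to be
// compiled to work for either one.
void check_sum_relaxed() {
    std::atomic<int> a{1}, b{2};
    assert(sum_relaxed(&a, &b) == 3);  // distinct objects
    assert(sum_relaxed(&a, &a) == 2);  // same object passed twice
}
```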

Not being able to reorder these loads will prevent many transformations,
and particularly affect instruction scheduling and register allocation.
Given that relaxed atomics will likely be used in performance-sensitive
parts of the application (otherwise the programmer would probably be using
locks), I think that these transformations might matter.

Now, looking at this from a different angle, given that this property is no
longer needed to support lock elimination, why is it being introduced into
the model? Isn't it reasonable to require an acquire operation (either an
acquire load or an acquire fence) after a relaxed load returns a value that
will trigger some action?
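
A minimal sketch of that pattern follows, again in C++11-style std::atomic
syntax as an assumed stand-in for the proposal's notation. The names
payload, ready, producer and consumer are hypothetical. The polling loads
are relaxed, and an acquire fence is issued only once the value that
triggers the action has been seen.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared state: `payload` is ordinary data, `ready` is a flag.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish the write
}

int consumer() {
    // Poll with relaxed loads; these impose no ordering by themselves.
    while (!ready.load(std::memory_order_relaxed)) {
    }
    // The value just read triggers an action, so order it explicitly.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Single-threaded usage sketch so it runs as is; in real use producer()
// and consumer() would be called from different threads.
void check_publish() {
    producer();
    assert(consumer() == 42);
}
```

The fence could equally be replaced by making the final load an acquire
load; the sketch uses a fence so that the polling loop itself stays relaxed.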

PS: I do have a related question. Does the current proposal allow the
coalescing of consecutive relaxed loads to the same variable? For example,
will it be acceptable for the compiler to transform:

      r1 = load_relaxed(&a)
      r2 = load_relaxed(&a)
      r3 = r1 + r2

into

      r1 = load_relaxed(&a)
      r3 = 2*r1

?

How about if the loads are acquire instead of relaxed?
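
For reference, the two forms can be written as the following sketch, using
C++11-style std::atomic as an assumed stand-in for load_relaxed (the
function names are hypothetical). The sketch does not settle whether the
transformation is allowed. It only shows what differs: with a concurrent
writer, two_loads may add two different values of a, while one_load always
returns twice a single value.

```cpp
#include <atomic>
#include <cassert>

// Original form: two relaxed loads of the same atomic variable.
int two_loads(const std::atomic<int>& a) {
    int r1 = a.load(std::memory_order_relaxed);
    int r2 = a.load(std::memory_order_relaxed);
    return r1 + r2;
}

// Coalesced form: the second load has been replaced by a reuse of r1.
int one_load(const std::atomic<int>& a) {
    int r1 = a.load(std::memory_order_relaxed);
    return 2 * r1;
}

// With no concurrent writer, the two forms agree.
void check_coalescing() {
    std::atomic<int> a{21};
    assert(two_loads(a) == 42);
    assert(one_load(a) == 42);
}
```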

Thanks,

> > Going back to the original problem of removability of locks,
> > I do not think that it is an issue, because on your example:
> >
> > > r1 = load_relaxed(&x); fetch_and_add_acq_rel(&dead1, 0);
> > > r2 = load_relaxed(&x); fetch_and_add_acq_rel(&dead2, 0);
> > > r3 = load_relaxed(&x);
> >
> > the three loads of x should not be ordered wrt each other.
> In the current formulation, they are indeed no longer ordered in any
> real sense.  They happen before each other, because they are
> sequenced, and hence appear ordered to that thread, but there is no
> way that another thread can observe the order.  And the behavior is
> no different than without the fetch_and_add operations.  All of which
> I think is correct.
>
> The concern dealt with a version of the model that we've at least
> tentatively abandoned.  We should reconsider it if we have to go
> back to something closer to that model.
>
> Hans

--
Raúl E. Silvera         IBM Toronto Lab   Team Lead, Toronto Portable
Optimizer (TPO)
Tel: 905-413-4188 T/L: 969-4188           Fax: 905-413-4854
D2/KC9/8200/MKM