[cpp-threads] Slightly revised memory model proposal (D2300)

Paul E. McKenney paulmck at linux.vnet.ibm.com
Mon Jun 11 02:13:07 BST 2007


On Sat, Jun 09, 2007 at 10:12:55PM -0700, Hans Boehm wrote:
>
> On Wed, 6 Jun 2007, Raul Silvera wrote:
>
> >
> > Hans, I mentioned this privately, but I wanted to make sure it is voiced
> > on the list as well.
> >
> > I think it is a mistake to disallow Peter Dimov's example (for reference):
> >
> > Thread 1                    Thread 2                    Thread 3
> > x=1                         y=1                         if (load_acquire(&v)==2)
> > fetch_add_release(&v,1)     fetch_add_release(&v,1)         assert (x+y==2)
> >
> > I think this is too counterintuitive and borders on unusable. Also, it
> > makes fetch_add_release(&v, 0) different from a noop.
>
> This certainly bothers me as well.
> >
> > Furthermore, I believe it is unnecessary, since PPC (which I assume is
> > the architecture that is triggering this change) has mechanisms to
> > efficiently implement this particular situation. Basically, dependence
> > ordering on the fetch_add_release causes a cheap acquire between the
> > load and store of v. On other architectures, I suspect the acquire is
> > implied or at least it is cheap enough that it can always be introduced.
>
> Again, I think the problem is not that it is difficult to ensure that
> this example produces the desired outcome.  The problem is that if we
> include a rule that guarantees it, and combine it with other properties
> that seem similarly essential in small examples, we inadvertently
> constrain more complicated examples such that they are no longer
> efficiently implementable.  The example in question really is Sarita's.

The key question in Sarita's example seems to me to be the interaction
between cache coherence and ordering when multiple variables are involved.

> > My proposal is to single out RMW operations and indicate they must at
> > least have a local order between their load and store, which is I believe
> > all that is needed for this example to fit the model. Note that this
> > doesn't require a change to the memory model, but to the definition of
> > the RMW operations.
>
> I have been assuming that atomic RMWs are viewed as single actions in the
> ordering which, if I understand correctly, is at least as strong as
> what you are proposing.  At least in the current proposal, this doesn't
> help, since there is no happens-before ordering between the first
> and second fetch_add_release (or between the store of the first one
> and the load of the second, if you want to view it that way.)
>
> I think we do have the option of dropping the cache coherence requirement
> instead.  That would allow things like
>
> Thread 1:
> store_relaxed(&x, 1);
>
> Thread 2:
> store_relaxed(&x, 2);
>
> Thread 3:
> r1 = load_acquire(&x); (1)
> r2 = load_acquire(&x); (2)
> r3 = load_acquire(&x); (1)
>
> but would support Peter's example.

Eek!!!  In my mind, permitting stores to "flicker" into existence is
a cure that is far, far worse than any disease I can imagine.  Such
"flickering" would break quite a bit of code.

How about instead -slightly- weakening the ordering requirements,
so that ordered reads can disagree about the order of unordered stores
to different variables?

Looking at Sarita's example as given in Vijay's email dated March 19th,
but de-PPC-izing it:

	T1: [1] X=2
	T2: [2] r2=X(2) [2a] acquire_fence() [3] r3=Y(0)
	T3: [4] Y=1 [4a] release_fence()  [5] X=1
	T4: [6] r1=X(1) [6a] acquire_fence() [7] r2=X(2)
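
For concreteness, here is roughly how this litmus test might be written
out in C++ atomics notation.  This is only a sketch: it assumes that
acquire_fence() and release_fence() correspond to std::atomic_thread_fence(),
and it renders the plain accesses to X and Y as relaxed atomic operations;
the bracketed labels and observed values from the example appear as comments.

	// Sketch only; assumed mapping onto std::atomic and
	// std::atomic_thread_fence(), not the proposal's exact API.
	#include <atomic>
	#include <thread>

	std::atomic<int> X{0}, Y{0};

	void t1() {
	    X.store(2, std::memory_order_relaxed);                // [1] X=2
	}

	void t2() {
	    int r2 = X.load(std::memory_order_relaxed);           // [2] observes 2
	    std::atomic_thread_fence(std::memory_order_acquire);  // [2a]
	    int r3 = Y.load(std::memory_order_relaxed);           // [3] observes 0
	    (void)r2; (void)r3;
	}

	void t3() {
	    Y.store(1, std::memory_order_relaxed);                // [4] Y=1
	    std::atomic_thread_fence(std::memory_order_release);  // [4a]
	    X.store(1, std::memory_order_relaxed);                // [5] X=1
	}

	void t4() {
	    int r1 = X.load(std::memory_order_relaxed);           // [6] observes 1
	    std::atomic_thread_fence(std::memory_order_acquire);  // [6a]
	    int r2 = X.load(std::memory_order_relaxed);           // [7] observes 2
	    (void)r1; (void)r2;
	}

	int main() {
	    std::thread a(t1), b(t2), c(t3), d(t4);
	    a.join(); b.join(); c.join(); d.join();
	}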

So how about the following:

The two stores to X are unordered -- they really happened concurrently,
and the system arbitrarily selected an order for them after the fact.
T2's reads are not sufficient to establish the ordering of the stores
to X and to Y -- unless the stores had in fact been explicitly ordered,
which they are not.  So T2 did in fact see X=2 before it saw Y=1,
but it cannot assume that other CPUs will see this same order.

As noted back in March, this can happen in real hardware due to things
like shared store buffers and caches.

If, on the other hand, the example were as follows:

	T2: [2] r2=X(2) [2a] acquire_fence() [3] r3=Y(0)
	T3: [3x] X=2 [3y] release_fence() [4] Y=1 [4a] release_fence()  [5] X=1
	T4: [6] r1=X(1) [6a] acquire_fence() [7] r2=X(2)

In this case, T2's reads -can- attest to the store ordering of X and Y,
and T4's observed ordering would be disallowed.
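
The only change relative to the first variant is T3, which now orders its
own stores to X around the store to Y.  Under the same assumed fence mapping
as in the sketch above (and reusing its X and Y), T3 would look something like:

	void t3_ordered() {
	    X.store(2, std::memory_order_relaxed);                // [3x] X=2
	    std::atomic_thread_fence(std::memory_order_release);  // [3y]
	    Y.store(1, std::memory_order_relaxed);                // [4]  Y=1
	    std::atomic_thread_fence(std::memory_order_release);  // [4a]
	    X.store(1, std::memory_order_relaxed);                // [5]  X=1
	}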

Returning to Peter's example, and converting it to Sarita Normal Form:

	T1: X=1; fetch_add_release(&V,1);
	T2: Y=1; fetch_add_release(&V,1);
	T3: if (load_acquire(&V)==2) assert (X+Y==2);
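
As a sketch in C++ atomics notation, assuming fetch_add_release() and
load_acquire() correspond to fetch_add() and load() with release and acquire
ordering respectively (X and Y are written as relaxed atomics here only so
that the sketch has no data race if the ordering in question does not hold):

	#include <atomic>
	#include <cassert>
	#include <thread>

	std::atomic<int> X{0}, Y{0}, V{0};

	void t1() {
	    X.store(1, std::memory_order_relaxed);
	    V.fetch_add(1, std::memory_order_release);
	}

	void t2() {
	    Y.store(1, std::memory_order_relaxed);
	    V.fetch_add(1, std::memory_order_release);
	}

	void t3() {
	    if (V.load(std::memory_order_acquire) == 2)
	        assert(X.load(std::memory_order_relaxed) +
	               Y.load(std::memory_order_relaxed) == 2);  // the outcome in question
	}

	int main() {
	    std::thread a(t1), b(t2), c(t3);
	    a.join(); b.join(); c.join();
	}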

Now, I would argue against fetch_add_relaxed() imposing a
happens-before between the two increments, so I would be OK with the
following failing:

	T1: X=1; release_fence(); fetch_add_relaxed(&V,1);
	T2: Y=1; release_fence(); fetch_add_relaxed(&V,1);
	T3: if (load_acquire(&V)==2) assert (X+Y==2);
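
In the same sketched notation (again an assumed mapping, with release_fence()
rendered as std::atomic_thread_fence() and reusing the declarations from the
previous sketch), T1 would become the following, with T2 symmetric in Y:

	void t1_relaxed() {
	    X.store(1, std::memory_order_relaxed);
	    std::atomic_thread_fence(std::memory_order_release);  // release_fence()
	    V.fetch_add(1, std::memory_order_relaxed);
	}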

But it seems to me that we have the option of imposing the happens-before
in the case of the fetch_add_acquire(), fetch_add_release(), and
fetch_add_ordered() variants.  So, returning to the fetch_add_release()
variant, this would result in the following:

	X=1 -> T1:fetch_add_release(&V,1)
	Y=1 -> T2:fetch_add_release(&V,1)
	T1:fetch_add_release(&V,1) -s> X+Y # since load_acquire gives 2
	T2:fetch_add_release(&V,1) -s> X+Y # since load_acquire gives 2
	load_acquire(&V) -> X+Y

and also one or the other of:

	T1:fetch_add_release(&V,1) -> T2:fetch_add_release(&V,1)
	T2:fetch_add_release(&V,1) -> T1:fetch_add_release(&V,1)

Is this sufficient, or am I missing something?

> Effectively, racing stores can appear to repeatedly become visible and
> invisible again, but each load still has to return a consistent value.
> This also feels very unintuitive, but it may have less practical impact.

As you might have guessed from my initial reaction, I believe that this
"flickering stores" approach will cause more problems than giving up the
implied ordering we have been assuming for reads of independent stores.

> Modification order would be used only in defining "later stored values"
> in the synchronizes with definition, and wouldn't constrain visibility.
>
> I'm starting to think that this might be a better approach?
>=20
> (Again I'm uncomfortable about all of this without a better understanding
> of the PowerPC model than I currently have.  I'm currently just looking
> at ways to get "the right" answer for Sarita's example.)

The more I think about it, the more I am becoming convinced that
Sarita's example belongs with IRIW -- if you want it, use the _ordered
variants of the primitives.

						Thanx, Paul

> Hans
> >
> > --
> > Raúl E. Silvera         IBM Toronto Lab   Team Lead, Toronto Portable
> > Optimizer (TPO)
> > Tel: 905-413-4188 T/L: 969-4188           Fax: 905-413-4854
> > D2/KC9/8200/MKM
> >
> > From: "Boehm, Hans" <hans.boehm at hp.com>
> > Sent by: cpp-threads-bounces at decadentplace.org.uk
> > To: <paulmck at linux.vnet.ibm.com>, <cpp-threads at decadentplace.org.uk>
> > Reply-To: C++ threads standardisation <cpp-threads at decadentplace.org.uk>
> > Date: 06/04/07 07:43 PM
> > Subject: [cpp-threads] Slightly revised memory model proposal (D2300)
> >
> > Attached is an early draft of a document with just the threads changes
> > from N2171, adjusted a little as a result of feedback from Paul and
> > others.
> >
> > Note that the "D" in D2300 indicates that this is not stable, and
> > multiple versions of this document will share the same number.
> > Hopefully I will remember to update the date.  This will eventually turn
> > into N2300.
> >
> > Hans
> > (See attached file: D2300.html)


