[cpp-threads] Yet another visibility question

Boehm, Hans hans.boehm at hp.com
Tue Dec 19 23:51:21 GMT 2006


Backing up a minute, and trying to reconstruct how we got here in the
first place:

1) We want unordered atomic operations, since they are cheaper in some
contexts, and are occasionally sufficient.  They are, unfortunately,
tricky to use correctly and, as we keep discovering, even trickier to
define correctly.

2) Given that we have unordered atomics, there are two distinct reasons
for wanting to impose some ordering on them anyway, in the case of two
clearly dependent operations:

(a) If we don't do so, we don't have a tractable way to prohibit "out
of thin air" results.  The canonical example is (everything except the
r1, r2 "registers" is an unordered atomic, initially zero):

Thread 1: r1 = x; y = r1;
Thread 2: r2 = y; x = r2;
Questionable outcome: r1 = r2 = 42.

With simple happens-before consistency, this is allowed if there is no
ordering between the two assignments in each thread, since each load can
see the result of the store in the other thread.  So long as we pick
identical values for r1 and r2 we can construct an "execution" that
justifies it.  Allowing this is somewhat nasty, since it makes it
harder to reason about programs.  Can I potentially generate a pointer
to a local variable whose address wasn't taken "out of thin air" in this
way?  (This probably doesn't really affect the compiler, since real
implementations will not allow this.  But it causes a problem for higher
level tools, I suspect.  For Java, the problem is much worse for
security reasons.)
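Spelled out with the load_raw spelling used below (purely a sketch;
atomic<int> and the store_raw counterpart are my guesses at an
interface, not anything settled):

atomic<int> x(0), y(0);    // unordered atomics, initially zero

void thread1() {
  int r1 = x.load_raw();   // unordered atomic load
  y.store_raw(r1);         // unordered atomic store
}

void thread2() {
  int r2 = y.load_raw();
  x.store_raw(r2);
}

// Questionable outcome under plain happens-before consistency:
// r1 == r2 == 42, justified only by a circular "execution".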

(b) Some ordering properties along these lines in fact seem to be
provided by all hardware.  It would be nice to take advantage of those.
Sometimes such dependency-based ordering can be used to avoid the
overhead of explicit ordering.  One such example is Peter's attempt to
minimize ordering constraints in reference counting.  This was reduced
to

>>>> // x y initially zero
>>>>
>>>> // threads 1 to N:
>>>>
>>>> assert( x == 0 );
>>>>
>>>> if( fetchadd_release( &y, 1 ) == N - 1 ) {
>>>>    x = 1;
>>>> }
where we need the fetchadd_release to become visible before "x = 1;"
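For context, the reference-counting shape this was distilled from is
roughly the following (a sketch of my own with assumed spellings, not
Peter's exact code); the hope is that the control dependence on the
fetchadd's result, rather than an explicit acquire, orders the
destruction:

struct Counted {
  int refcount;            // touched only through atomic fetchadd
  // ... payload ...
};

void release_reference(Counted* p) {
  // Release: this thread's earlier writes to *p become visible no
  // later than its decrement.
  if (fetchadd_release(&p->refcount, -1) == 1) {
    // Last owner.  The destruction's stores are conditioned on the
    // value read by the fetchadd; the hope is that this dependence
    // keeps them ordered after it.
    delete p;
  }
}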

3) Some of the (2b) examples, like Peter's here, require
dependency-based ordering at least between an unordered atomic load
(such as the one that's essentially part of fetchadd_release) and an
ordinary store.

My earlier proposal to deal with this was
http://www.decadentplace.org.uk/pipermail/cpp-threads/2006-December/001211.html .
Ben pointed out a relatively uncontroversial correction in
http://www.decadentplace.org.uk/pipermail/cpp-threads/2006-December/001212.html ,
and I think 1.10p7 can be further simplified by relying on transitive
ordering.

The most relevant changes were

1) for 1.10p5 to order an unordered atomic load followed by even an
ordinary store, and

2) for 1.10p7 to have all atomic reader-writer correspondences, even
those involving unordered operations, induce synchronizes-with
ordering.  Acquire-release operations would now differ only in the
inter-thread-ordered-before relationships.

I now think the proposed 1.10p5 change doesn't work, and I have serious
doubts about the original.  The fundamental problem is that all of this
requires the compiler to preserve dependencies, and that may invalidate
optimizations in compilation units that don't mention atomics.

The proposed condition for dependence-based inter-thread-ordered-before
was

"A is an unordered atomic read and B writes a value which depends on the
value read by A, or the execution of B is conditioned on the value read
by A."

I was thinking about trying to resolve this with a restatement (which
would require translation to proper standardese, if it had gotten that
far):

"A is an unordered atomic read and B, an evaluation that occurs as part
of evaluation of an expression E in A's
function, writes a value which depends on the value read by B.  "Depends
on" here should be taken to mean that  if A were to read some other
value consistent with its type, then the evaluation of E either

	- Does not take place, or
	- Does not result in the same value being written to the same
object"

This would mean that if I write

r1 = y_initialized.load_raw();
if (r1) {
  y.x = 1;
} else {
  y.x = 1;
}

then each assignment to y.x is inter-thread-ordered-after the load_raw,
since each of the y.x = 1 assignments may not take place depending on
the value of r1.
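
As an aside, the first disjunct of the restatement is meant to cover
plain value dependences as well; a minimal sketch, using the same
load_raw spelling:

r1 = y_initialized.load_raw();
y.x = r1 + 1;   // the value written depends on the value read, so this
                // store is inter-thread-ordered-after the load_raw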

However, if I instead write

r1 = y_initialized.load_raw();
foo(r1);

and someplace else

void foo(bool r1)
{
  if (r1) {
    y.x = 1;
  } else {
    y.x = 1;
  }
}

then under this restatement the stores in foo are not
inter-thread-ordered-after the load_raw at all, since they do not occur
as part of an expression in the load_raw's function, and a compiler
compiling foo in isolation remains free to merge the two branches.

This restatement does seem to be (barely) implementable, since the
optimization impact of a load_raw is confined to the function, and
hence the compilation unit, in which it appears.  But it has several
remaining problems.  I think the showstopper is that even the first
example breaks if I move the load_raw call into a separate function.
Hence I would expect it to work in practice only for badly structured
code.
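
To make the optimization worry concrete: a compiler that sees only
foo's translation unit, and no atomics, could legitimately produce
something like the following (a sketch):

// Plausible output of an optimizer compiling foo in isolation: the two
// identical stores are merged and the test of r1 disappears, leaving
// nothing for dependence-based ordering to attach to.
void foo(bool)
{
  y.x = 1;
}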

Better ideas?

Obvious alternatives include:

- A more Java-like causality treatment.  That is still considerably
more complicated to state (see
http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.4.8),
and I'm not sure it's quite as precise as we would like with respect to
the correspondence of actions to source expressions either, which seems
to be one of the sticking points here.  There are probably other
contenders, like Vijay Saraswat's recent work
(http://www.saraswat.org/rao.html), but I'm not convinced that any of
these could easily go into a language standard.

- Dropping unordered atomics.  I'm unconvinced that's viable on
architectures like PowerPC and ARM, where acquire loads and release
stores require a fence.  It would result in a lot of platform-specific
low-level code for those platforms.

- Allowing "out of thin air" results for unordered atomics, guaranteeing
no dependency-based ordering, and putting up with more expensive
reference counting, etc.
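
For comparison, the "more expensive" form of the reference-count
release sketched earlier would need explicit acquire ordering on the
final decrement instead of relying on the dependence (again assumed
spellings; fence_acquire is not a settled name):

void release_reference(Counted* p) {
  if (fetchadd_release(&p->refcount, -1) == 1) {
    fence_acquire();   // explicit acquire ordering, paid for because we
                       // no longer rely on the dependence
    delete p;
  }
}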

> -----Original Message-----
> From:  Peter Dimov
> Sent: Tuesday, December 19, 2006 10:11 AM
> To: C++ threads standardisation
> Subject: Re: [cpp-threads] Yet another visibility question
> 
> Hans Boehm wrote:
> 
> > Otherwise consider a variant of the above example:
> >
> > r1 = y_initialized.load_raw();
> > if (r1) {
> >   y.foo();
> > } else {
> >   y.bar();
> > }
> >
> > Assume that y_initialized is set in another thread after
> > initializing y.  I'm now in a situation in which this code is
> > correct unless foo() and bar() happen to assign the same value to
> > the same field of y.  In that case, the stores would no longer be
> > dependent, and hence could happen before initialization.  But
> > clearly it shouldn't be my business to know how foo() and bar() are
> > implemented.
> 
> If we assume that foo and bar are opaque, the code is 
> incorrect anyway due to a possible data race between the 
> potential reads from y in foo and bar and the initialization 
> of y in the other thread. 
> 

But with the proposed changes, this would no longer be a race.  We would
have

initialization of y inter-thread-ordered-before
release store of y_initialized synchronizes-with
y_initialized.load_raw() inter-thread-ordered-before
foo() or bar() calls
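
Concretely, the writer side assumed in this chain would look something
like (a sketch; initialize() and store_release are hypothetical
spellings):

// Writer thread: initialize y, then publish it with a release store.
initialize(y);                   // ordinary, non-atomic initialization
y_initialized.store_release(1);  // release store; synchronizes-with the
                                 // reader's load_raw under the proposed
                                 // 1.10p7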

Hans


