[cpp-threads] Alternatives to SC

Mon Jan 29 06:48:29 GMT 2007

I finally managed to somewhat catch up with this thread.  I'd like
to quickly clarify my position, since it diverges from what's been
discussed here.  I'll put together and/or fix some web pages to
explain some of this:

1) I would ideally like to see a memory model with the property that
every data-race-free program that does not use low-level atomics
behaves sequentially consistently.  Here "data race free" is ideally
defined as potentially simultaneous conflicting actions, where
conflicting actions are defined more or less as usual.  Aside from
the exclusion for low-level atomic operations, this is essentially
the pthread model.  It does have some costs that we need to quantify.
But it has the, in my opinion huge and unique, advantage that it is
comprehensible to programmers.

2) I do not believe that any ordering constraints based on dependencies
are viable at this level.  I agree that they would be useful.  But all
attempts at definition have failed, typically because they end up
imposing compilation constraints on compilation units that do not
use atomics.  Peter and I have been through long discussions of this
that are in the archives.  I will try to post a summary shortly.
The nice thing about atomics with acquire and release ordering constraints
is that none of this matters.  (This does not mean that dependency-based
oredring is useless at the machine level.  It is critical for implementing
Java final fields.  But there the compiler knows what it's doing and
the dependency doesn't span compilation units.  The JSR133 effort
arrived at a similar conclusion for explicit dependency-based ordering.)

3) The current N2052 memory model proposal disallows all data races
on ordinary variables.  This again is consistent with the Posix rules,
and there are strong reasons for doing so.  Reading a concurrently
modified data location, as in the counter example, is unsafe with
current compilers that may correctly perform analyses that assume such
changes do not occur.  Canonical example:

	r1 = x;
	if (r1 < 2) {
	    ...
	    switch(r1) {
	      case 0:
	      case 1: ...
	    }
        }

Under suitable conditions, a compiler may

- Spill r1 during the first ellipses.
- Reload r1 from x in the switch expression.
- Implement the switch statement using a branch table, and omit the
range check, since r1 is known to be in range.

This means that a concurrent change to x, even if it is just used
for statistical purposes, will result in a wild branch.

For Java, there are strong reasons to prevent such compiler optimizations.
For C++, I believe there are not, especially since they are currently
fairly standard.  And there is no reason not to require programmers
to label races.

(I think there have been Linux kernel discussions about dropping
volatile qualifiers for such loads.  Perhaps that's even happened.
I think that's not only technically wrong, it's dangerous.)

(Of course all of this also breaks trivially if the load of x is
not atomic, as might be the case on some embedded hardware.  Thus
even a volatile qualifier isn't guaranteed to work, but is probably
sufficient in practice for Linux.)

4) I have very mixed feelings about load or store fences at this level.
I think they do make sense for strict RCU and a few similar applications.
I also think they tend to lead to subtlely buggy code, especially in the
absence of dependence-based ordering for applications that involve
handing off an object to another thread, which may then write it.
I currently haven't seen enough evidence that adding this kind of
ordering constraint (or fence) actually improves performance a lot
on major platforms.  If it does, I could live with adding it to
low-level atomics, though it would make the interface even wider.

5) (relatively minor) Several of us believe that X86 loads are
not reordered, at least on modern hardware and write-back cacheable
memory.  I do agree that the spec makes that less than 100% clear,
something I hope will be fixed.  This assumption is implicit in some
prior discussions.  (It is clearly documented in the X86 part of
the Itanium documentation, but that isn't known to apply to all
X86 processors.)

Hans