[cpp-threads] Alternatives to SC

Paul E. McKenney paulmck at linux.vnet.ibm.com
Mon Jan 29 18:42:53 GMT 2007


On Sun, Jan 28, 2007 at 10:48:29PM -0800, Hans Boehm wrote:
> I finally managed to somewhat catch up with this thread.  I'd like
> to quickly clarify my position, since it diverges from what's been
> discussed here.  I'll put together and/or fix some web pages to
> explain some of this:
> 
> 1) I would ideally like to see a memory model with the property that
> every data-race-free program that does not use low-level atomics
> behaves sequentially consistently.  Here "data race free" is ideally
> defined as the absence of potentially simultaneous conflicting actions, where
> conflicting actions are defined more or less as usual.  Aside from
> the exclusion for low-level atomic operations, this is essentially
> the pthread model.  It does have some costs that we need to quantify.
> But it has the, in my opinion huge and unique, advantage that it is
> comprehensible to programmers.

In other words, programs that use pthread_mutex_lock() and friends in
the classic manner (e.g., never accessing lock-protected data except
while holding the corresponding lock), correct?  If so, that makes
sense -- sequential consistency is a property of lock-based code, and
programs written in this style should retain it.
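
For concreteness, here is a minimal sketch of that classic discipline
(the counter and helper names are hypothetical): every access to the
shared counter occurs while holding its mutex, so the program is
data-race free and, under the model you propose, would behave
sequentially consistently.

	#include <pthread.h>

	static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
	static long count;	/* accessed only while holding count_mutex */

	void count_increment(void)
	{
		pthread_mutex_lock(&count_mutex);
		count++;
		pthread_mutex_unlock(&count_mutex);
	}

	long count_read(void)
	{
		long ret;

		pthread_mutex_lock(&count_mutex);
		ret = count;
		pthread_mutex_unlock(&count_mutex);
		return ret;
	}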

> 2) I do not believe that any ordering constraints based on dependencies
> are viable at this level.  I agree that they would be useful.  But all
> attempts at definition have failed, typically because they end up
> imposing compilation constraints on compilation units that do not
> use atomics.  Peter and I have been through long discussions of this
> that are in the archives.  I will try to post a summary shortly.
> The nice thing about atomics with acquire and release ordering constraints
> is that none of this matters.  (This does not mean that dependency-based
> ordering is useless at the machine level.  It is critical for implementing
> Java final fields.  But there the compiler knows what it's doing and
> the dependency doesn't span compilation units.  The JSR133 effort
> arrived at a similar conclusion for explicit dependency-based ordering.)

I am OK with requiring the programmer to mark dependencies that matter,
as is done via rcu_dereference() usage in the Linux kernel.  Then any
unmarked dependency is the compiler's to do with as it will.
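
For concreteness, here is the reader-side idiom as used in the Linux
kernel (gp and struct foo are hypothetical; the RCU API is as in the
kernel's <linux/rcupdate.h>):

	struct foo {
		int a;
	};
	struct foo *gp;		/* RCU-protected pointer, updated elsewhere */

	int reader(void)
	{
		struct foo *p;
		int val = -1;

		rcu_read_lock();
		p = rcu_dereference(gp);	/* marked dependency */
		if (p)
			val = p->a;	/* ordered after the load of gp */
		rcu_read_unlock();
		return val;
	}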

Does this work for you?

> 3) The current N2052 memory model proposal disallows all data races
> on ordinary variables.  This again is consistent with the Posix rules,
> and there are strong reasons for doing so.  Reading a concurrently
> modified data location, as in the counter example, is unsafe with
> current compilers that may correctly perform analyses that assume such
> changes do not occur.  Canonical example:
> 
> 	r1 = x;
> 	if (r1 < 2) {
> 	    ...
> 	    switch(r1) {
> 	      case 0:
> 	      case 1: ...
> 	    }
>         }
> 
> Under suitable conditions, a compiler may
> 
> - Spill r1 during the first ellipses.
> - Reload r1 from x in the switch expression.
> - Implement the switch statement using a branch table, and omit the
> range check, since r1 is known to be in range.

Assuming that r1 is local, this should be no problem.  If r1 is shared,
these transformations would indeed be unsafe, but marking r1 to indicate
that it is shared might make sense.  Note that current compilers can
emit heavyweight code for volatile variables, so volatile as currently
specified is not an adequate way to do such marking.

> This means that a concurrent change to x, even if it is just used
> for statistical purposes, will result in a wild branch.

Yep!!!

> For Java, there are strong reasons to prevent such compiler optimizations.
> For C++, I believe there are not, especially since they are currently
> fairly standard.  And there is no reason not to require programmers
> to label races.
> 
> (I think there have been Linux kernel discussions about dropping
> volatile qualifiers for such loads.  Perhaps that's even happened.
> I think that's not only technically wrong, it's dangerous.)

The problem with volatile is that some compilers go overboard.  To handle
the situations above, one need only enforce a single-access discipline;
the memory barriers that many compilers emit are not needed.

Please note also that the Linux kernel uses the barrier() primitive
to handle many of these situations -- the proposals are -not- simply
to drop "volatile", but rather to replace it with other primitives,
including barrier().  In your example, if x is shared and r1 is local
(not shared), then in Linux one might write:

	r1 = x;
	barrier();
	if (r1 < 2) {
	    ...
	    switch(r1) {
	      case 0:
	      case 1: ...
	    }
	}

The barrier() primitive is '__asm__ __volatile__("": : :"memory")',
which tells gcc that it cannot assume that x is unchanged.  If both
x and r1 were shared (though this does not make much sense to me for
this code fragment), one might introduce a local variable as
follows:

	{
	  int r2;

	  r1 = x;
	  barrier();
	  r2 = r1;
	  barrier();
	  if (r2 < 2) {
	      ...
	      switch(r2) {
	        case 0:
	        case 1: ...
	      }
	  }
	}
	
> (Of course all of this also breaks trivially if the load of x is
> not atomic, as might be the case on some embedded hardware.  Thus
> even a volatile qualifier isn't guaranteed to work, but is probably
> sufficient in practice for Linux.)

Yep, Linux mandates that hardware atomically load and store properly
aligned ints and pointers.
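
As an illustration of relying on that guarantee, a hypothetical
statistical-counter read might force exactly one access with a
volatile cast, without declaring the variable itself volatile (the
function name is made up for this example):

	/* Assumes counterp is properly aligned, so the load is atomic. */
	int read_stat_counter(int *counterp)
	{
		return *(volatile int *)counterp;	/* exactly one access */
	}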

> 4) I have very mixed feelings about load or store fences at this level.
> I think they do make sense for strict RCU and a few similar applications.
> I also think they tend to lead to subtly buggy code, especially in the
> absence of dependence-based ordering for applications that involve
> handing off an object to another thread, which may then write it.
> I currently haven't seen enough evidence that adding this kind of
> ordering constraint (or fence) actually improves performance a lot
> on major platforms.  If it does, I could live with adding it to
> low-level atomics, though it would make the interface even wider.

OK...  What do you mean by "low-level atomics"?  We do need to allow
people to create new interfaces from low-level primitives, even if we
strongly discourage general use of those low-level primitives.  And
just for the record, I agree that use of explicit memory barriers should
be minimized -- this is in fact one reason why they are subsumed into
the RCU API -- most users of the RCU API need not concern themselves
with memory barriers.  But the people creating RCU implementations
-do- need to work with explicit memory barriers.  And yes, there is
more than just me doing this for the Linux kernel.  ;-)
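
For example, on the update side, the write memory barrier needed to
order initialization before publication is subsumed into
rcu_assign_pointer(), so the RCU user never writes an explicit
smp_wmb().  A sketch, with gp and struct foo hypothetical as before:

	struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);

	if (new_fp) {
		new_fp->a = 42;
		/* rcu_assign_pointer() implies the smp_wmb() that
		 * orders initialization before publication. */
		rcu_assign_pointer(gp, new_fp);
	}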

> 5) (relatively minor) Several of us believe that X86 loads are
> not reordered, at least on modern hardware and write-back cacheable
> memory.  I do agree that the spec makes that less than 100% clear,
> something I hope will be fixed.  This assumption is implicit in some
> prior discussions.  (It is clearly documented in the X86 part of
> the Itanium documentation, but that isn't known to apply to all
> X86 processors.)

Getting a clear x86 memory ordering spec that Intel and AMD agree on
would be very good.  It would also be good if the system vendors who
produce large x86 systems agreed as well.  ;-)

							Thanx, Paul


