[cpp-threads] RE: volatile, memory models, threads

Tue Feb 28 19:42:45 GMT 2006

> From:  Nick Maclaren
> [Hans:]
> > Re: alignment and casting
> >
> > This clearly needs to be though through carefully.  I think 
> that with 
> > the atomic<int>-style library solution, we don't have a problem.
> 
> Yes, we do, though it can be hidden.  My point is that it 
> isn't solved AUTOMATICALLY - wrapping actions in functions is 
> NOT the same as restricting the allocation.
Agreed.  But at that point we have a lot of choices for the
representation of atomic<int>.  I think that most of the time we don't
need to take advantage of it.  But on something like PA-RISC that has
severe alignment constraints for the only test-and-set-like instruction,
some of these may be significantly bigger than the underlying type.
> 
> > At the other extreme of redefining volatile, I'm not sure 
> how big the 
> > problem is either.  All standard modern architectures seem to 
> > sufficiently align small scalars that can be atomically loaded or 
> > stored.  Usually not doing so involves as substantial 
> performance hit.
> 
> Provided that you are prepared to restrict yourself to 1, 2, 
> 4 and 8 byte objects, aligned as for their size, most current 
> ones will allow you to do this.  Of course, that doesn't 
> allow you to implement many important graph theoretic 
> algorithms, but ....
Explain, please?  We are not currently proposing atomics as a
replacement for locks.  They are intended to give you access to the
primitives provided by most hardware.  Those tend to be confined to 1,
2, 4, 8, and occasionally 16 byte objects.  (In the case of some M68K
processors, you can atomically update two locations at the same time.)
> 
> However, see below.
> 
> > I know of no architecture that requires a lock to be 
> associated with 
> > an atomically accessed location just to get memory ordering.  ...
> 
> I do.  Lots.  Alpha.  System/370.  Many others I forget.
Alpha has explicit memory fences.  You don't need locks.

IBM z-Series processors seem to guarantee ordering for naturally aligned
1, 2, 4, and 8 byte objects, except for a store followed by a load.  I
suspect you do need a compare-and-swap to enforce store-load ordering.
But I believe that can be to a stack location, and thus it doesn't
affect layout.

A lot of our current understanding of how various architectures work
with respect to ordering and synchronization is encoded in two places:

Doug Lea's JSR 133 Cookbook
(http://gee.cs.oswego.edu/dl/jmm/cookbook.html, relatively easy to
understand, though with a Java bias.)

The implementation of the atomic_ops package.  (Harder to understand,
but I have tried to insert suitable comments.  You can see the relevant
section of the source tree at
http://cvs.sourceforge.net/viewcvs.py/bdwgc/bdwgc/libatomic_ops-1.1/src/
atomic_ops/sysdeps/, particularly in the "gcc" subdirectory.)

> You 
> may (with some reason) claim that they are all obsolete, but 
> the arguments against that are that there are good reasons to 
> believe that applies ONLY to the CURRENT generation of 
> general-purpose SMP systems.
I think that we have to accommodate the IBM z Series and I would
certainly like to accommodate recent Alpha processors.  (HPs web page
states that Alphas will stop shipping near the end of 2006, about 3
years before we expect this standard to take effect.)

I don't believe we can effectively accommodate very early Alphas without
atomic byte stores.  But I don't believe any of those will have been
shipped within 10 years of the adoption of the standard.  I'm not
familiar with similar issues in IBMs 370-390-z-series.  If there are, I
would certainly advocate ignoring pre-2000 hardware there as well.  (The
situation with Java JSR133 and initialization safety on Alpha is
inherently different, I think.  We don't gyarantee initialization
safety.)

> Here are a few of the issues:
> 
>     1) That applies only if the only operations allowed on 
> atomic objects are built into the hardware as basic 
> operations (and note that many architectures have a 
> distinction between basic and compound hardware operations).  
> So that means that the library can specify ONLY load and 
> store (no increment), without implying that actions on atomic 
> data may need a global barrier.
My current tentative proposal, which I urgently need to revise, includes
two flavors of atomic<T>.  One provides a query mechanism to ask whether
the hardware provides the appropriate primitives.  The other does not
promise to be lock-free, and hence allows lock-based emulation.  See
http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/atomics.h.txt

Both flavors require the programmer to explicitly specify ordering
constraints with operations.  Memory fences should be inserted only when
required.
> 
>     2) It isn't just a hardware issue, but constrains how TLB 
> and even machine-check interrupts are handled.  There are 
> system designs where TLBs have the same multiple-reader or 
> single-writer semantics as cache lines (they ARE a form of 
> cache, after all).  It is then ESSENTIAL for the CPU either 
> to retain exclusive control over the cache line while 
> changing the TLB state or to retry the operation in its 
> entirety for something like an increment.
I'm not sure how that affects us.  TLB handling code is inherently
extremely platform dependent, and thus not something we can really
address.  It is not my goal to let you write all lock-free code portably
using this library, since I think that would be an unattainable goal.  I
would like to be able to write user-level code portably while minimizing
(though sometimes not eliminating) the resulting performance loss.  I
expect that will also help in writing kernel code, but not for the
really esoteric cases.  In particular, I'm not convinced we want to
address uncachable memory and the like in the standard.
> 
>     3) Permitting that assumption into the standard prevents 
> an implementation from using locks to extend atomic objects 
> from beyond what the hardware supports (e.g. to allow pairs 
> of pointers, or a pointer+count).  Except, of course, by 
> using a global barrier.
I'm not sure what you mean by a global barrier.  You mean a global lock
to protect access to the atomic data?  As an implementation issue, I
would advocate a global hash table of locks, indexed by the address of
the atomic<> data.  I don't that for the library implementation, we are
precluding adding the lock to each piece of data.  But I don't think
that's a wise default implementation choice, since arrays containing a
billion atomically updatable char's are not that unlikely.

> 
>     4) I have heard that quite a few of the DSP architectures 
> are word-based, and so they would have the Alpha problem of 
> actions on a sub-word not be atomic.  Yes, Alpha relaxed 
> that, but why should DSPs jump through hoops to provide an 
> expensive feature nobody wants?
I think the far more serious problem with those is that if you use

struct { char a; char b; }

and protect a and b with different locks, things break.  Badly.  And
this may happen for globals in addition to data members.  In my view,
that's intolerable.  You can't write reliable multithreaded programs on
such a machine.  Or at least you would have to be intimately familiar
with the specific data layouts used on your machine, and thus it is
hopeless tow rite portable code.

The same thing happens with OpenMP code operating on char arrays, I
believe.  It's just broken.

Thus if you're selling such a machine, I think you have several options:

1. Forget about multithreaded code.
2. Forget about multiprocessors, and hide the problem in the scheduler,
by not allowing rescheduling at inopportune moments.
3. Make sizeof(char) = sizeof(word).
4. Provide a nonstandard programming model.
5. If you really need to export a convenient and standard multithreaded
C++-based (or C, Java, C# -based) programming model, and can't afford to
expand small scalars, redesign the hardware.  It can't be done on the
existing hardware.  Sometimes you have to admit to your mistakes and fix
them.

I would claim that:

a) These are basically the same choices you have now.  (You might now be
able to claim that your C++/pthreads implementation is standard
conforming, but that doesn't change the fact that code written for other
platforms won't run correctly.)
b) Even introducing Java-like volatile semantics doesn't break things
any more than they already are broken.  The same alternatives apply.

Hans