[cpp-threads] RE: volatile, memory models, threads

Nick Maclaren nmm1 at cus.cam.ac.uk
Tue Feb 28 21:49:08 GMT 2006


"Boehm, Hans" <hans.boehm at hp.com> wrote:
>
> Agreed.  But at that point we have a lot of choices for the
> representation of atomic<int>.  I think that most of the time we don't
> need to take advantage of it.  But on something like PA-RISC that has
> severe alignment constraints for the only test-and-set-like instruction,
> some of these may be significantly bigger than the underlying type.

Yes, but ....  I am happy to agree that a single atomic integer should
be all that is REQUIRED, but what I am concerned about is that many
people seem to be assuming that an atomic integer (and perhaps a single
pointer) is all that is NEEDED.  This will (almost inevitably) lead to a
specification such that it is all that is IMPLEMENTABLE on many
architectures.

I don't think that such a decision should be made without thought as
to the consequences.

> Explain, please?  We are not currently proposing atomics as a
> replacement for locks.  They are intended to give you access to the
> primitives provided by most hardware.  Those tend to be confined to 1,
> 2, 4, 8, and occasionally 16 byte objects.  (In the case of some M68K
> processors, you can atomically update two locations at the same time.)

Is that a useful target?  Far too many languages of the past have
created their specifications from existing hardware limitations and
features, and virtually every one has regretted it a decade down the
line.

There are many important algorithms that are not practical to use if the
only atomic primitive is to update a single, scalar object.  Parallel,
unlocked insertion into a doubly linked list is one example of many.
I feel that a decent language specification (C++ in this case) should
not make difficulties for implementations wanting to do better.  But
that decision has consequences.

> Alpha has explicit memory fences.  You don't need locks.

LL/SC (load-locked/store-conditional) loops are equivalent to locks,
except that they are not
for high CPU-count, high-communication SMP programs, there would have
been major problems.

> IBM z-Series processors seem to guarantee ordering for naturally aligned
> 1, 2, 4, and 8 byte objects, except for a store followed by a load.  I
> suspect you do need a compare-and-swap to enforce store-load ordering.
> But I believe that can be to a stack location, and thus it doesn't
> affect layout.

Grrk.  It wasn't even remotely true on the System/370 series, and it
wasn't that simple on the System/390 one; my guess is that it still
isn't that simple.  The typical symptom of an assumption that is almost,
but not completely, true is that you get INCREDIBLY obscure failures
with very low probability.

> A lot of our current understanding of how various architectures work
> with respect to ordering and synchronization is encoded in two places:
> 
> Doug Lea's JSR 133 Cookbook
> (http://gee.cs.oswego.edu/dl/jmm/cookbook.html, relatively easy to
> understand, though with a Java bias.)
> 
> The implementation of the atomic_ops package.  (Harder to understand,
> but I have tried to insert suitable comments.  You can see the relevant
> section of the source tree at
> http://cvs.sourceforge.net/viewcvs.py/bdwgc/bdwgc/libatomic_ops-1.1/src/
> atomic_ops/sysdeps/, particularly in the "gcc" subdirectory.)

I will try to look at them :-(

> My current tentative proposal, which I urgently need to revise, includes
> two flavors of atomic<T>.  One provides a query mechanism to ask whether
> the hardware provides the appropriate primitives.  The other does not
> promise to be lock-free, and hence allows lock-based emulation.  See
> http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/atomics.h.txt

When I last looked, it didn't address my concerns.  We can discuss them
in Berlin, but they can be summarised as whether the standard should be
written to enable implementations to implement general atomic types in
software.

> I'm not sure how that affects us.  TLB handling code is inherently
> extremely platform dependent, and thus not something we can really
> address.  It is not my goal to let you write all lock-free code portably
> using this library, since I think that would be an unattainable goal.  I
> would like to be able to write user-level code portably while minimizing
> (though sometimes not eliminating) the resulting performance loss.  I
> expect that will also help in writing kernel code, but not for the
> really esoteric cases.  In particular, I'm not convinced we want to
> address uncachable memory and the like in the standard.

That's NOT my point!!!!!

What I am saying is that certain LANGUAGE features can assume or
require certain operating system implementations or configuration.
There are some catastrophic mistakes with most threading libraries in
the areas of scheduling, processor affinity, exception handling and even
I/O.  I can't tell you how much chaos the last caused with the early
parallel compilers, and how serious its constraints are on parallelism
even today.

I really DO mean that it is common for an unprivileged parallel
application to REQUIRE a system to be configured to meet its
assumptions.  Been there - done that :-(

My point is that the STANDARD should minimise its impact on such things,
because otherwise you WILL then get into a position where some poor sod
like me needs to reboot between running standard language X and standard
language Y, and the users won't even know what language their
application is using.

> I'm not sure what you mean by a global barrier.  You mean a global lock
> to protect access to the atomic data?

Yes.

>  As an implementation issue, I
> would advocate a global hash table of locks, indexed by the address of
> the atomic<> data.  I don't think that for the library implementation,
> we are precluding adding the lock to each piece of data.  But I don't
> think that's a wise default implementation choice, since arrays
> containing a billion atomically updatable chars are not that unlikely.

Now consider thread A atomically updating bytes N:M and thread B doing
so to P:Q, where N:M and P:Q overlap.  Should that be forbidden?

> I think the far more serious problem with those is that if you use
> 
> struct { char a; char b; }
> 
> and protect a and b with different locks, things break.  Badly.  And
> this may happen for globals in addition to data members.  In my view,
> that's intolerable.  You can't write reliable multithreaded programs on
> such a machine.  Or at least you would have to be intimately familiar
> with the specific data layouts used on your machine, and thus it is
> hopeless to write portable code.

Well, yes, but there are a zillion forms of the generic problem.

> The same thing happens with OpenMP code operating on char arrays, I
> believe.  It's just broken.
> 
> Thus if you're selling such a machine, I think you have several options:
> 
> 1. Forget about multithreaded code.
> 2. Forget about multiprocessors, and hide the problem in the scheduler,
> by not allowing rescheduling at inopportune moments.

That doesn't work, I am afraid.  Random users deadlocking the system, if
you are lucky.

> 3. Make sizeof(char) = sizeof(word).
> 4. Provide a nonstandard programming model.
> 5. If you really need to export a convenient and standard multithreaded
> C++-based (or C, Java, C# -based) programming model, and can't afford to
> expand small scalars, redesign the hardware.  It can't be done on the
> existing hardware.  Sometimes you have to admit to your mistakes and fix
> them.

Dare I mention the Alpha in this context? :-)

> I would claim that:
> 
> a) These are basically the same choices you have now.  (You might now be
> able to claim that your C++/pthreads implementation is standard
> conforming, but that doesn't change the fact that code written for other
> platforms won't run correctly.)
> b) Even introducing Java-like volatile semantics doesn't break things
> any more than they already are broken.  The same alternatives apply.

No dissent there.  All I am arguing is that we need to agree on what
the aim of the atomic model should be before designing a specification.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


