[cpp-threads] RE: volatile, memory models, threads

Tue Feb 28 09:57:48 GMT 2006

I have removed the CCs, because they are accumulating.  I don't like
double copies, anyway.  If someone isn't on the list, please pass this
on (if relevant).

> Re: alignment and casting
>
> This clearly needs to be thought through carefully.  I think that with
> the atomic<int>-style library solution, we don't have a problem.

Yes, we do, though it can be hidden.  My point is that it isn't solved
AUTOMATICALLY - wrapping actions in functions is NOT the same as
restricting the allocation.
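
To make the distinction concrete, here is a minimal C++ sketch.  It is
an illustration only, not part of the proposal being discussed, and the
names AtomicInt, is_aligned_for and storage are hypothetical.  The
wrapper type states an alignment requirement, but a cast into a byte
buffer can still produce an address that violates it; the functions
that operate on the wrapper have no say over where the object lives.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper: its operations could be made to look atomic,
// but nothing here controls where an AtomicInt is allowed to live.
struct AtomicInt {
    alignas(4) std::int32_t value;
};

// Hypothetical helper: true if p satisfies T's alignment requirement.
template <typename T>
bool is_aligned_for(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// A byte buffer whose start is suitably aligned for AtomicInt.
alignas(AtomicInt) unsigned char storage[16];

// An object placed at the start of the buffer meets the requirement.
bool aligned_slot_ok() { return is_aligned_for<AtomicInt>(storage); }

// One byte further on it does not, yet a cast such as
// reinterpret_cast<AtomicInt*>(storage + 1) would still compile.
bool shifted_slot_ok() { return is_aligned_for<AtomicInt>(storage + 1); }
```

So the allocation has to be restricted separately, for example by
forbidding such casts or by checking the address before use.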

> At the other extreme of redefining volatile, I'm not sure how big the
> problem is either.  All standard modern architectures seem to
> sufficiently align small scalars that can be atomically loaded or
> stored.  Usually not doing so involves a substantial performance hit.

Provided that you are prepared to restrict yourself to 1, 2, 4 and 8
byte objects, aligned as for their size, most current ones will allow
you to do this.  Of course, that doesn't allow you to implement many
important graph theoretic algorithms, but ....

However, see below.

> I know of no architecture that requires a lock to be associated with an
> atomically accessed location just to get memory ordering.  ...

I do.  Lots.  Alpha.  System/370.  Many others I forget.  You may
(with some reason) claim that they are all obsolete, but the
counter-argument is that there are good reasons to believe that this
freedom from locks applies ONLY to the CURRENT generation of
general-purpose SMP systems.
Here are a few of the issues:

    1) That applies only if the only operations allowed on atomic
objects are built into the hardware as basic operations (and note
that many architectures have a distinction between basic and compound
hardware operations).  So that means that the library can specify
ONLY load and store (no increment), without implying that actions
on atomic data may need a global barrier.

    2) It isn't just a hardware issue, but constrains how TLB and
even machine-check interrupts are handled.  There are system designs
where TLBs have the same multiple-reader or single-writer semantics
as cache lines (they ARE a form of cache, after all).  It is then
ESSENTIAL for the CPU either to retain exclusive control over the
cache line while changing the TLB state or to retry the operation
in its entirety for something like an increment.

    3) Permitting that assumption into the standard prevents an
implementation from using locks to extend atomic objects beyond
what the hardware supports (e.g. to allow pairs of pointers,
or a pointer+count).  Except, of course, by using a global barrier.

    4) I have heard that quite a few of the DSP architectures are
word-based, and so they would have the Alpha problem of actions on
a sub-word not being atomic.  Yes, Alpha relaxed that, but why should
DSPs jump through hoops to provide an expensive feature nobody
wants?
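
As an illustration of point 3, here is a minimal sketch of a
pointer+count object made atomic by associating a lock with it.  It
assumes std::mutex and std::lock_guard, which were standardised in
C++11, after this message was written; the names CountedPtr,
LockedCountedPtr and exchange_and_bump are hypothetical.  It is one
possible shape for such an extension, not a description of any
particular implementation.

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical pair that is wider than many machines can load or
// store as a single hardware operation.
struct CountedPtr {
    void* ptr;
    std::size_t count;
};

// Hypothetical lock-backed atomic: each object carries its own lock,
// so no global barrier is needed to make compound updates indivisible.
class LockedCountedPtr {
public:
    LockedCountedPtr() : value_{nullptr, 0} {}

    CountedPtr load() const {
        std::lock_guard<std::mutex> guard(lock_);
        return value_;
    }

    void store(CountedPtr v) {
        std::lock_guard<std::mutex> guard(lock_);
        value_ = v;
    }

    // Replace the pointer and increment the count as one step,
    // returning the previous pair.
    CountedPtr exchange_and_bump(void* p) {
        std::lock_guard<std::mutex> guard(lock_);
        CountedPtr old = value_;
        value_.ptr = p;
        ++value_.count;
        return old;
    }

private:
    mutable std::mutex lock_;  // the lock associated with this location
    CountedPtr value_;
};
```

A standard that assumed no atomically accessed location ever has an
associated lock would rule out this design, which is the constraint
described in point 3.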

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679