[cpp-threads] RE: volatile, memory models, threads

Tue Feb 28 22:50:18 GMT 2006

> 
> I don't think that such a decision should be made without 
> thought as to the consequences.
> 
> > Explain, please?  We are not currently proposing atomics as a 
> > replacement for locks.  They are intended to give you access to the 
> > primitives provided by most hardware.  Those tend to be 
> confined to 1, 
> > 2, 4, 8, and occasionally 16 byte objects.  (In the case of 
> some M68K 
> > processors, you can atomically update two locations at the 
> same time.)
> 
> Is that a useful target?  Far too many languages of the past 
> have created their specifications from existing hardware 
> limitations and features, and virtually every one has 
> regretted it a decade down the line.
> 
> There are many important algorithms that are not practical to 
> use if the only atomic primitive is to update a single, 
> scalar object.  Parallel, unlocked insertion into a doubly 
> linked list is one example of many.
> I feel that a decent language specification (C++ in this 
> case) should not make difficulties for implementations 
> wanting to do better.  But that decision has consequences.
I don't think we know how to do better.  We could try to provide general
support for software transactions.  But I think that's an open research
problem.  And the performance you could get for a general implementation
on existing hardware probably would not make this a popular alternative.
And there's probably nothing beyond ugly syntax to prevent it from being
implemented in a higher layer.

In my mind all we can hope to do is provide access to low-level hardware
primitives, and lay the groundwork for higher level libraries that
provide standard implementations of lock-free data structures.

I personally think there are serious limits as to how many low level
differences you can hide in the low level interface.  Even if you
consider a linked stack, perhaps the simplest possible data structure,
you are driven to many different lock-free (or nearly lock-free)
implementation choices depending on the underlying hardware.  For
example, Itanium's cmp8xchg16 primitive leads to a different algorithm
from the Intel's EM64T cmpxchg16, which leads to different choices from
the original AMD64 instruction set which had neither.  It seems clear
that feature tests unfortunately have to be part of the story at some
level.
> 
> > Alpha has explicit memory fences.  You don't need locks.
> 
> LLC/STC loops are equivalent to locks, except that they are 
> not guaranteed to progress.  If any large Alpha SMP systems 
> had been used for high CPU-count, high-communication SMP 
> programs, there would have been major problems.
On Alpha, those don't ensure ordering, which I think was the issue here.
But Alpha has always had an MB (memory barrier) instruction to do so.
Lock implementations need to use both.  To simply ensure memory
ordering, MB suffices.
> 
> ...
> What I am saying is that certain LANGUAGE features can assume 
> or require certain operating system implementations or configuration.
> There are some catastrophic mistakes with most threading 
> libraries in the areas of scheduling, processor affinity, 
> exception handling and even I/O.  I can't tell you how much 
> chaos the last caused with the early parallel compilers, and 
> how serious its constraints are on parallelism even today.
> 
> I really DO mean that it is common for an unprivileged 
> parallel application to REQUIRE a system to be configured to 
> meet its assumptions.  Been there - done that :-(
I can certainly believe that, but I also think this is unacceptable in
the future, especially if we are aiming for mass market acceptance of
multicore processors.
> 
> My point is that the STANDARD should minimize its impact on 
> such things, because otherwise you WILL then get into a 
> position wheresome poor sod like me needs to reboot between 
> running standard language X and standard language Y, and the 
> users won't even know what language their application is using.
I agree that we cannot require processors to run in strange modes, or
the like.  But I don't see where we're doing that.
> 
> > I'm not sure what you mean by a global barrier.  You mean a global 
> > lock to protect access to the atomic data?
> 
> Yes.
> 
> >  As an implementation issue, I
> > would advocate a global hash table of locks, indexed by the 
> address of 
> > the atomic<> data.  I don't that for the library implementation, we 
> > are precluding adding the lock to each piece of data.  But I don't 
> > think that's a wise default implementation choice, since arrays 
> > containing a billion atomically updatable char's are not 
> that unlikely.
> 
> Now consider thread A atomically updating bytes N:M and 
> thread B doing so to P:Q, where N:M and P:Q overlap.  Should 
> that be forbidden?
To first order approximation, yes, unless N==P and M==Q.  There is an
issue with "double wide" compare-and-swap operations.
> > ...
> > Thus if you're selling such a [word addressed] machine, I think you
have 
> several options:
> > 
> > 1. Forget about multithreaded code.
> > 2. Forget about multiprocessors, and hide the problem in the 
> > scheduler, by not allowing rescheduling at inopportune moments.
> 
> That doesn't work, I am afraid.  Random users deadlocking the 
> system, if you are lucky.
I oversimplified a bit.  You use restartable critical sections, i.e. you
let the user process specify that if you are preempted at certain
locations, you should be restarted at a slightly different instruction.
> 
> > 3. Make sizeof(char) =3D sizeof(word).
> > 4. Provide a nonstandard programming model.
> > 5. If you really need to export a convenient and standard 
> > multithreaded
> > C++-based (or C, Java, C# -based) programming model, and 
> can't afford 
> > C++to
> > expand small scalars, redesign the hardware.  It can't be 
> done on the 
> > existing hardware.  Sometimes you have to admit to your 
> mistakes and 
> > fix them.
> 
> Dare I mention the Alpha in this context? :-)
Sure.  But Alpha fixed this close to ten years ago, I believe, at least
at the architectural level.  I'm only arguing that it's about time for
other architectures to do the same, if they both share the problem and
care about multithreading.

> ...
> All I am arguing is that we need to agree 
> on what the aim of the atomic model should be before 
> designing a specification.
Agreed.  But to be honest, I'm not sure what we could do that goes
significantly beyond the current proposal in scope.  My intuition is
that anything much more ambitious would lead to unacceptable performance
in some simple cases.  I'd be happy to be proved wrong.

Lawrence Crowl did have a syntactically more ambitious proposal at the
last C++ meeting, which allowed more arbitrary user-defined atomic
operations with a syntactic extension, but with a restriction that
effectively required that it be implementable with compare-and-swap.
But in the end, that seemed to specify than just exposing the
compare-and-swap.  And the committee wasn't enthusiastic about syntax
extensions.

Hans