it's a rough rough rough draft

Hans Boehm Hans.Boehm at hp.com
Fri Sep 10 06:18:45 BST 2004


On Thu, 9 Sep 2004, Andrei Alexandrescu wrote:

> > We can't possibly turn this into a threads introduction.
> > Nor do I think we should try.  We should reference one.
> > I think Andrew Birrell's "An Introduction to Programming
> > with C# Threads"
> > (http://research.microsoft.com/~birrell/papers/ThreadsCSharp.pdf)
> > is probably fine, and has a good intro,
> > though some of the finer details don't carry
> > over.  Anyone know of a better one?
>
> I think referring people to Butenhof's book on pthreads is the best thing to
> do.
Why don't we do both?  I think an on-line reference is good as well.
>
> The point is not to include a tutorial on threads, but rather a tutorial on
> memory visibility. As I told Doug, most people on the committee do
> understand lock-driven programming where you acquire a lock and then see
> everything everybody ever did and can do anything you want. Then they try to
> integrate an understanding of the memory model within that simple framework,
> and they hit a brick wall.
>
> I've had a very very relevant exchange with a gentleman on clc++m that has
> this exact pattern: "I'm not an expert on threads, but I would like to
> understand these things you are referring to". Me: "Here's some info, and
> here are some references." Him (furious): "I don't have the time and the
> inclination to read the references, but what you say makes no sense. You are
> snooty. I claim to have a pretty good understanding of threads, and what you
> say makes absolutely no sense in that framework, so my conclusion is that
> you enjoy acting precious by using obscure language and not discussing
> your views." (Of course, not in these exact words, but that's the impression
> I've gotten.)
>
> We should prevent such a reaction.
I agree.  But it will be hard to add a lot of motivational stuff between
now and tomorrow.
>
> > Section 4:
> >
> > The core of the new Java memory model is to classify synchronization
> > operations as "acquire" (e.g. lock acquisition) or "release" (e.g. lock
> > release) operations on some synchronization object.  A "release"
> > operation has the effect (among others) of ensuring that all memory
> > operations performed by a thread before the "release" operation
> > become visible to another thread after it performs a corresponding
> > "acquire" operation.  Thus, for example a thread that acquires a lock
> > is guaranteed to see all updates that occurred while another thread
> > held the same lock.
>
> Java needed to do that because the synchronization objects were implicit
> (though I know Java 1.5 has explicit ones as well), but for C++ I think
> synchronization objects should be separate from the memory model. In wake of
> Doug's idea that the library and memory model proposals could be separate, I
> think it's easier to go a different route: We define terms such as memory
> barriers, weak read, strong read, weak write, and strong write, and later
> on, when we define synchronization objects and whatnot, we mention what they
> do in terms of the notions we defined in the memory model.
I think that something like release and acquire operations need to be part of
the memory model.  A "strong read", if I understand what you mean, behaves
very analogously to pthread_mutex_lock, which is similar to EnterCriticalSection,
thread join, etc. with respect to visibility rules.  We want to capture that
commonality.  We can call it something else, but I think the acquire-release
terminology is fairly well established.

It is less clear whether acquire and release operations need to match
to ensure visibility, as in Java, or whether any release ensures visibility
to any subsequent acquire, as in the Itanium hardware model, for
example.

We may also want to consider operations that have release and acquire
semantics only with respect to loads or stores, not both.

My impression is also that few people really understand what memory
barriers are.  Explaining them as a nop with both acquire and release
semantics is probably as good as anything, though that only works if
we don't have a notion of matching acquires and releases.
>
> > Section 5:
> >
> > I would make this much more tentative, e.g. start with:
> >
> > We are planning to propose a set of atomic operations on shared memory
> > locations to support lock-free programming.  For each such operation
> > we will need to specify not only its function, but also the ordering
> > constraints it imposes, e.g. whether it behaves as an "acquire" or
> > "release" operation, or neither or both.  For many operations
> > multiple variants make sense.  We are currently undecided as
> > to both the syntax of these constraints, and how many variants there
> > should be.  Thus we omit them for the rest of this presentation.
>
> Sounds good, though people might confuse lock-free programming with the
> whiz-bang lock-free programming that has become popular nowadays. But said
> primitives can be used for much more conservative purposes, such as
> manipulating the reference count of a smart pointer.
We should clearly add examples.  But not by tomorrow.
>
> > I think cas should be one of the atomic_ primitives.
>
> Ok. By the way, I have two questions I'd like to ask:
>
> 1. Is there any machine of any importance to anyone that has threads but no
> CAS (or some way of implementing it efficiently)?
Yes. PA-RISC has no such instruction.  It does have fast kernel calls, I
think, so you could implement it with a kernel lock, sort of.  Pentium 4s
do have the instruction, but it takes on the order of 200 cycles, so in
the uncontended case I wouldn't be surprised if the emulation on PA-RISC
could be faster.

Architectures differ widely as to what ordering implications CAS has,
i.e. whether it includes a memory barrier, and what kind.  On Itanium,
you get a choice of "acquire" or "release".  But I think you want at
least all four possible combinations of those.
>
> 2. Couldn't the compiler use CAS to resolve the word tearing problem? I
> might be wrong here, but to me it seems like any word tearing problem can be
> solved if the compiler emits an appropriate looped CAS to produce a proper
> read-modify-write sequence. Now things might be less efficient, but I don't
> know by how much.
Yes.  It would make your Pentium 4 run roughly like a 486.

Hans






More information about the cpp-threads mailing list