volatile and barriers (especially on PPC)
Maged Michael
magedm at us.ibm.com
Fri Mar 11 20:59:03 GMT 2005
>> If we allow volatile to have ordering
>> semantics weaker than sequential consistency, we might as well go all
>> the way and separate volatility from memory ordering (which I prefer).
> That's the second time today that someone has said this to me,
> and both times, I have no idea what it means! :-)
If my guess is right, he was in my office when I got your message :-)
> Could you explain?
>
> (The only thing I think you could mean here is to propose that ordinary
> programmers writing ordinary things like double-check will need to take
> responsibility for ordering by manually correctly inserting barrier
> instructions? If so, you will need a very compelling defense for
> handing a new stack of razor blades to C++ programmers.)
Yes. That is what I meant. But I understand the motivation behind giving
volatile strong memory ordering semantics. My point was that if volatile
has acquire/release semantics but not sequential consistency, then the
ordinary programmer is not completely shielded anyway, and will have to
understand something about memory ordering in order to know when using
volatile is adequate and when it is not. I am concerned that in this case
the programmer might overestimate the safety provided by using volatile
and get into trouble, while also paying a performance cost for this
incomplete protection.
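To make the acquire/release-versus-sequential-consistency gap concrete, here
is the classic store-buffering (Dekker-style) litmus test as a minimal Java
sketch (the class and field names are mine, purely illustrative): under
Java's fully ordered volatiles the outcome r1 == 0 && r2 == 0 is forbidden,
whereas acquire/release alone would permit it, because a release store may
be reordered with a later acquire load of a different variable.

```java
// Store-buffering litmus test: the classic case where acquire/release
// ordering is too weak but sequential consistency shields the ordinary
// programmer. All names here are illustrative.
class StoreBuffering {
    // With Java-style fully ordered volatiles, r1 == 0 && r2 == 0 is
    // impossible. If these accesses were only release stores and
    // acquire loads, both threads could read 0.
    static volatile int x = 0, y = 0;
    static int r1, r2;

    static void thread1() { x = 1; r1 = y; }
    static void thread2() { y = 1; r2 = x; }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(StoreBuffering::thread1);
        Thread t2 = new Thread(StoreBuffering::thread2);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Sequential consistency guarantees at least one thread
        // observed the other's store:
        assert r1 == 1 || r2 == 1;
    }
}
```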
So I support either option (1) or (2):
(1) (a) Accesses to volatiles are completely ordered, as in Java.
    (b) There are standard libraries (e.g., atomic_ops), i.e., the stack
of razor blades :-), that allow the programmer to write code with fewer
barriers.
(2) (a) Volatility is separate from ordering.
    (b) Define constructs that allow the programmer to specify ordering
accurately; again, the stack of razor blades.
(3) Something in between.
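As an aside on (1)(b): such a razor-blade library later materialized in
Java itself. A minimal sketch using the real Java 9+ AtomicReference
access-mode methods setRelease and getAcquire (the Publish class and its
scenario are illustrative, not from any standard):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of an option-(1)(b)-style "razor blade" API: the variable has
// fully ordered (volatile) accesses by default, but the expert can pick
// a weaker ordering per access. setRelease/getAcquire are real Java 9+
// methods; everything else here is illustrative.
class Publish {
    static final AtomicReference<String> box = new AtomicReference<>();

    static void publish(String msg) {
        // Release store: orders earlier accesses before it, but unlike
        // a volatile store it need not be followed by StoreLoad.
        box.setRelease(msg);
    }

    static String consume() {
        // Acquire load: orders later accesses after it.
        return box.getAcquire();
    }

    public static void main(String[] args) {
        publish("hello");
        System.out.println(consume()); // prints "hello"
    }
}
```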
However, there is a possibility that the future ordinary programmer will
have to be savvy enough to handle a stack of sharp razors anyway. I am
not a compiler expert, so I don't know, for programs written using (1)(a)
or (3), how successful the compiler can be in generating code as fast (or
almost as fast) as the code generated for programs written using (1)(b)
or (2). If the compiler is not successful, then there is the possibility
that the ordinary programmer might abandon volatile and use only (1)(b)
to seek performance. Also, I hope we don't overestimate the ability of
compilers by taking into account only the few examples, like DCL, that
are widely used today, while ignoring emerging algorithms that may be
popular in the future, in particular non-blocking transactions and data
structures. We need to look at enough examples to reach an answer that
takes those into account. My guess (again, I am not a compiler expert) is
that there will be many unnecessary barriers left unoptimized in
implementations of many efficient non-blocking algorithms.
The following is from the java.util.concurrent ConcurrentLinkedQueue
class. I just numbered the memory accesses in the critical path. We need
to have the following ordering, assuming no speculative writes.
1 before 2
1 before 3
2 before 4
3 before 4
3 before 6
public E poll() {
    for (;;) {
1       Node<E> h = head;
2       Node<E> t = tail;
3       Node<E> first = h.getNext();
4       if (h == head) {
            if (h == t) {
                if (first == null)
                    return null;
                else
                    casTail(t, first);
5           } else if (casHead(h, first)) {
6               E item = first.getItem();
                if (item != null) {
7                   first.setItem(null);
                    return item;
                }
                // else skip over deleted item, continue loop
            }
        }
    }
}
If volatile doesn't have any memory ordering semantics, and if we have
hypothetical memory barriers LoadLoad, LoadStore, StoreStore and
StoreLoad, then we need a LoadLoad barrier between 1 and 2, and another
between 3 and 4. That is 2 LoadLoad barriers.
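Under option (2), those two barriers can be written down explicitly. The
following is a minimal sketch using the real Java 9+ VarHandle.loadLoadFence
and AtomicReference plain-mode accessors; the FencedQueue class itself,
including its simplified single-threaded offer, is an illustrative stand-in
for ConcurrentLinkedQueue, not a faithful reimplementation:

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: the poll() critical path with volatility separated from
// ordering. Only the two LoadLoad barriers named above are issued;
// all other accesses are plain.
class FencedQueue<E> {
    static final class Node<E> {
        E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    final AtomicReference<Node<E>> head;
    final AtomicReference<Node<E>> tail;

    FencedQueue() {
        Node<E> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    // Simplified single-threaded offer, just to feed poll() below.
    void offer(E e) {
        Node<E> n = new Node<>(e);
        tail.getPlain().next = n;
        tail.setPlain(n);
    }

    public E poll() {
        for (;;) {
            Node<E> h = head.getPlain();              // 1
            VarHandle.loadLoadFence();                // 1 before 2, 3
            Node<E> t = tail.getPlain();              // 2
            Node<E> first = h.next;                   // 3
            VarHandle.loadLoadFence();                // 2, 3 before 4
            if (h == head.getPlain()) {               // 4
                if (h == t) {
                    if (first == null)
                        return null;
                    else
                        tail.compareAndSet(t, first);
                } else if (head.compareAndSet(h, first)) {  // 5
                    E item = first.item;              // 6
                    if (item != null) {
                        first.item = null;            // 7
                        return item;
                    }
                    // else skip over deleted item, continue loop
                }
            }
        }
    }
}
```

The second fence also covers "3 before 6", since 6 is a load that comes
after it; no other barriers are issued on the critical path.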
If volatile has full barrier semantics, the compiler will have to assume
that the following ordering is required (in the critical path):
1 before 2 LoadLoad
2 before 3 LoadLoad
3 before 4 LoadLoad
4 before 5 LoadLoad
5 before 6 and 5 before 7 StoreStore/StoreLoad (I don't think
LoadLoad/LoadStore is sufficient)*
6 before returning LoadLoad/LoadStore
7 before returning StoreStore/StoreLoad
* CAS has both read and write semantics but I think that the compiler will
assume that the programmer intended for the write of the CAS to be
completed before 6 and 7 are initiated.
Also, even with a simple example such as DCL: if the compiler doesn't
know from where the function containing the DCL was called, wouldn't it
have to add a LoadLoad/StoreLoad barrier before the first read in the
DCL? So I am not sure that compilers can handle even DCL in general,
though they can handle it well if it is inlined.
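For reference, the DCL idiom in question, written with a fully ordered
volatile. A minimal Java sketch; Foo, Helper, and getHelper are the
conventional illustrative names, not from any particular codebase:

```java
// Double-checked locking (DCL) with a fully ordered volatile field.
// All names here are the conventional illustrative ones.
class Foo {
    static class Helper {
        final int payload = 42;
    }

    private volatile Helper helper;

    public Helper getHelper() {
        Helper h = helper;        // first read: the barrier question
                                  // above concerns what must precede
                                  // this read when the call site is
                                  // not known to the compiler
        if (h == null) {
            synchronized (this) {
                h = helper;       // re-check under the lock
                if (h == null)
                    helper = h = new Helper();
            }
        }
        return h;
    }
}
```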
-Maged