volatile and barriers (especially on PPC)
Maged Michael
magedm at us.ibm.com
Fri Mar 11 20:59:03 GMT 2005
>> If we allow volatile to have ordering
>> semantics weaker than sequential consistency, we might as well go all
>> the way and separate volatility from memory ordering (which I prefer).
> That's the second time today that someone has said this to me,
> and both times, I have no idea what it means! :-)
If my guess is right, he was in my office when I got your message :-)
> Could you explain?
>
> (The only thing I think you could mean here is to propose that ordinary
> programmers writing ordinary things like double-check will need to take
> responsibility for ordering by manually correctly inserting barrier
> instructions? If so, you will need a very compelling defense for
> handing a new stack of razor blades to C++ programmers.)
Yes. That is what I meant. But I understand the motivation behind giving
volatile strong memory ordering semantics. My point was that if volatile
has acquire/release semantics but not sequential consistency, then the
ordinary programmer is not completely shielded anyway, and will have to
understand something about memory ordering in order to know when using
volatile is adequate and when it is not. I am concerned that in this case
the programmer might overestimate the safety provided by using volatile
and get into trouble, while also paying a performance cost for this
incomplete protection.
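To make the acquire/release-versus-sequential-consistency gap concrete, here
is the classic store-buffering (Dekker-style) litmus test as a minimal Java
sketch (the class and field names are mine, purely illustrative): under
Java's fully ordered volatiles the outcome r1 == 0 && r2 == 0 is forbidden,
whereas acquire/release alone would permit it, because a release store may
be reordered with a later acquire load of a different variable.

```java
// Store-buffering litmus test: the classic case where acquire/release
// ordering is too weak but sequential consistency shields the ordinary
// programmer. All names here are illustrative.
class StoreBuffering {
    // With Java-style fully ordered volatiles, r1 == 0 && r2 == 0 is
    // impossible. If these accesses were only release stores and
    // acquire loads, both threads could read 0.
    static volatile int x = 0, y = 0;
    static int r1, r2;

    static void thread1() { x = 1; r1 = y; }
    static void thread2() { y = 1; r2 = x; }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(StoreBuffering::thread1);
        Thread t2 = new Thread(StoreBuffering::thread2);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Sequential consistency guarantees at least one thread
        // observed the other's store:
        assert r1 == 1 || r2 == 1;
    }
}
```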
So I support either option (1) or (2):
(1) (a) Accesses to volatiles are completely ordered, as in Java.
    (b) There are standard libraries (e.g., atomic_ops), i.e., the stack
of razor blades :-), that allow the programmer to write code with fewer
barriers.
(2) (a) Volatility is separate from ordering.
    (b) Define constructs that allow the programmer to specify ordering
accurately; again, the stack of razor blades.
(3) Something in between.
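As an aside on (1)(b): such a razor-blade library later materialized in
Java itself. A minimal sketch using the real Java 9+ AtomicReference
access-mode methods setRelease and getAcquire (the Publish class and its
scenario are illustrative, not from any standard):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of an option-(1)(b)-style "razor blade" API: the variable has
// fully ordered (volatile) accesses by default, but the expert can pick
// a weaker ordering per access. setRelease/getAcquire are real Java 9+
// methods; everything else here is illustrative.
class Publish {
    static final AtomicReference<String> box = new AtomicReference<>();

    static void publish(String msg) {
        // Release store: orders earlier accesses before it, but unlike
        // a volatile store it need not be followed by StoreLoad.
        box.setRelease(msg);
    }

    static String consume() {
        // Acquire load: orders later accesses after it.
        return box.getAcquire();
    }

    public static void main(String[] args) {
        publish("hello");
        System.out.println(consume()); // prints "hello"
    }
}
```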
However, there is a possibility that the future ordinary programmer will
have to be savvy enough to handle a stack of sharp razors anyway. I am
not a compiler expert, so I don't know, for programs written using (1)(a)
or (3), how successful the compiler can be in generating code as fast (or
almost as fast) as the code generated for programs written using (1)(b)
or (2). If the compiler is not successful, then there is the possibility
that the ordinary programmer might abandon volatile and use only (1)(b)
to seek performance. Also, I hope we don't overestimate the ability of
compilers by taking into account only the few examples, like DCL, that
are widely used today, while ignoring emerging algorithms that may be
popular in the future, in particular non-blocking transactions and data
structures. We need to look at enough examples to reach an answer that
takes those into account. My guess (again, I am not a compiler expert) is
that there will be many unnecessary barriers left unoptimized in
implementations of many efficient non-blocking algorithms.
The following is from the java.util.concurrent ConcurrentLinkedQueue
class. I just numbered the memory accesses in the critical path. We need
to have the following ordering, assuming no speculative writes.
1 before 2
1 before 3
2 before 4
3 before 4
3 before 6
public E poll() {
    for (;;) {
1       Node<E> h = head;
2       Node<E> t = tail;
3       Node<E> first = h.getNext();
4       if (h == head) {
            if (h == t) {
                if (first == null)
                    return null;
                else
                    casTail(t, first);
5           } else if (casHead(h, first)) {
6               E item = first.getItem();
                if (item != null) {
7                   first.setItem(null);
                    return item;
                }
                // else skip over deleted item, continue loop
            }
        }
    }
}
If volatile doesn't have any memory ordering semantics, and if we have
hypothetical memory barriers LoadLoad, LoadStore, StoreStore and
StoreLoad, then we need a LoadLoad barrier between 1 and 2, and another
between 3 and 4. That is 2 LoadLoad barriers.
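Under option (2), those two barriers can be written down explicitly. The
following is a minimal sketch using the real Java 9+ VarHandle.loadLoadFence
and AtomicReference plain-mode accessors; the FencedQueue class itself,
including its simplified single-threaded offer, is an illustrative stand-in
for ConcurrentLinkedQueue, not a faithful reimplementation:

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: the poll() critical path with volatility separated from
// ordering. Only the two LoadLoad barriers named above are issued;
// all other accesses are plain.
class FencedQueue<E> {
    static final class Node<E> {
        E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    final AtomicReference<Node<E>> head;
    final AtomicReference<Node<E>> tail;

    FencedQueue() {
        Node<E> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    // Simplified single-threaded offer, just to feed poll() below.
    void offer(E e) {
        Node<E> n = new Node<>(e);
        tail.getPlain().next = n;
        tail.setPlain(n);
    }

    public E poll() {
        for (;;) {
            Node<E> h = head.getPlain();              // 1
            VarHandle.loadLoadFence();                // 1 before 2, 3
            Node<E> t = tail.getPlain();              // 2
            Node<E> first = h.next;                   // 3
            VarHandle.loadLoadFence();                // 2, 3 before 4
            if (h == head.getPlain()) {               // 4
                if (h == t) {
                    if (first == null)
                        return null;
                    else
                        tail.compareAndSet(t, first);
                } else if (head.compareAndSet(h, first)) {  // 5
                    E item = first.item;              // 6
                    if (item != null) {
                        first.item = null;            // 7
                        return item;
                    }
                    // else skip over deleted item, continue loop
                }
            }
        }
    }
}
```

The second fence also covers "3 before 6", since 6 is a load that comes
after it; no other barriers are issued on the critical path.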
If volatile has full barrier semantics, the compiler will have to assume
that the following ordering is required (in the critical path):
1 before 2 LoadLoad
2 before 3 LoadLoad
3 before 4 LoadLoad
4 before 5 LoadLoad
5 before 6 and 5 before 7 StoreStore/StoreLoad (I don't think
LoadLoad/LoadStore is sufficient)*
6 before returning LoadLoad/LoadStore
7 before returning StoreStore/StoreLoad
* CAS has both read and write semantics but I think that the compiler will
assume that the programmer intended for the write of the CAS to be
completed before 6 and 7 are initiated.
Also, even with a simple example such as DCL: if the compiler doesn't
know from where the function containing the DCL was called, wouldn't it
have to add a LoadLoad/StoreLoad barrier before the first read in the
DCL? So I am not sure that compilers can handle even DCL in general,
though they can handle it well if it is inlined.
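For reference, the DCL idiom in question, written with a fully ordered
volatile. A minimal Java sketch; Foo, Helper, and getHelper are the
conventional illustrative names, not from any particular codebase:

```java
// Double-checked locking (DCL) with a fully ordered volatile field.
// All names here are the conventional illustrative ones.
class Foo {
    static class Helper {
        final int payload = 42;
    }

    private volatile Helper helper;

    public Helper getHelper() {
        Helper h = helper;        // first read: the barrier question
                                  // above concerns what must precede
                                  // this read when the call site is
                                  // not known to the compiler
        if (h == null) {
            synchronized (this) {
                h = helper;       // re-check under the lock
                if (h == null)
                    helper = h = new Helper();
            }
        }
        return h;
    }
}
```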
-Maged