volatile and barriers (especially on PPC)

Boehm, Hans hans.boehm at hp.com
Fri Mar 11 18:49:50 GMT 2005


Here's my perception of the problem, which I think differs a bit:

Consider the assignment v = 17, where v is volatile.  Assume we want
to minimize the cost for readers, as I (and Doug) think we do in most
settings.  The alternative implementations are:

a) Sequential consistency + release for volatiles, as in Java.
General implementation in terms of Doug's primitives:

*Store barrier
st v=17
Store* barrier

b) Release only for volatiles, as in Itanium C/C++ ABI:

*Store barrier
st v=17

On the reader side, I think there is no difference.
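In modern C++ terms (std::atomic postdates this 2005 thread, so this is a restatement, not what was on the table then), the two alternatives correspond roughly to a seq_cst store versus a release store; the variable name `v` follows the example above, everything else is illustrative:

```cpp
#include <atomic>

// Stand-in for the volatile variable v from the example above.
std::atomic<int> v{0};

// Option (a): sequentially consistent store. On x86 this compiles to
// xchg (or mov + mfence) -- the expensive trailing Store* barrier.
// On PPC it requires a full sync.
void store_seq_cst() { v.store(17, std::memory_order_seq_cst); }

// Option (b): release-only store. On x86 this is a plain mov; on PPC
// an lwsync before the store (the *Store barrier), with no trailing
// barrier at all.
void store_release() { v.store(17, std::memory_order_release); }

// The reader side is identical under both options: an acquire load,
// which is free on x86 and cheap on PPC.
int reader() { return v.load(std::memory_order_acquire); }
```

Either store makes the value 17 visible to `reader()`; the difference is only in the ordering guarantee (and cost) on the writer side.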

My intuition is that the second barrier in (a) usually cannot be
optimized away.  The cases where it matters will most commonly be
DCL-like idioms.
In C++ with file-at-a-time compilation, I think you generally won't
have enough context to see things like consecutive volatile stores.
Java-style compilation probably does a bit better here.
For more sophisticated lock-free algorithms, things might be better.
And perhaps those are dynamically more important.
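To make the DCL case concrete, here is a sketch in modern C++ (the type and variable names are invented for illustration). The publishing store only needs release semantics; under alternative (a) it would additionally carry a trailing StoreLoad barrier that a file-at-a-time compiler has no way to prove removable:

```cpp
#include <atomic>
#include <mutex>

struct Widget { int value = 42; };   // hypothetical lazily-built object

std::atomic<Widget*> instance{nullptr};
std::mutex init_mutex;

// Double-checked locking. The store that publishes the new Widget is
// the "v = 17" of the discussion: release ordering suffices, but
// seq_cst semantics would force an extra trailing barrier here.
Widget* get_instance() {
    Widget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        p = instance.load(std::memory_order_relaxed);  // re-check under the lock
        if (p == nullptr) {
            p = new Widget;
            // Release store: orders the Widget's construction before
            // the pointer becomes visible to other threads.
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

Every call returns the same fully-constructed object; the acquire load on the fast path pairs with the release store, so no reader can see the pointer without also seeing the initialized fields.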

In general, the first barrier in (a) is much cheaper than the second,
e.g. 0 cycles vs. 120 on a P4, and similarly for SPARC TSO,
with more like a factor of 2 difference on PPC and Itanium.  (Best
case on Itanium seems to be around 5 and 10 cycles, respectively.)

Thus I conclude that volatile writes are MUCH cheaper if we don't have
sequential consistency for volatiles.  Volatile reads are fine either
way.
And volatile reads probably dominate.  Certainly for DCL.

I think there are important classroom examples for which the difference
matters.  But my recollection of the JSR133 discussion is that we came
up with essentially no real examples.

Thus on a purely practical ("it's fast and mostly works") basis,
I think we're much better off with the weaker semantics.

Nonetheless, I think we did the right thing for Java in going with the
stronger semantics, since the weaker one seemed very unintuitive and
difficult to explain.  But I'm not sure whether that argument applies
here, for two reasons:

1) C++ has traditionally accepted a lot more complexity in order to
get performance.

2) I'm not yet sure whether this is as hard to specify if we leave
the semantics of races undefined, and can concentrate on simply
defining when there is a race.  This is worth thinking about.

Hans



> -----Original Message-----
> From: Doug Lea [mailto:dl at cs.oswego.edu] 
> Sent: Friday, March 11, 2005 10:00 AM
> To: Maged Michael
> Cc: Doug Lea; Andrei Alexandrescu; asharji at plg.uwaterloo.ca; 
> Ben Hutchings; clark.nelson at intel.com; Boehm, Hans; 
> jimmaureenrogers at att.net; Kevlin Henney; 
> pabuhr at plg.uwaterloo.ca; Bill Pugh; Richard Bilson
> Subject: Re: volatile and barriers (especially on PPC)
> 
> 
> 
> 
> > So, considering PPC, it would be preferable not to pay the price of
> > unnecessary StoreLoad barriers (which may be double the cost of
> > release on Power5) that are difficult for the compiler to remove
> > safely.
> 
> I think the important questions here are:
> 
> 1. If volatile had strong semantics, how often would they
>     be stronger than actually required in an application?
> 
> 2. Of those in (1) how many can be weakened using known optimization
>     techniques?
> 
> 3. Of those remaining from (2), how many are used
>     in constructions where a factor of two in barrier cost
>     makes a measurable difference in program performance? (Weight
>     this by the fact that on some platforms, all barriers cost
>     about the same so there is no saving.)
> 
> 4. Of those remaining from (3) how many are found in constructions
>     that are likely to ever be encountered by non-experts? (Experts
>     could instead use the atomics library to micro-optimize.)
> 
> The results of these questions lead to a judgement call about 
> which way makes the most sense to standardize upon. You can 
> tell what my guesses to the answers to these questions are. 
> But they are just guesses.
> 
> 
> > cycles. So far I haven't got my hands on a Power5, but I have been 
> > told that on Power5 lwsync takes about 30 cycles and sync 
> takes about 
> > 60 cycles. However, these numbers may be too low because 
> they were run 
> > on dual processor only and may be expected to be higher on a larger 
> > machine.
> > 
> 
> Digression: Happily we seem to be nearing the end of the era 
> when the chip designers thought they could get away with 
> 100-200 cycle barriers and atomics. (P4/Xeon and, I hear, 
> Intel EM64T are the worst.) Opterons, sparc-niagara, 
> itanium-2 and power5 are all pretty good, ranging around 20-60 cycles.
> 
> -Doug
> 
> 





