volatile and barriers (especially on PPC)

Maged Michael magedm at us.ibm.com
Fri Mar 11 16:43:01 GMT 2005


Doug, I agree with your analysis below regarding release vs StoreLoad.
The short answer regarding PPC is that release is cheaper than a full 
barrier that includes StoreLoad, and the difference may be significant.

I should note that it is impossible to give an exact cost for barrier 
instructions (at least on PPC), as it depends on the machine size and on 
the instructions surrounding the barrier. So the following are rough 
estimates. In experiments (using a shared counter) on an 8-way Power4, 
lwsync (LoadLoad/LoadStore/StoreStore) takes about 120 cycles, while sync 
(full barrier including StoreLoad) takes about 150 cycles.
I haven't yet got my hands on a Power5, but I have been told that on 
Power5 lwsync takes about 30 cycles and sync takes about 60 cycles. 
However, these numbers may be too low, because they were measured on a 
dual-processor machine only and can be expected to be higher on a larger 
machine.
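
For concreteness, a measurement of this kind might look roughly like the 
following. This is only a sketch under stated assumptions: GCC-style inline 
assembly on PowerPC, a hypothetical get_cycles() placeholder for a cycle 
counter, and a plain increment standing in for an atomic shared-counter 
update; it is not the harness behind the numbers above.

    #include <stdint.h>

    extern uint64_t get_cycles();   // hypothetical cycle counter (placeholder)

    volatile long counter = 0;      // shared counter touched by each thread

    static inline void lwsync_barrier() { __asm__ __volatile__("lwsync" ::: "memory"); }
    static inline void sync_barrier()   { __asm__ __volatile__("sync"   ::: "memory"); }

    // Average cycles per (update + barrier) iteration.
    uint64_t measure(bool use_full_sync, long iters) {
        uint64_t start = get_cycles();
        for (long i = 0; i < iters; ++i) {
            ++counter;                  // a real experiment would use an atomic update
            if (use_full_sync)
                sync_barrier();         // full barrier, including StoreLoad
            else
                lwsync_barrier();       // LoadLoad/LoadStore/StoreStore only
        }
        return (get_cycles() - start) / iters;
    }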

So, considering PPC, it would be preferable not to pay the price of 
unnecessary StoreLoad barriers (which may be double the cost of a release 
on Power5) that are difficult for the compiler to remove safely.
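
To make the cost concrete, here is one plausible PPC mapping (a sketch, not 
an official or mandated code sequence) for a volatile store under the two 
candidate semantics, with GCC-style inline assembly standing in for what a 
compiler would emit; the barrier placement follows the note in Doug's 
message below about putting StoreLoad after writes rather than before reads.

    volatile int flag = 0;   // illustrative volatile variable

    // Full semantics: the store may not be reordered with ANY later access.
    void volatile_store_full() {
        __asm__ __volatile__("lwsync" ::: "memory"); // order earlier accesses before the store
        flag = 1;                                    // the volatile store itself
        __asm__ __volatile__("sync" ::: "memory");   // StoreLoad: order the store before later loads
    }

    // Release-only semantics: the trailing sync (the expensive part) is dropped.
    void volatile_store_release() {
        __asm__ __volatile__("lwsync" ::: "memory"); // release ordering only
        flag = 1;
    }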

However, wouldn't offering anything less than full ordering around 
volatiles defeat the purpose of attaching memory ordering semantics to 
volatile, which is to unburden the programmer (if s/he so desires) from 
thinking about memory ordering? If we allow volatile to have ordering 
semantics weaker than sequential consistency, we might as well go all the 
way and separate volatility from memory ordering (which is what I would 
prefer).
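
As a rough illustration of the kind of separation meant here, consider an 
interface where ordering is requested per access rather than implied by the 
declaration. The sketch below is written with the std::atomic syntax of 
modern C++, purely to show the shape of the idea; it is an illustration, 
not a proposal from this thread.

    #include <atomic>

    std::atomic<int> ready{0};   // shared flag: atomicity without implied full ordering
    int data = 0;

    void publish() {
        data = 42;
        ready.store(1, std::memory_order_release);   // release only: lwsync-class barrier on PPC
    }

    int consume() {
        if (ready.load(std::memory_order_acquire))   // acquire only
            return data;
        return -1;
    }

    // Where Dekker-style read-after-write ordering really is needed, the
    // stronger (and more expensive) ordering is spelled out at the access:
    //     ready.store(1, std::memory_order_seq_cst);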

-Maged





Doug Lea <dl at cs.oswego.edu> 
03/11/2005 07:11 AM

To: Maged Michael/Watson/IBM at IBMUS
Cc: Bill Pugh <pugh at cs.umd.edu>, "Boehm, Hans" <hans.boehm at hp.com>, Ben 
Hutchings <ben at decadentplace.org.uk>, Kevlin Henney 
<kevlin at curbralan.com>, Richard Bilson <rcbilson at uwaterloo.ca>, 
clark.nelson at intel.com, Andrei Alexandrescu <andrei at metalanguage.com>, 
jimmaureenrogers at att.net, Doug Lea <dl at cs.oswego.edu>, 
asharji at plg.uwaterloo.ca, pabuhr at plg.uwaterloo.ca
Subject: volatile and barriers (especially on PPC)

I wrote...

> 
> I'm not so sure we want to go into these details, although my mention
> of Dekker's algorithm sorta implies we should. Here's something
> that we could somehow adapt if necessary. (This is mostly from
> memory of JMM discussions so could be wrong, although I did recheck
> IA64 ref manual vol 2, page 387+  --
> http://developer.intel.com/design/itanium/manuals/iiasdmanual.htm)
> 
> 
> The main question here is whether a write to a volatile must
> always entail a full StoreLoad (as in Java)(*), or whether it could be
> done with what on IA64 is a "Release" (st.rel). This shows up in
> Dekker's algorithm. A Release is only good with respect to an "Acquire"
> (ld.acq) on the SAME variable (modulo quirks).
> But Dekker's algorithm includes code of the form:
> 
> Thread 1:  write A;  read  B
> Thread 2:  write B;  read  A
> 
> So, if we allowed a weaker version, programmers would need to somehow know that
> they need to resort to the atomics library to implement this, and know
> to manually use a heavier barrier.
> 
> The choice is harder than it looks ...
>    1. It only impacts platforms for which Release is cheaper
>       than StoreLoad.
>    2. The majority of code using volatiles would work fine
>       with Release. In particular, nearly all uses of double-check.
>    3. Many programmers relying on read-after-write guarantees
>       WILL know enough to use atomics library.
>    4. The analysis needed to weaken StoreLoad to Release in those
>       cases where it would be OK is tricky,
>       requiring good alias analysis among other things, so is not
>       something you'd like to effectively mandate.
>       (Aside: On the other hand, detecting ONLY double-check
>       would probably get 90% of the potential speedup.)
>    5. IA64 has comparatively fast StoreLoad (mf) (compared to
>       p4/Xeon and EM64T anyway), so not doing this optimization
>       is not a huge loss.
>    6. Performance impact on PPC remains unknown to me, since I still
>       don't know the optimal forms of things like double-check that
>       apply to the various versions of PPC.
> 
> For Java, we chose to keep the usage rules as simple
> as we could, so we used the strong version.
> 
> (*) Actually, this all assumes that you choose to place the
> StoreLoad barriers after writes rather than before reads. This
> is almost always the best way, but there are a couple of cases
> where doing the opposite could win.
> 
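
To make the two constructions above concrete: a minimal sketch of the 
Dekker-style handshake and of double-checked initialization, with explicit 
PPC barriers via GCC-style inline assembly (the locking in the double-check 
is elided; this is an illustration, not a proposed mapping from either 
message).

    static inline void sync_barrier()   { __asm__ __volatile__("sync"   ::: "memory"); }  // full, incl. StoreLoad
    static inline void lwsync_barrier() { __asm__ __volatile__("lwsync" ::: "memory"); }  // no StoreLoad

    // Dekker-style handshake: release alone is NOT enough.  If both threads
    // used only lwsync here, each load could be satisfied before the other
    // thread's store became visible, and both could read 0.
    volatile int A = 0, B = 0;

    int thread1() { A = 1; sync_barrier(); return B; }
    int thread2() { B = 1; sync_barrier(); return A; }

    // Double-checked initialization: release/acquire (lwsync) suffices,
    // so no StoreLoad barrier is needed on this path.
    struct Thing { int x; };
    Thing* volatile instance = 0;

    Thing* get_instance() {
        Thing* p = instance;            // first, unlocked check
        lwsync_barrier();               // acquire: order the flag load before uses of *p
        if (p == 0) {
            // ... take a lock and re-check here ...
            Thing* q = new Thing;
            q->x = 42;
            lwsync_barrier();           // release: order initialization before publication
            instance = q;
            p = q;
            // ... release the lock ...
        }
        return p;
    }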

Maged: Can you do me/us a favor and explain how
these issues impact various versions of PowerPC? One way
to do it would be to show the best code sequences for some common
constructions, and to say whether they could be generated automatically
by compilers, assuming various semantics for volatiles and assuming the
existence of the simple optimizations I described for Java (or others
along those lines).

Thanks!

-Doug

