[cpp-threads] Brief example ITANIUM Implementation for C/C++ MemoryModel

Alexander Terekhov alexander.terekhov at gmail.com
Fri Jan 2 16:12:48 GMT 2009


On Tue, Dec 30, 2008 at 11:40 PM, Hans Boehm <Hans.Boehm at hp.com> wrote:
>
> On Sat, 27 Dec 2008, Alexander Terekhov wrote:
>
>> On Fri, Dec 26, 2008 at 8:40 AM, Hans Boehm <Hans.Boehm at hp.com> wrote:
>>>
>>> On Wed, 24 Dec 2008, Peter Dimov wrote:
>>>
>>>>> Load Seq_Cst:  mf,ld.acq
>>>>
>>>> I think that the intent of the seq_cst spec was to allow a single ld.acq
>>>> here (and a simple MOV on x86).
>>>>
>>>
>>> Yes.  For both Itanium and X86.  Sorry.  I overlooked that.
>>
>> I disagree. ld.acq is load acquire, not seq_cst. Load seq_cst ought to
>> impose an extra leading store-load barrier (same as with trailing
>> store-load barrier for seq_cst stores vs. st.rel). In the case of
>> adjacent seq_cst operations, redundant store-load fencing can be
>> optimized out by the compiler/implementation. Think of mixing seq_cst
>> operations with relaxed and/or acquire/release ones.
>>
> The seq_cst operations must effectively ensure that data-race-free
> programs using only seq_cst operations behave sequentially consistently.
> In a data-race-free program, you cannot tell if an ordinary load followed
> by a seq_cst load are reordered.  Neither can you tell if a seq_cst atomic
> store followed by an ordinary store are reordered.
>
> Thus the only reason to add fences before a seq_cst load or after a
> seq_cst store is to prevent reordering of a seq_cst store followed by a
> seq_cst load.  For that, it suffices to do one or the other; you don't
> need both.

That would be the case if C/C++ offered only seq_cst atomic
operations and no weaker ones (release/acquire/relaxed). But that is
not the case under the current draft. So my interpretation is based on
simple reasoning: seq_cst means fully fenced (with redundant fencing
being eligible for removal by optimizers).
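
To make the store-load ordering at issue concrete, here is a minimal
sketch (variable and function names are mine) of the classic
store-buffering test that any seq_cst mapping must rule out:

    #include <atomic>
    #include <thread>
    #include <cassert>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_seq_cst);   // seq_cst store
        r1 = y.load(std::memory_order_seq_cst);  // seq_cst load
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);   // seq_cst store
        r2 = x.load(std::memory_order_seq_cst);  // seq_cst load
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join(); t2.join();
        // Sequential consistency forbids r1 == 0 && r2 == 0, so some
        // store-load fence must be emitted either after the store or
        // before the load on each thread -- which one is exactly the
        // point under discussion here.
        assert(!(r1 == 0 && r2 == 0));
    }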

> It's usually cheaper to add the fence only for stores, though
> PowerPC has other constraints.  If you see two consecutive seq_cst stores,
> you only need the fence following the second.

Umm. Under

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.12.16a.html

seq_cst store is

"Store Seq Cst hwsync; st"

and for two consecutive seq_cst stores this results in

hwsync; st; hwsync; st

without any fence following the second store. Instead

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.12.16a.html

prescribes a leading hwsync for loads:

"Load Seq Cst hwsync; ld; cmp; bc; isync "

which, with respect to store-load fencing, is quite similar to Itanium's

"mf,ld.acq"

Under my interpretation, "Store Seq Cst hwsync; st" is not strong
enough and should rather be

"Store Seq Cst lwsync, st, hwsync"

which, with respect to store-load fencing, is quite similar to Itanium's

"st.rel, mf"

So are you in agreement with me or with

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.12.16a.html

<?>

<double wink>

>
> Note that if we actually got a chance to adjust the hardware, the trailing
> fence for a seq_cst store is actually much more than we need, and we could
> probably make this significantly cheaper.  These fences only have to
> order the preceding store with respect to a subsequent ATOMIC load.

Ha! Let's consider seq_cst cmpxchg for Power...

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2008.12.16a.html

indicates something along the lines of

"hwsync; ldarx; cmp; bc _exit; stcwx; bc _loop; isync"

Under my interpretation, this is not strong enough either and should rather be

"hwsync; ldarx; cmp; bc _exit; stcwx; bc _loop; hwsync"

(with _exit on compare failure performing isync rather than hwsync)

which, with respect to store-load fencing, is quite similar to Itanium's

"mf, cmpxchg.acq, mf"

(in the compare-success case)

On Itanium, we could as well do

"mf, cmpxchg.rel, mf"

Do you agree?
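
For reference, the C++-level operation whose mapping is being debated
could be written roughly as follows (function and variable names are
mine):

    #include <atomic>

    std::atomic<int> v{0};

    bool try_update(int expected, int desired) {
        // A seq_cst compare-and-swap as discussed above.  The question
        // is how much fencing the success and failure paths need on
        // Power/Itanium; the text above argues for a trailing full
        // fence on success.
        return v.compare_exchange_strong(expected, desired,
                                         std::memory_order_seq_cst,
                                         std::memory_order_seq_cst);
    }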

>
> I think all of the recipes you have been posting need adjustments to
> remove the redundant leading fences for seq_cst loads.

I disagree.

regards,
alexander.


