[cpp-threads] Re: A hopefully clearer document on POSIX threads

Nick Maclaren nmm1 at cus.cam.ac.uk
Sun Feb 19 18:26:52 GMT 2006


Bother.  I missed this one.

From: "Boehm, Hans" <hans.boehm at hp.com>
Subject: RE: Re[2]: [cxxpanel] RE: A hopefully clearer document on POSIX threads

> -----Original Message-----
> From: Valentin Samko [mailto:cxxpanel.valentin at samko.info] 
>
> HB> Assuming you mean what's also called a "memory fence", i.e. an
> HB> instruction that enforces certain kinds of inter-thread memory
> HB> ordering, then that's generally implemented locally, I believe.
> HB> I don't think it gets dramatically more expensive in large machines.
> 
> I meant the memory fence which guarantees CPU cache coherency
> and that read/write reordering does not cause problems.  This
> is not an issue right now on x86 architectures, as CPU cache
> coherency is guaranteed by the hardware (but reordering still
> occurs), but on Itanium only CPU data cache coherency is
> guaranteed, and I think that such guarantees do not make much
> sense on massively multi-CPU/multi-core machines.
> 
> For example, I understand that "interlocked" operations use
> such memory fences.  How can these interlocked operations be
> performed with local memory barriers (i.e. local to a few
> CPUs), considering that other processes may be using that
> memory (if it is shared) as well?
> 
> Don't we need to guarantee CPU cache coherency across all
> the CPUs?

I think this is getting a bit off topic, but ...

The machines we're directly addressing are all cache-coherent, i.e. the
hardware ensures that all the caches are consistent.  Typically this
means that a processor can update a cache line only after acquiring
exclusive ownership of the line, thus ensuring it is cached nowhere
else.
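
As a toy model of that ownership rule (an invalidation-based protocol in
the MESI family; this is only a sketch, not any real implementation):

    #include <cstdio>

    enum LineState { Invalid, Shared, Exclusive, Modified };

    struct CacheLine {
        LineState state;
        int data;
    };

    // Before a core may write, all other cached copies of the line are
    // invalidated, so at most one cache ever holds a dirty copy.
    void write_line(CacheLine &mine, CacheLine &theirs, int value)
    {
        if (mine.state != Exclusive && mine.state != Modified) {
            theirs.state = Invalid;    // "invalidate" message on the bus
            mine.state = Exclusive;    // now the only cached copy
        }
        mine.data = value;
        mine.state = Modified;         // dirty with respect to memory
    }

    int main()
    {
        CacheLine a = { Shared, 0 }, b = { Shared, 0 };  // two clean copies
        write_line(a, b, 42);
        std::printf("a: state=%d data=%d  b: state=%d\n",
                    a.state, a.data, b.state);
        return 0;
    }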

This does not ensure sequential consistency: for example, all modern
processors contain a store buffer, so a store appears to complete, and
its result becomes locally visible, before the value makes it into the
cache or becomes visible to other processors.
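
The classic "store buffer" litmus test makes this concrete.  A sketch,
using the std::atomic names from the current proposals (the syntax is
assumed, not settled):

    #include <atomic>
    #include <thread>
    #include <cstdio>

    std::atomic<int> x(0), y(0);
    int r1, r2;

    int main()
    {
        // Each store can sit in its processor's store buffer while the
        // following load executes, so r1 == 0 && r2 == 0 is a permitted
        // outcome -- one that sequential consistency would rule out.
        std::thread t1([] { x.store(1, std::memory_order_relaxed);
                            r1 = y.load(std::memory_order_relaxed); });
        std::thread t2([] { y.store(1, std::memory_order_relaxed);
                            r2 = x.load(std::memory_order_relaxed); });
        t1.join();
        t2.join();
        std::printf("r1=%d r2=%d\n", r1, r2);   // "r1=0 r2=0" can happen
        return 0;
    }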

On Itanium, the memory fence instruction ensures that memory operations
issued before the fence become visible to other processors before those
issued after the fence.  This may mean that the processor has to wait
for buffered stores to make it to the cache.  That may involve
processing a cache miss, which may take longer in large machines.  I
think there are other, more complex but qualitatively similar, issues as
well.  (I'm not a hardware architect.)  But I don't think there's
anything here that scales worse than cache misses.
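
To illustrate, inserting a full fence in each thread of the previous
sketch rules out the r1 == 0 && r2 == 0 outcome (same assumed
std::atomic syntax; on Itanium the fence would presumably compile to an
mf instruction, on x86 to an mfence or equivalent):

    #include <atomic>
    #include <thread>
    #include <cstdio>

    std::atomic<int> x(0), y(0);
    int r1, r2;

    int main()
    {
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);
            // The fence forces the store to x to become visible to other
            // processors before the load of y is performed.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        std::printf("r1=%d r2=%d\n", r1, r2);   // at least one is 1 now
        return 0;
    }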

Itanium does generally have a more relaxed memory model than x86.  But
x86 can also make a store followed by a load visible out of order, so
this isn't really a new issue.

"Interlocked" instructions are generally also implemented by first
getting exclusive access to a cache line, and then performing the
operation locally.  Similar observations apply.
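
For instance, an interlocked increment (sketched here with the assumed
std::atomic syntax, fetch_add playing the role of InterlockedIncrement)
is one indivisible read-modify-write, so no updates are lost even under
contention:

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    std::atomic<long> counter(0);

    int main()
    {
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([] {
                for (int j = 0; j < 100000; ++j)
                    // The processor acquires the line exclusively,
                    // performs the add locally, then releases the line.
                    counter.fetch_add(1);
            });
        for (auto &t : workers)
            t.join();
        std::printf("%ld\n", counter.load());   // always prints 400000
        return 0;
    }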
> 
> I understand that NUMA is designed to fix this type of
> problem, but one needs to adapt the source code or the
> language to make use of it.

Cache-coherent NUMA machines (e.g. multiprocessor Opterons) are
generally programmed like SMPs, at least initially.  For performance
critical code, you may want to pay attention to data placement.  But
that generally affects only the performance of cache misses.  Depending
on the code and the difference in latencies, you may be able to get away
with ignoring that altogether.  I believe many applications do ignore
it.
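
If you do want to pay attention to placement, one common trick is to
rely on the OS's "first touch" page placement, where it exists: have
each worker initialize the portion of the data it will later use.  A
hypothetical sketch (thread-to-node pinning, which you would also want,
is omitted):

    #include <memory>
    #include <thread>
    #include <vector>
    #include <cstddef>
    #include <cstdio>

    int main()
    {
        const std::size_t n = std::size_t(1) << 24;
        const unsigned workers = 4;
        // new double[n] does not value-initialize, so the pages stay
        // untouched until a thread writes them.
        std::unique_ptr<double[]> data(new double[n]);

        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w)
            pool.emplace_back([&data, n, w] {
                std::size_t lo = w * (n / workers);
                std::size_t hi = (w + 1) * (n / workers);
                for (std::size_t i = lo; i < hi; ++i)
                    data[i] = 0.0;   // first touch: under a first-touch
                                     // policy the page is allocated on
                                     // this thread's node
            });
        for (auto &t : pool)
            t.join();
        std::printf("initialized %zu doubles\n", n);
        return 0;
    }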

Hans

