[cpp-threads] Implementation of seq_cst load on x86

Wed Jun 2 16:44:48 BST 2010

On Wed, May 19, 2010 at 01:54:00PM -0700, Jeffrey Yasskin wrote:
> Hello all,
> 
> I've seen assertions [1] that load(memory_order_seq_cst) can be
> implemented as a simple mov on x86 as long as
> store(memory_order_seq_cst) is implemented as xchg. Does anyone know
> of a paper that proves this claim from the guarantees in Intel's and
> AMD's architecture documents?
> 
> Sorry if this was the wrong list to ask this question on.
> 
> Thanks,
> Jeffrey Yasskin
> 
> [1] http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html
> and http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html

For Intel, see section 8.2.2 page 8-10 of:

	http://download.intel.com/design/processor/manuals/253668.pdf

where it says "Locked instructions have a total order."  This takes care
of your stores.

In section 8.1.1 on page 8-3 of the same document, we see that any aligned
read of a basic type is atomic, as is any unaligned read that is wholly
enclosed in a single cache line.  Exceptions to this rule include SSE
instructions and x87 floating-point operations that access more than 64
bits of data.
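
As a rough illustration of that rule, here is a minimal sketch of the
address check it implies.  The helper names and the 64-byte cache-line
size are my assumptions, not something the Intel manual defines, and the
check ignores the SSE and x87 exceptions mentioned above.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on most current x86 parts.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical helper: true if [addr, addr + size) lies in one cache line.
constexpr bool within_one_cache_line(std::uintptr_t addr, std::size_t size) {
    return size > 0 && size <= kAssumedCacheLine &&
           (addr / kAssumedCacheLine) ==
               ((addr + size - 1) / kAssumedCacheLine);
}

// Hypothetical helper: true if addr is a multiple of the access size.
constexpr bool is_naturally_aligned(std::uintptr_t addr, std::size_t size) {
    return size > 0 && (addr % size) == 0;
}

// Reading of the paragraph above: a plain load of at most 8 bytes is
// atomic if it is aligned, or if it is unaligned but does not cross a
// cache-line boundary.
constexpr bool plain_load_is_atomic(std::uintptr_t addr, std::size_t size) {
    return size > 0 && size <= 8 &&
           (is_naturally_aligned(addr, size) ||
            within_one_cache_line(addr, size));
}

static_assert(plain_load_is_atomic(0x1000, 8), "aligned 8-byte load");
static_assert(!plain_load_is_atomic(0x103c, 8), "straddles two lines");
```

In practice you would let std::atomic<T> take care of alignment rather
than checking addresses by hand.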

So, any given load of up to 64 bits will return data from up to eight
stores, one per byte (though for the sake of those who read your code, I
hope you stick mainly with loads and stores of the same sizes to any given
location).  Because these stores were carried out using locked xchg, they
are totally ordered.  This load can then be assigned any place in the
global ordering following the last store that it read and preceding the
next store that would otherwise have affected the value returned by that
load.

Hence, if all stores use xchg (or any other locked instruction) and if
all loads are properly aligned and sized, you do get SC.
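
To make this concrete at the C++ level, here is a minimal sketch of the
classic store-buffering test using seq_cst atomics.  The function name is
hypothetical, and the comments about xchg and mov describe the mapping
under discussion in this thread rather than anything the standard
guarantees about code generation.  Under SC, the outcome where both
threads read 0 must never occur.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffering litmus test `iterations`
// times and returns how many runs saw the outcome SC forbids (r1 == 0 and
// r2 == 0).
int count_sb_violations(int iterations) {
    assert(iterations > 0);
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            // Under the mapping discussed here: a locked xchg.
            x.store(1, std::memory_order_seq_cst);
            // Under the mapping discussed here: a plain mov.
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        // join() orders the writes to r1 and r2 before these reads.
        if (r1 == 0 && r2 == 0) ++violations;
    }
    return violations;
}
```

A passing run proves nothing about the hardware, of course, but a nonzero
count would indicate a broken implementation of seq_cst.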

For AMD, the wording is less clear.  Find "AMD x86-64 Architecture
Programmer's Manual Volume 2: System Programming" from amd.com, and look
at the litmus test at the beginning of page 166:

  Processor 0      Processor 1      Processor X   Processor Y
  LOCK XCHG A, 1   LOCK XCHG B, 1
                                    Load A (1)    Load B (1)
                                    Load B (0)    Load A (0)

The text says that this outcome is prohibited, which is consistent
with normal loads and locked-xchg stores providing SC outcomes.
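
The same four-processor test can be sketched in C++ with seq_cst atomics.
The function name is hypothetical and the thread names simply mirror the
processor columns above; this is an illustration of the litmus test, not
a reproduction of anything in the AMD manual.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the four-thread litmus test `iterations`
// times and returns how many runs saw the prohibited outcome, in which
// the two readers disagree about the order of the two stores.
int count_iriw_violations(int iterations) {
    assert(iterations > 0);
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> a{0}, b{0};
        int xa = -1, xb = -1, ya = -1, yb = -1;
        std::thread p0([&] { a.store(1, std::memory_order_seq_cst); });
        std::thread p1([&] { b.store(1, std::memory_order_seq_cst); });
        std::thread px([&] {
            xa = a.load(std::memory_order_seq_cst);  // Load A
            xb = b.load(std::memory_order_seq_cst);  // Load B
        });
        std::thread py([&] {
            yb = b.load(std::memory_order_seq_cst);  // Load B
            ya = a.load(std::memory_order_seq_cst);  // Load A
        });
        p0.join();
        p1.join();
        px.join();
        py.join();
        // Prohibited: X sees A then not B, while Y sees B then not A.
        if (xa == 1 && xb == 0 && yb == 1 && ya == 0) ++violations;
    }
    return violations;
}
```

As with any litmus test, a zero count over a finite number of runs is
only consistent with SC and does not prove it.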

Hope this helps...

							Thanx, Paul