[cpp-threads] modes, pass 2

Alexander Terekhov alexander.terekhov at gmail.com
Wed May 11 14:14:19 BST 2005


On 5/11/05, Alexander Terekhov <alexander.terekhov at gmail.com> wrote:
[...]
> > > No comment. ;-) Well, but if you have access to IBM's w3forums, see my
> > > "PMFJI" message in "Apparent memory consistency in SMP Power5
> > > environment" thread at forums.hardware.powerpc...
> > >
> >
> > I don't think I do.
> > If you think it would help clear things up for the (many!) people
> > who read cookbook, could you please ask permission to send
> > me whatever clarifications this entails?
> 
> Well, I've just rambled a bit in reply to John McCalpin's message
> about memory model on Power and lwsync + isync. I'll ask him if
> it's OK to <Forward Quoted> his message to this list.

No problem, says John.

< Forward Quoted > (my "PMFJI" reply is <Forward Inline>'d below it)

-------
John McCalpin wrote:
> 
> In article <d29nrv$ik6$1 at w3forums1b.pok.ibm.com>,
>  Richard Brandle <rbrandl at us.ibm.com> wrote:
> 
> > If I interpret your reply correctly you are saying memory will not
> > appear consistent across processors for some unspecified period of time
> >   unless the application does something to ensure the consistency.  OUCH!!
> 
> This is a fundamental attribute of any weakly consistent memory model.
> The PowerPC memory model is one of the weak ones, and this was chosen
> deliberately to allow the hardware a great deal of flexibility in
> dealing with the case where ordering between references is not actually
> required.
> 
> > Put chronologically, A reads the value, B modifies the value, A reads
> > the value again and may or may not get the updated value.  B reads the
> > value again and gets the updated value.  A reads the value a third time
> > and still may or may not get the updated value.  For some undefined
> > period of time the view of the memory content is not the same between
> > the two processors.  Is that correct?  If so that's ugly!  Other SMP
> > architectures don't act this way do they?
> 
> This is going to be the case on almost any architecture -- even the most
> strongly ordered ones require additional synchronization.
> 
> The usual sequence in a producer/consumer relationship is
>   . producer writes the data that it wants to send
>   . producer writes to a separate memory location that acts as a "flag"
> that says the real data is ready
> 
>   . the consumer spins on the "flag" until it is set
>   . then the consumer can read the data
> 
> Of course locks are a more general case of "flags", but for the simple
> case of pairwise communication you don't need the full overhead of a
> lock.
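In modern C++ terms (std::atomic, which postdates this thread), the flag pattern above can be sketched as follows; the names `payload` and `ready` are illustrative, not from the original:

```cpp
#include <atomic>
#include <thread>

// Producer/consumer "flag" pattern: the producer writes the data, then the
// flag; the consumer spins on the flag, then reads the data. Default
// (sequentially consistent) atomics keep the two steps ordered.
int payload = 0;                 // the real data
std::atomic<bool> ready{false};  // the "flag"

void producer() {
    payload = 42;       // producer writes the data it wants to send
    ready.store(true);  // producer writes the flag
}

int consumer() {
    while (!ready.load()) {}  // consumer spins on the flag
    return payload;           // then the consumer can read the data
}

// Drives one producer/consumer round and returns what the consumer saw.
int run_once() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```
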
> 
> On PowerPC systems, extra steps must be taken to ensure that the items
> above happen in precisely the correct sequence.
> The straightforward implementation is:
>  . producer writes the data it wants to send
>  . producer executes a "lightweight sync"  (lwsync) instruction
>  . producer writes the flag
> 
>   . consumer spins on the flag until it is set
>   . consumer executes an "isync" to discard any subsequent speculative
> memory reads
>   . consumer reads the data
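A compilable sketch of that sequence, assuming GCC-style inline assembly for the PowerPC barriers; on other targets it falls back to roughly equivalent C++ fences (the fallback is my assumption for portability, not part of the original recipe):

```cpp
#include <atomic>
#include <thread>

#if defined(__powerpc__) || defined(__powerpc64__)
// The actual PowerPC barrier instructions.
static inline void lwsync() { __asm__ __volatile__("lwsync" ::: "memory"); }
static inline void isync()  { __asm__ __volatile__("isync"  ::: "memory"); }
#else
// Assumed fallback so the sketch compiles elsewhere: fences with a
// comparable release / acquire effect.
static inline void lwsync() { std::atomic_thread_fence(std::memory_order_release); }
static inline void isync()  { std::atomic_thread_fence(std::memory_order_acquire); }
#endif

static int data_word = 0;
static std::atomic<int> flag{0};

void produce(int v) {
    data_word = v;                             // producer writes the data
    lwsync();                                  // order data store before flag store
    flag.store(1, std::memory_order_relaxed);  // producer writes the flag
}

int consume() {
    while (flag.load(std::memory_order_relaxed) == 0) {}  // spin on the flag
    isync();           // discard loads issued speculatively before the flag was seen
    return data_word;  // consumer reads the data
}

// Drives one round and returns what the consumer read.
int run_demo() {
    data_word = 0;
    flag.store(0);
    int seen = 0;
    std::thread c([&] { seen = consume(); });
    std::thread p(produce, 7);
    p.join();
    c.join();
    return seen;
}
```
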
> 
> There are some sneaky tricks that can be used to avoid some of the sync
> instructions (by creating false data dependencies), but this is probably
> a bad idea in general.   It certainly makes the intent of the code less
> clear, and can introduce extremely subtle coherency bugs that may only
> appear in future products.
> 
> > What if the application doing
> > the reading doesn't "know" the memory is being shared?
> 
> Then you almost certainly have an incorrect program.
> 
> > You said "use the locking sequences to synchronize access" (need to look
> > into those) but what does this do to the performance?  If I have a
> > multi-threaded app how can I ever be sure I'm getting a consistent
> > memory view without these?  Probably 99.9%+ of the time they wouldn't be
> > needed but how could you ever know?  If I'm using an HLL ("C" or C++) are
> > there language extensions to have the compiler generate the
> > synchronization necessary?
> 
> The most common programming models for shared memory use one of two
> approaches:
> 
> 1. Use explicit locks to ensure atomic access to each data structure in
> the shared memory space.   The locks are typically required for writing,
> but can sometimes be optional for readers -- e.g., check a "version
> number" of the record before and after the read to see if it was
> modified while you were reading it.
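The optional-reader variant in approach 1 is essentially what is now called a seqlock. A minimal single-writer sketch (names are illustrative; a fully rigorous C++ version would also make the payload accesses atomic):

```cpp
#include <atomic>

// "Version number" optimistic read, seqlock-style; single writer assumed.
struct Record {
    std::atomic<unsigned> version{0};  // even: stable, odd: write in progress
    int value = 0;                     // the protected payload
};

void update(Record& r, int v) {
    r.version.fetch_add(1, std::memory_order_acquire);  // now odd: write begins
    r.value = v;                                        // modify the payload
    r.version.fetch_add(1, std::memory_order_release);  // even again: write done
}

// Returns true and fills `out` only if no write overlapped the read.
bool optimistic_read(const Record& r, int& out) {
    unsigned before = r.version.load(std::memory_order_acquire);
    if (before & 1) return false;  // writer active; caller should retry
    out = r.value;
    std::atomic_thread_fence(std::memory_order_acquire);
    unsigned after = r.version.load(std::memory_order_relaxed);
    return before == after;  // version unchanged: snapshot is consistent
}
```
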
> 
> or
> 
> 2. Split program execution into large phases during which you are
> guaranteed that the data that is shared is either read or written, but
> not both. OpenMP is an example of such an approach, as are the whole
> class of Bulk Synchronous Parallel (BSP) programming models.
> 
> The performance penalties associated with fine-grain sharing pretty much
> force you to adopt some such coarse-grained approach.
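The phase structure in approach 2 can be sketched with plain thread joins acting as the phase boundary (an illustrative outline, not any particular OpenMP or BSP API):

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Phase 1: each worker writes only its own slice, so there is no
// fine-grain sharing. The joins form the phase boundary, after which
// Phase 2 only reads the shared data.
int phased_sum(int nworkers) {
    std::vector<int> slice(nworkers, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < nworkers; ++i)
        workers.emplace_back([&slice, i] { slice[i] = i + 1; });  // write phase
    for (auto& w : workers) w.join();  // phase boundary: all writes complete
    return std::accumulate(slice.begin(), slice.end(), 0);  // read phase
}
```
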
> 
> An alternate approach was taken by Burton Smith in the Tera MTA
> architecture -- each 64-bit memory location contained a "full/empty" bit
> that could be used to control synchronization with extremely fine
> granularity.   There are a variety of architectural approaches that are
> similar to "full/empty", such as FIFOs and "valid/invalid" bits.  We
> don't do any of those things here.
> 
> > Resource locking for multiple writers to shared memory is trivial
> > compared to this.  WOW - you're correct -- this is mind bending!  Now I
> > have a headache! :-)
> 
> It is really not so bad -- you just need to remember that all
> communication requires explicit synchronization.  The only ordering that
> the hardware enforces is for accesses by a single thread.   I.e., if a
> process writes a value to a (non-shared) memory location and then reads
> it later, it is guaranteed that the read will follow the write and
> obtain the correct value.
> 
> > - Richard
> >
> > Marc Auslander wrote:
> > > Richard Brandle <rbrandl at us.ibm.com> writes:
> > >
> > >
> > >>I'm trying to get my head around the Power5 architecture.  On first read
> > >>of the documentation they suggest memory consistency could be a problem.
> > >>  A read of different docs suggests it isn't.
> > >
> > > ...
> > >
> > >>the question.  Thread A reads memory location X and gets value V.  On a
> > >>different processor, Thread B writes memory location X with value W.  If
> > >>thread A (still running on its original processor) reads memory
> > >>location X again, does it get the cached value V or the actual memory
> > >>value W?
> > >>
> > >>This is NOT a question related to software locks on shared objects to
> > >>avoid simultaneous updates.  This is purely related to the hardware.
> > >>
> > >>- Richard
> > >
> > >
> > > This is a mind bending subject.  You ask about power5 - but I suggest
> > > you want to write programs which conform to the PPC architecture.  In
> > > the architecture documents, there is a lot of difficult verbiage about
> > > synchronization.
> > >
> > > In your example, the key issue is: what do you mean by A reads memory
> > > again?  In fact, I think you mean - A reads memory again AFTER B
> > > updates it.  Which leads to the question - what do you mean by after?
> > > One plausible answer is that A reads after the store has been
> > > "performed" IF it gets the new value!  AFAIK, every implementation
> > > eventually performs every store - although I've not  been able to find
> > > a guarantee of that in the architecture.  So, AFAIK, if you wait long
> > > enough A will fetch either the value stored by B or some other value
> > > stored after that.
> > >
> > > My advice - if at all possible, use the locking sequences given in the
> > > architecture to synchronize access to data across processors.
-------

< Forward Inline >

-------- Original Message --------
Newsgroups: forums.hardware.powerpc
Subject: Re: Apparent memory consistency in SMP Power5 environment
Date: Wed, 13 Apr 2005 17:19:17 +0200
Message-ID: <425D3875.C91D52A1 at web.de>
References: ... <mccalpin-B14CC6.14594712042005 at w3forums2b.pok.ibm.com>

PMFJI, but msync on Power is so 20th century...

John McCalpin wrote:
[...]
> The straightforward implementation is:
>  . producer writes the data it wants to send
>  . producer executes a "lightweight sync"  (lwsync) instruction

which is a "full" (apart from store-load) bidirectional fence. It's way 
too "heavy". A labeled store imposing a unidirectional "sinking" barrier 
for preceding stores, and nothing more, would be optimal here.

>  . producer writes the flag
> 
>   . consumer spins on the flag until it is set
>   . consumer executes an "isync" to discard any subsequent speculative

And why the heck discard anything? You merely need to impose ordering
of subsequent loads (stores aside for a moment) with respect to the load
on the flag. So just label it (the load on the flag) and impose a
unidirectional hoisting constraint on subsequent loads, without abusing a
context-synchronization instruction.
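In today's vocabulary (C++11 atomics, which arrived well after this post), the "labeled" operations being asked for are a store-release on the flag and a load-acquire of the flag: each a one-way constraint rather than a full fence or a context-synchronizing instruction. A sketch with illustrative names:

```cpp
#include <atomic>
#include <thread>

int payload = 0;
std::atomic<int> flag{0};

void producer() {
    payload = 99;
    // "Labeled" store: a one-way sinking constraint -- preceding stores may
    // not move below this store, and nothing more.
    flag.store(1, std::memory_order_release);
}

int consumer() {
    // "Labeled" load: a one-way hoisting constraint -- subsequent loads may
    // not move above this load; no isync-style discard is needed.
    while (flag.load(std::memory_order_acquire) == 0) {}
    return payload;
}

// Drives one round and returns what the consumer saw.
int labeled_round() {
    payload = 0;
    flag.store(0);
    int seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```
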

>   . consumer reads the data
> 
> There are some sneaky tricks that can be used to avoid some of the sync
> instructions (by creating false data dependencies), but this is probably
> a bad idea in general.   

The Safe Fetch ("branch never taken") trick to impose load->store
ordering is documented in Book II. 

Oh, Ah, BTW, what's the fuss with dangling larxs on 970? 

http://tinyurl.com/4gk8f

Top secret or what?

regards,
alexander.

--
http://tinyurl.com/67c7v
