[cpp-threads] modes, pass 2
Alexander Terekhov
alexander.terekhov at gmail.com
Wed May 11 14:14:19 BST 2005
On 5/11/05, Alexander Terekhov <alexander.terekhov at gmail.com> wrote:
[...]
> > > No comment. ;-) Well, but if you have access to IBM's w3forums, see my
> > > "PMFJI" message in "Apparent memory consistency in SMP Power5
> > > environment" thread at forums.hardware.powerpc...
> > >
> >
> > I don't think I do.
> > If you think it would help clear things up for the (many!) people
> > who read cookbook, could you please ask permission to send
> > me whatever clarifications this entails?
>
> Well, I've just rambled a bit in reply to John McCalpin's message
> about memory model on Power and lwsync + isync. I'll ask him if
> it's OK to <Forward Quoted> his message to this list.
No problem, says John.
< Forward Quoted > (my "PMFJI" reply is <Forward Inline>'d below it)
-------
John McCalpin wrote:
>
> In article <d29nrv$ik6$1 at w3forums1b.pok.ibm.com>,
> Richard Brandle <rbrandl at us.ibm.com> wrote:
>
> > If I interpret your reply correctly you are saying memory will not
> > appear consistent across processors for some unspecified period of time
> > unless the application does something to ensure the consistency. OUCH!!
>
> This is a fundamental attribute of any weakly consistent memory model.
> The PowerPC memory model is one of the weak ones, and this was chosen
> deliberately to allow the hardware a great deal of flexibility in
> dealing with the case where ordering between references is not actually
> required.
>
> > Put chronologically, A reads the value, B modifies the value, A reads
> > the value again and may or may not get the updated value. B reads the
> > value again and gets the updated value. A reads the value a third time
> > and still may or may not get the updated value. For some undefined
> > period of time the view of the memory content is not the same between
> > the two processors. Is that correct? If so that's ugly! Other SMP
> > architectures don't act this way do they?
>
> This is going to be the case on almost any architecture -- even the most
> strongly ordered ones require additional synchronization.
>
> The usual sequence in a producer/consumer relationship is
> . producer writes the data that it wants to send
> . producer writes to a separate memory location that acts as a "flag"
> that says the real data is ready
>
> . the consumer spins on the "flag" until it is reset
> . then the consumer can read the data
>
> Of course locks are a more general case of "flags", but for the simple
> case of pairwise communication you don't need the full overhead of a
> lock.
>
> On PowerPC systems, extra steps must be taken to ensure that the items
> above happen in precisely the correct sequence.
> The straightforward implementation is:
> . producer writes the data it wants to send
> . producer executes a "lightweight sync" (lwsync) instruction
> . producer writes the flag
>
> . consumer spins on the flag until it is reset
> . consumer executes an "isync" to discard any subsequent speculative
> memory reads
> . consumer reads the data
>
> There are some sneaky tricks that can be used to avoid some of the sync
> instructions (by creating false data dependencies), but this is probably
> a bad idea in general. It certainly makes the intent of the code less
> clear, and can introduce extremely subtle coherency bugs that may only
> appear in future products.
>
> > What if the application doing
> > the reading doesn't "know" the memory is being shared?
>
> Then you almost certainly have an incorrect program.
>
> > You said "use the locking sequences to synchronize access" (need to look
> > into those) but what does this do to the performance? If I have a
> > multi-threaded app how can I ever be sure I'm getting a consistent
> > memory view without these? Probably 99.9%+ of the time they wouldn't be
> > needed but how could you ever know? If I'm using a HLL ("C" or C++) are
> > there language extensions to have the compiler generate the
> > synchronization necessary?
>
> The most common programming models for shared memory use one of two
> approaches:
>
> 1. Use explicit locks to ensure atomic access to each data structure in
> the shared memory space. The locks are typically required for writing,
> but can sometimes be optional for readers -- e.g., check a "version
> number" of the record before and after the read to see if it was
> modified while you were reading it.
>
> or
>
> 2. Split program execution into large phases during which you are
> guaranteed that the data that is shared is either read or written, but
> not both. OpenMP is an example of such an approach, as are the whole
> class of Bulk Synchronous Parallel (BSP) programming models.
>
> The performance penalties associated with fine-grain sharing pretty much
> force you to adopt some such coarse-grained approach.
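[Editorial aside: the optional reader-side "version number" check in approach 1 is essentially what is now called a seqlock. A minimal single-writer sketch in C++11 terms — the type and names here are mine, not part of the thread:]

```cpp
#include <atomic>

// Single-writer record with a version counter: the version is odd while
// a write is in progress. A reader that sees the same even version both
// before and after its read got a consistent snapshot without locking.
struct VersionedRecord {
    std::atomic<unsigned> version{0};
    std::atomic<int> value{0};  // the record payload

    void write(int v) {
        version.fetch_add(1, std::memory_order_relaxed);     // now odd
        std::atomic_thread_fence(std::memory_order_release);
        value.store(v, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        version.fetch_add(1, std::memory_order_relaxed);     // even: done
    }

    int read() const {  // lock-free reader: retry until the version is stable
        for (;;) {
            unsigned before = version.load(std::memory_order_acquire);
            int snapshot = value.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            unsigned after = version.load(std::memory_order_relaxed);
            if (before == after && (before & 1u) == 0)
                return snapshot;  // no write overlapped the read
        }
    }
};
```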
>
> An alternate approach was taken by Burton Smith in the Tera MTA
> architecture -- each 64-bit memory location contained a "full/empty" bit
> that could be used to control synchronization with extremely fine
> granularity. There are a variety of architectural approaches that are
> similar to "full/empty", such as FIFOs and "valid/invalid" bits. We
> don't do any of those things here.
>
> > Resource locking for multiple writers to shared memory is trivial
> > compared to this. WOW - you're correct -- this is mind bending! Now I
> > have a headache! :-)
>
> It is really not so bad -- you just need to remember that all
> communication requires explicit synchronization. The only ordering that
> the hardware enforces is for accesses by a single thread. I.e., if a
> process writes a value to a (non-shared) memory location and then reads
> it later, it is guaranteed that the read will follow the write and
> obtain the correct value.
>
> > - Richard
> >
> > Marc Auslander wrote:
> > > Richard Brandle <rbrandl at us.ibm.com> writes:
> > >
> > >
> > >>I'm trying to get my head around the Power5 architecture. On first read
> > >>of the documentation they suggest memory consistency could be a problem.
> > >> A read of different docs suggests it isn't.
> > >
> > > ...
> > >
> > >>the question. Thread A reads memory location X and gets value V. On a
> > >>different processor, Thread B writes memory location X with value W. If
> > >>thread A (still running on its original processor) reads memory
> > >>location X again, does it get the cached value V or the actual memory
> > >>value W?
> > >>
> > >>This is NOT a question related to software locks on shared objects to
> > >>avoid simultaneous updates. This is purely related to the hardware.
> > >>
> > >>- Richard
> > >
> > >
> > > This is a mind bending subject. You ask about power5 - but I suggest
> > > you want to write programs which conform to the PPC architecture. In
> > > the architecture documents, there is a lot of difficult verbiage about
> > > synchronization.
> > >
> > > In your example, the key issue is, what do you mean by A reads memory
> > > again? In fact, I think you mean - A reads memory again AFTER B
> > > updates it. Which leads to the question - what do you mean by after?
> > > One plausible answer is that A reads after the store has been
> > > "performed" IF it gets the new value! AFAIK, every implementation
> > > eventually performs every store - although I've not been able to find
> > > a guarantee of that in the architecture. So, AFAIK, if you wait long
> > > enough A will fetch either the value stored by B or some other value
> > > stored after that.
> > >
> > > My advice - if at all possible, use the locking sequences given in the
> > > architecture to synchronize access to data across processors.
-------
< Forward Inline >
-------- Original Message --------
Newsgroups: forums.hardware.powerpc
Subject: Re: Apparent memory consistency in SMP Power5 environment
Date: Wed, 13 Apr 2005 17:19:17 +0200
Message-ID: <425D3875.C91D52A1 at web.de>
References: ... <mccalpin-B14CC6.14594712042005 at w3forums2b.pok.ibm.com>
PMFJI, but msync on Power is so 20th century...
John McCalpin wrote:
[...]
> The straightforward implementation is:
> . producer writes the data it wants to send
> . producer executes a "lightweight sync" (lwsync) instruction
which is a "full" (apart from store-load) bidirectional fence. It's way
too "heavy". A labeled store imposing a unidirectional "sinking" barrier
for preceding stores, and nothing more, would be optimal here.
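[In later C++11 vocabulary — an anachronistic mapping of mine, not part of the original exchange — the "labeled store" asked for here is a store-release, in contrast with the standalone fence that lwsync provides:]

```cpp
#include <atomic>

std::atomic<int> flag{0};
int data = 0;

// Fence-style producer, as in the quoted sequence: the standalone release
// fence (lwsync on POWER) constrains all surrounding accesses, both ways
// apart from store-load, whether they need ordering or not.
void producer_with_fence() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release); // ~ lwsync
    flag.store(1, std::memory_order_relaxed);
}

// "Labeled store" producer: the ordering is attached to the flag store
// itself, as a one-way sinking constraint on preceding stores only.
void producer_with_labeled_store() {
    data = 42;
    flag.store(1, std::memory_order_release);
}
```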
> . producer writes the flag
>
> . consumer spins on the flag until it is reset
> . consumer executes an "isync" to discard any subsequent speculative
And why the heck discard anything? You merely need to impose ordering
of subsequent loads (stores aside for a moment) with respect to the load
on the flag. So just label it (the load on the flag) and impose a
unidirectional hoisting constraint for subsequent loads, without abuse
of a context synchronization instruction.
> . consumer reads the data
>
> There are some sneaky tricks that can be used to avoid some of the sync
> instructions (by creating false data dependencies), but this is probably
> a bad idea in general.
Safe Fetch ("branch never taken" trick) to impose load->store
ordering is documented in Book II.
Oh, Ah, BTW, what's the fuss with dangling larxs on 970?
http://tinyurl.com/4gk8f
Top secret or what?
regards,
alexander.
--
http://tinyurl.com/67c7v