[cpp-threads] Alternatives to SC

Paul E. McKenney paulmck at linux.vnet.ibm.com
Fri Jan 12 01:58:51 GMT 2007


On Thu, Jan 11, 2007 at 10:51:45AM -0800, Herb Sutter wrote:
> Thanks, Doug. I appreciate the effort you're putting into illuminating the key choices, and that you have vastly more experience at this than many of us, including me.
> 
> I'm not sold on even the motivation for CCCC. Let me try to summarize
> the key issues. If we allow "loose IRIW" examples:

I am anything but sold on SC, so we are even.

>   1. What are we giving up? To answer that, we need to be clear about
>      how much allowing IRIW may surprise programmers.

Since we don't have SC now, a more appropriate answer is "what would SC
buy us?".  Thus far, your answers have not been convincing to me.

>   2. What are we gaining? To answer that, we need to be clear about
>      the utility of allowing IRIW. Sorry if I missed it, but has anyone
>      done real measurements of that utility, and the actual performance
>      benefit for specific cases on specific architectures?

Similarly, given that we don't have SC now, a more appropriate question
is "how much do we lose by insisting on SC?".  The answer varies wildly
with the specific algorithm, but in my experience, an order of magnitude
is a good rule of thumb for software enforcement.  To see this, take
the example that Peter Dimov recently posted to this list and generalize
to a large number of threads -- have N threads write to N independent
non-false-shared variables, and have two other threads concurrently read
the variables in reverse order from each other, then check for consistency
with SC.  Run on an SMP machine with N+2 hardware threads.  For extra
credit, place the two reader threads on different dies, but each sharing
their die with some of the writer threads.

Do the two readers see results consistent with a single order for
all of the writes?  How much overhead do you have to add to get this
example to reliably produce results consistent with SC?

> Let me add some thoughts, mostly to clarify but in one case to
> complain about an unhelpful characterization in an otherwise very
> helpful email. Doug wrote:
> 
> > The question of whether "strong" volatiles/atomics are or should be
> > required to be sequentially consistent has been around for years. In
> > the absence of compelling evidence (at the time) about platform
> > limitations, the JMM adopted simplicity and defined them as SC.  Now
> > that processor vendors are slowly coming out with more believable and
> > precise specs, this is in the midst of being re-examined. And should
> > probably also be carefully thought out for C++.
> 
> I agree that it's being reexamined and should be carefully thought out.
> But I see the trend is toward SC, not away from it. I am seeing vendors,
> including Intel and ARM in private discussion over the past three months,
> who were non-SC or who had been planning to embark on a non-SC path now
> moving toward affirming they will support cheap ways to get SC. Hardware
> architects who were on the "weak" side are moving toward affirming that
> SC is reasonable and practical (including for performance), rather than
> the reverse. (I have in mind Mark Hill, Andy Glew, and others.)

Towards TSO, maybe.  In light of the example above, I have a hard
time believing they will really get all the way to SC.  However, I
am double-checking with my contacts at a number of prominent CPU and
systems vendors.

> > The issue tends to factionalize discussion.
> 
> Yes.
> 
> > SC purists cannot imagine weaker rules.
> 
> Come on, you know better. This characterization is dismissive, unhelpful, and untrue. Please stop repeating it.
> 
> We've lived with weaker rules for a generation, including probably millions of programmer-hours writing and painfully debugging many styles of explicit fences on many architectures. Asserting that people "cannot imagine" alternatives does a real disservice to those who've tried hard for years or decades to work with looser models and have come to the conclusion that we need SC. Besides many other good technical reasons, it's what programmers (including experts, and continually including ourselves as we have been discussing these examples in committee) expect even when they (we) know better.
> 
> The issue is not lack of imagination, nor is it any lack of experience with weaker alternatives. It is certainly not an issue of 'purity' or lack of imagination due to closed-mindedness or dimwittedness -- words more appropriate to dismiss others as zealots and fanatics.
> 
> > Hard-core concurrent programmers cannot understand what
> > all the fuss is about.
> 
> (One could s/concurrent/assembly/ here.)

Please re-read your own good advice above, and please start acting on it.

That said, I plead guilty to being a hard-core concurrent programmer.

> If it is a few dozen experts in the world who are able to do a very
> difficult task who cannot see what the trouble is for everyone
> else, part of the problem might lie in ivory towers or otherwise being
> out of touch. :-)

I would instead argue that there are a number of people who have been
attempting over the past 20 years to codify parallel programming down
at the CAS level.  To quote one Satnam Singh of Microsoft:

	Non-blocking code: we've tried it: it's mind-bending!

http://www.cs.rochester.edu/research/synchronization/2005-10_SCOOL_panel_slides.pdf

See slide 26.

You are asserting that SC will help us.  I remain unconvinced.  It seems
much more likely that higher levels of abstraction are what is needed
instead.  SC has certainly been shown to aid proofs of correctness, but
let's face the fact that any programming paradigm that requires proofs
of correctness from the typical practitioner is doomed to failure.

And, yes, SC does make construction of non-blocking algorithms easier,
but so what?  With a few heavily used exceptions, such algorithms are
so complex (even -with- SC) as to be useless in practice -- they require
proofs of correctness from their practitioners.

> Doug, you're a super expert on this. Didn't you also personally write the bulk of the JSR-133 implementation(s)'s code? This interaction with you reminds me of talking to Dave Cutler, who often sees himself as just another good programmer, but who happens to have written more major OS kernels than any other human. Dave has difficulty understanding why what's routine for him is impenetrable to others and why it isn't reasonable to expect the same of even normal 'expert' programmers.
> 
> BTW, this comparison is intended not only as a reality check for perspective, but as a sincere compliment.
> 
> > A few months ago, it occurred to me that some of the unproductiveness
> > of such discussions stems from not having a well-defined alternative
> > to SC. So I asked: what's the strongest model that appears to be
> > supported using solely loads, stores, and fences, across all
> > "reasonable" multiprocessors?  In other words, what is the model for a
> > program in which all loads and stores are issued in program-order and
> > are surrounded by the strongest possible fences?
> [...]
> > But the main thing to note is that it matches SC for all cases except
> > those involving independent reads of independent writes (IRIW --
> > see the derivation of the most famous example of this below.)
> >
> > If nothing else, I hope this focusses further discussion:
> 
> This is very helpful to be clearer about alternatives. Thank you very
> much for putting in this effort.

Whatever other disagreements I might have with Herb Sutter, I can only
second the above thanks to Doug Lea in the strongest possible terms.

> > Should sequential consistency be mandated for independent
> > reads of independent writes?
> >
> > My view is no -- that CCCC would be a fine basis for defining strong
> > volatiles/atomics. While I do appreciate the view
> > that SC is simpler to express and has a better chance of being
> > understood by programmers, constructions involving IRIW don't have any
> > reasonable sequential analogs anyway, so
> 
> Yes, they do: Manipulation of a monolithic shared memory. For example, a whiteboard with a single pen is a fine analog.
> 
> > there's no SC intuition that needs preserving.
> 
> This wrong belief is a key conceptual stumbling block -- and, frankly,
> a blind spot. For one thing, it ignores reality; time and again we've
> seen that programmers treat memory monolithically and expect those
> semantics. We keep falling into it ourselves every so often when try to
> apply looser rules to examples, even though we know better.

Perhaps it is worth considering higher-level abstractions, not necessarily
part of a low-level standard.

Better yet, let's all carefully review Doug's model before passing
judgement on it.

> > And as a working concurrent programmer, I think that
> > the extra bit of concurrent readability CCCC allows hardware to
> > exploit is a feature, not a bug.
> 
> With "concurrent readability" are you saying that the/an aim to improve concurrent read opportunities, or did I misunderstand? If so, I thought we're already far better at reads than writes. I might be more interested in a performance argument if it improved toleration of write latencies (if it didn't sacrifice the ability to have a usable programming model).
> 
> > (Note also that any program using locks (that are in turn created
> > using CCCC-compliant primitives) is SC. It's only volatiles/atomics
> > that allow concurrent independent reads.)
> 
> Granted. This whole discussion is about the programming model you
> have without locks, because locks are expensive and noncomposable enough
> that people do have legitimate reasons to want to avoid them. But those
> people who are trying to use lock-free to avoid the expense or potential
> deadlock of a lock should have a small step down to the next level --
> they aren't asking to jump into the deep end. If locks are the kids'
>  wading pool, SC atomics are the 4-foot shallow end (only adults can
> still walk), and explicit fences are open
> ocean. I'm not sure whether CCCC is at the 5-foot level where most
> adults can still keep their faces above water without swimming, or at
> 7+ feet where there's not much difference between 6-foot depth and the
> mid-Atlantic trench in terms of how a human deals with it.

You might want to look at RCU as an example of an abstraction that is much
easier to deal with than even SC, but that provides better performance
than does locking.  Used pretty heavily in the Linux kernel.

In a less self-serving vein, how about the SQL database query language?
All the difficult concurrency stuff happens in the database kernel,
freeing the programmer from all sorts of concerns.

My fear is that your fixation on SC is blinding you to the options that
will give large and very real productivity benefits, possibly also
coupled with performance increases.

> Bottom line: What are we giving up, and what are we gaining? (See top
> for the elaborated question.)

Yes.  What are we giving up and what are we gaining by moving from the
current state -- which is -not- SC?  You are proposing SC, so in my
opinion you owe us the data on what SC gives up and what it gains.

Not the other way around.

> > **** CCCC model snapshot ****
> 
> I do like a model that can be given pretty rigorously in one page. :-)
> 
> Thanks for the clear explanation of CCCC.

A couple more things that we agree on!  ;-)

							Thanx, Paul

> Herb
> 
> 
> >
> > NOTATION:
> >
> > G = (S,-->) is a set of nodes S and a binary (directed) relation --> on
> >     the nodes. Nodes are of type:
> >       e ranges over read and write events,
> >       r over read events and
> >       w over write events.
> > G is a partial order if --> is transitive and irreflexive.
> > G is a total order if --> is a partial order and for any two
> >       distinct nodes a and b in S, either a --> b or b --> a.
> > G | T is graph G restricted to subset T; that is, the graph with nodes
> >       in T and with a --> b in T iff a --> b in S.
> > G* is the transitive closure of graph G.
> > 'u' is set union
> >
> > p(e) is the processor which executed event e,
> > E(p) is the set of events executed by processor p.
> > W(x) is the set of all write events on variable x
> > W    is the set of all write events (across all variables)
> > w(S) is set S together with all elements of W with an edge into or out
> > of S.
> >
> > RULES:
> >
> > An execution is a graph G=(S, -->) over a set of events S satisfying:
> >
> > [Write Serialization]
> >    For each variable x, G | W(x) is a total order, with minimal
> >    element w0 setting x to 0.
> >
> > [Freshness]
> >    For every read r of x, min(W(x)) --> r and G | (W(x) u {r}) is a
> >    total order.
> >
> > [Causal Propagation]
> >    e --> e'' if there is an e' in E(p(e)) u E(p(e'')) st e --> e' -->
> > e'' .
> >
> > [Local Consistency]
> >    For all processors i, (G | w(E(i)))* is a partial order (i.e., is
> > acyclic).
> >    Additionally, volatiles obey local program order:
> >      For all processors p, G | E(p) is a total order.
> >
> >
> > EXAMPLES:
> >
> > 1. Dekker (forbidden outcome)
> >
> > Given
> >
> > T1             T2
> > [1] x = 1      [3] y = 1
> > [2] r1 = y(0)  [4] r2 = x(0)
> >
> > Program Order gives
> >    [1] --> [2], [3] --> [4],
> > Write Serialization and Freshness give
> >    [2] --> [3], [4] --> [1].
> > Propagation adds
> >    [3] --> [1], [1] --> [3] (and some other edges).
> >
> > This graph is not an execution because (G | {[1],[3],[4]})* has a
> > cycle ([1] --> [3] --> [1])
> >
> > ======
> >
> > 2. IRIW (Independent reads of independent writes -- allowed outcome)
> >     Note that this is not SC.
> >
> >
> > T1          T2         T3                T4
> >
> > [1] x = 1   [2] y = 1
> >
> >                        [3] r1 = x (1)    [5] r3 = y (1)
> >                        [4] r2 = y (0)    [6] r4 = x (0)
> >
> > Program Order, Write Serialization, and Freshness give:
> >    [1] --> [3] --> [4] --> [2]
> >    [2] --> [5] --> [6] --> [1]
> > Propagation adds
> >    [2] --> [6], [5] --> [1] and [1] --> [4], [3] --> [2].
> >
> > This graph satisfies all these conditions. (Note that [1] --> [2] is
> > in (G | W u E(T2))* and [2] --> [1] is in (G | W u E(T1))*, but both
> > these graphs are separately acyclic.)
> >
> > =======
> >
> > 3. CC (Causal consistency -- forbidden outcome)
> >
> > T1           T2               T3
> > [1] x = 1
> >              [2] r1 = x (1)   [4] r2 = y (1)
> >              [3] y = 1        [5] r3 = x (0)
> >
> > Program Order, Write Serialization, and Freshness give:
> >   [1] --> [2] --> [3] --> [4] --> [5] --> [1].
> > This is not an execution because
> >   G | W u E(T3) = G | {[1],[3],[4],[5]}
> > has cycle [1] --> [3] --> [4] --> [5] --> [1].
> >
> > =======
> >
> > 4. DC (Another causal consistency example -- forbidden outcome)
> >
> > T1           T2               T3
> > [1] y = 1
> >              [2] x = 1        [4] r3 = y (1)
> >              [3] r2 = y (0)   [5] r4 = x (0)
> >
> > Program Order, Write Serialization, and Freshness give:
> >    [1] -> [4] --> [5] --> [2] --> [3] --> [1]
> > Propagation adds
> >    [2] --> [1] (among others).
> > This is not an execution because
> >    G | W u E(T3) = G | {[1],[2],[4],[5]}
> > has the cycle [1] --> [4] --> [5] --> [2] --> [1].
> >
> >
> >
> >
> >
> > --
> > cpp-threads mailing list
> > cpp-threads at decadentplace.org.uk
> > http://www.decadentplace.org.uk/cgi-bin/mailman/listinfo/cpp-threads
> 


