[cpp-threads] Alternatives to SC

Herb Sutter hsutter at microsoft.com
Thu Jan 11 18:51:45 GMT 2007


Thanks, Doug. I appreciate the effort you're putting into illuminating the key choices, and that you have vastly more experience at this than many of us, including me.

I'm not sold on even the motivation for CCCC. Let me try to summarize the key issues. If we allow "loose IRIW" examples:

  1. What are we giving up? To answer that, we need to be clear about how much allowing IRIW may surprise programmers.

  2. What are we gaining? To answer that, we need to be clear about the utility of allowing IRIW. Sorry if I missed it, but has anyone done real measurements of that utility, and the actual performance benefit for specific cases on specific architectures?


Let me add some thoughts, mostly to clarify but in one case to complain about an unhelpful characterization in an otherwise very helpful email. Doug wrote:

> The question of whether "strong" volatiles/atomics are or should be
> required to be sequentially consistent has been around for years. In
> the absence of compelling evidence (at the time) about platform
> limitations, the JMM adopted simplicity and defined them as SC.  Now
> that processor vendors are slowly coming out with more believable and
> precise specs, this is in the midst of being re-examined. And should
> probably also be carefully thought out for C++.

I agree that it's being reexamined and should be carefully thought out. But the trend I see is toward SC, not away from it. I am seeing vendors, including Intel and ARM in private discussion over the past three months, who were non-SC or who had been planning to embark on a non-SC path, now moving toward affirming that they will support cheap ways to get SC. Hardware architects who were on the "weak" side are moving toward affirming that SC is reasonable and practical (including for performance), rather than the reverse. (I have in mind Mark Hill, Andy Glew, and others.)

> The issue tends to factionalize discussion.

Yes.

> SC purists cannot imagine weaker rules.

Come on, you know better. This characterization is dismissive, unhelpful, and untrue. Please stop repeating it.

We've lived with weaker rules for a generation, including probably millions of programmer-hours writing and painfully debugging many styles of explicit fences on many architectures. Asserting that people "cannot imagine" alternatives does a real disservice to those who've tried hard for years or decades to work with looser models and have come to the conclusion that we need SC. Besides many other good technical reasons, it's what programmers (including experts, and continually including ourselves as we have been discussing these examples in committee) expect even when they (we) know better.

The issue is not lack of imagination, nor is it any lack of experience with weaker alternatives. It is certainly not an issue of 'purity,' closed-mindedness, or dimwittedness -- words more appropriate for dismissing others as zealots and fanatics.

> Hard-core concurrent programmers cannot understand what
> all the fuss is about.

(One could s/concurrent/assembly/ here.)

If the few dozen experts in the world who are able to do a very difficult task cannot see what the trouble is for everyone else, part of the problem might lie in ivory towers, or otherwise in being out of touch. :-)

Doug, you're a super expert on this. Didn't you also personally write the bulk of the code for the JSR-133 implementation(s)? Interacting with you here reminds me of talking to Dave Cutler, who often sees himself as just another good programmer, but who happens to have written more major OS kernels than any other human. Dave has difficulty understanding why what's routine for him is impenetrable to others, and why it isn't reasonable to expect the same even of normal 'expert' programmers.

BTW, this comparison is intended not only as a reality check for perspective, but as a sincere compliment.

> A few months ago, it occurred to me that some of the unproductiveness
> of such discussions stems from not having a well-defined alternative
> to SC. So I asked: what's the strongest model that appears to be
> supported using solely loads, stores, and fences, across all
> "reasonable" multiprocessors?  In other words, what is the model for a
> program in which all loads and stores are issued in program-order and
> are surrounded by the strongest possible fences?
[...]
> But the main thing to note is that it matches SC for all cases except
> those involving independent reads of independent writes (IRIW --
> see the derivation of the most famous example of this below.)
>
> If nothing else, I hope this focusses further discussion:

It is very helpful to be clearer about the alternatives. Thank you very much for putting in this effort.

> Should sequential consistency be mandated for independent
> reads of independent writes?
>
> My view is no -- that CCCC would be a fine basis for defining strong
> volatiles/atomics. While I do appreciate the view
> that SC is simpler to express and has a better chance of being
> understood by programmers, constructions involving IRIW don't have any
> reasonable sequential analogs anyway, so

Yes, they do: Manipulation of a monolithic shared memory. For example, a whiteboard with a single pen is a fine analog.

> there's no SC intuition that needs preserving.

This wrong belief is a key conceptual stumbling block -- and, frankly, a blind spot. For one thing, it ignores reality; time and again we've seen that programmers treat memory monolithically and expect those semantics. We keep falling into it ourselves every so often when we try to apply looser rules to examples, even though we know better.

> And as a working concurrent programmer, I think that
> the extra bit of concurrent readability CCCC allows hardware to
> exploit is a feature, not a bug.

By "concurrent readability," are you saying that the aim is to improve concurrent read opportunities, or did I misunderstand? If the former: I thought we were already far better at reads than at writes. I might be more interested in a performance argument if it improved toleration of write latencies (provided it didn't sacrifice the ability to have a usable programming model).

> (Note also that any program using locks (that are in turn created
> using CCCC-compliant primitives) is SC. It's only volatiles/atomics
> that allow concurrent independent reads.)

Granted. This whole discussion is about the programming model you have without locks, because locks are expensive and noncomposable enough that people do have legitimate reasons to want to avoid them. But those who are trying to use lock-free techniques to avoid the expense or potential deadlock of a lock should have a small step down to the next level -- they aren't asking to jump into the deep end. If locks are the kids' wading pool, SC atomics are the 4-foot shallow end (kids can no longer touch bottom, but adults can still walk), and explicit fences are the open ocean. I'm not sure whether CCCC is at the 5-foot level, where most adults can still keep their faces above water without swimming, or at 7+ feet, where there's not much difference between that and the mid-Atlantic trench in terms of how a human deals with it.

Bottom line: What are we giving up, and what are we gaining? (See top for the elaborated question.)


> **** CCCC model snapshot ****

I do like a model that can be given pretty rigorously in one page. :-)

Thanks for the clear explanation of CCCC.

Herb


>
> NOTATION:
>
> G = (S,-->) is a set of nodes S and a binary (directed) relation --> on
>     the nodes. Nodes are of type:
>       e ranges over read and write events,
>       r over read events and
>       w over write events.
> G is a partial order if --> is transitive and irreflexive.
> G is a total order if --> is a partial order and for any two
>       distinct nodes a and b in S, either a --> b or b --> a.
> G | T is graph G restricted to subset T; that is, the graph with nodes
>       in T and with a --> b in T iff a --> b in S.
> G* is the transitive closure of graph G.
> 'u' is set union
>
> p(e) is the processor which executed event e,
> E(p) is the set of events executed by processor p.
> W(x) is the set of all write events on variable x
> W    is the set of all write events (across all variables)
> w(S) is set S together with all elements of W with an edge into or out
> of S.
>
> RULES:
>
> An execution is a graph G=(S, -->) over a set of events S satisfying:
>
> [Write Serialization]
>    For each variable x, G | W(x) is a total order, with minimal
>    element w0 setting x to 0.
>
> [Freshness]
>    For every read r of x, min(W(x)) --> r and G | (W(x) u {r}) is a
>    total order.
>
> [Causal Propagation]
>    e --> e'' if there is an e' in E(p(e)) u E(p(e'')) such that
>    e --> e' --> e''.
>
> [Local Consistency]
>    For all processors i, (G | w(E(i)))* is a partial order (i.e., is
>    acyclic).
>    Additionally, volatiles obey local program order:
>      For all processors p, G | E(p) is a total order.
>
>
> EXAMPLES:
>
> 1. Dekker (forbidden outcome)
>
> Given
>
> T1             T2
> [1] x = 1      [3] y = 1
> [2] r1 = y(0)  [4] r2 = x(0)
>
> Program Order gives
>    [1] --> [2], [3] --> [4],
> Write Serialization and Freshness give
>    [2] --> [3], [4] --> [1].
> Propagation adds
>    [3] --> [1], [1] --> [3] (and some other edges).
>
> This graph is not an execution because (G | {[1],[3],[4]})* has a
> cycle ([1] --> [3] --> [1])
>
> ======
>
> 2. IRIW (Independent reads of independent writes -- allowed outcome)
>     Note that this is not SC.
>
>
> T1          T2         T3                T4
>
> [1] x = 1   [2] y = 1
>
>                        [3] r1 = x (1)    [5] r3 = y (1)
>                        [4] r2 = y (0)    [6] r4 = x (0)
>
> Program Order, Write Serialization, and Freshness give:
>    [1] --> [3] --> [4] --> [2]
>    [2] --> [5] --> [6] --> [1]
> Propagation adds
>    [2] --> [6], [5] --> [1] and [1] --> [4], [3] --> [2].
>
> This graph satisfies all these conditions. (Note that [1] --> [2] is
> in (G | W u E(T2))* and [2] --> [1] is in (G | W u E(T1))*, but both
> these graphs are separately acyclic.)
>
> =======
>
> 3. CC (Causal consistency -- forbidden outcome)
>
> T1           T2               T3
> [1] x = 1
>              [2] r1 = x (1)   [4] r2 = y (1)
>              [3] y = 1        [5] r3 = x (0)
>
> Program Order, Write Serialization, and Freshness give:
>   [1] --> [2] --> [3] --> [4] --> [5] --> [1].
> This is not an execution because
>   G | W u E(T3) = (G | {[1],[3],[4],[5]})*
> has the cycle [1] --> [3] --> [4] --> [5] --> [1].
>
> =======
>
> 4. DC (Another causal consistency example -- forbidden outcome)
>
> T1           T2               T3
> [1] y = 1
>              [2] x = 1        [4] r3 = y (1)
>              [3] r2 = y (0)   [5] r4 = x (0)
>
> Program Order, Write Serialization, and Freshness give:
>    [1] --> [4] --> [5] --> [2] --> [3] --> [1]
> Propagation adds
>    [2] --> [1] (among others).
> This is not an execution because
>    G | W u E(T3) = G | {[1],[2],[4],[5]}
> has the cycle [1] --> [4] --> [5] --> [2] --> [1].
>
> --
> cpp-threads mailing list
> cpp-threads at decadentplace.org.uk
> http://www.decadentplace.org.uk/cgi-bin/mailman/listinfo/cpp-threads


