[cpp-threads] Alternatives to SC

Doug Lea dl at cs.oswego.edu
Thu Jan 11 13:01:27 GMT 2007


[I've been meaning to post this for a month or so, but have been
diverted doing other things...]

The question of whether "strong" volatiles/atomics are or should be
required to be sequentially consistent has been around for years. In
the absence of compelling evidence (at the time) about platform
limitations, the JMM adopted simplicity and defined them as SC.  Now
that processor vendors are slowly coming out with more believable and
precise specs, this is in the midst of being re-examined. And should
probably also be carefully thought out for C++.

The issue tends to factionalize discussion. SC purists cannot imagine
weaker rules. Hard-core concurrent programmers cannot understand what
all the fuss is about.

A few months ago, it occurred to me that some of the unproductiveness
of such discussions stems from not having a well-defined alternative
to SC. So I asked: what's the strongest model that appears to be
supported using solely loads, stores, and fences, across all
"reasonable" multiprocessors?  In other words, what is the model for a
program in which all loads and stores are issued in program order and
are surrounded by the strongest possible fences?

The answer to this, along with extensions (for mixtures with
non-volatiles/atomics, CAS, etc.) is still a work in progress, but
be an attractive option for memory models:
      Cache-coherent causal consistency.

This model combines two properties that not only does everyone think
are necessary, but that also seem to be universally supported. It would
be nice if the hardware specs would come out and explicitly say so.

Cache coherence in turn has two parts:
   * Globally totally ordered writes
   * Fresh reads (never read a value that has been overwritten)
As does causal consistency:
   * Causality:
      If p2 sees p1's write, and p3 sees p2's write, p3 sees p1's write
   * Local acyclicity: Each thread sees an acyclic order of events that
      is consistent with program order.

(Since starting on this, I've found that this model has been around in
various guises for over 20 years, but apparently never fleshed out enough
to serve as a basis for language-level memory models.)

The current best formal version of the model is appended below.
(Thanks mainly to Vijay Saraswat for help with this.)
Like any memory model, it needs a bunch of notation to get across.
But the main thing to note is that it matches SC for all cases except
those involving independent reads of independent writes (IRIW --
see the derivation of the most famous example of this below.)

If nothing else, I hope this focuses further discussion:

Should sequential consistency be mandated for independent
reads of independent writes?

My view is no -- that CCCC would be a fine basis for defining strong
volatiles/atomics. While I do appreciate the view
that SC is simpler to express and has a better chance of being
understood by programmers, constructions involving IRIW don't have any
reasonable sequential analogs anyway, so there's no SC intuition that
needs preserving. And as a working concurrent programmer, I think that
the extra bit of concurrent readability CCCC allows hardware to
exploit is a feature, not a bug.  And I worry that if language-level
models do require SC, then compliance checking is going to be
problematic. It's not clear whether some processors can support it
without resorting to awful things like using CAS or LL/SC for each
load. And if processors are required to do so, then programmers will
routinely find some cheaper workaround hacks that are unlikely to be as
well-defined or correct as plain CCCC versions would be.

(Note also that any program using locks (that are in turn created
using CCCC-compliant primitives) is SC. It's only volatiles/atomics that
allow concurrent independent reads.)


**** CCCC model snapshot ****

NOTATION:

G = (S,-->) is a graph with a set of nodes S and a binary (directed)
    relation --> on the nodes. Nodes are of type:
      e ranges over read and write events,
      r over read events and
      w over write events.
G is a partial order if --> is transitive and irreflexive.
G is a total order if --> is a partial order and for any two
      distinct nodes a and b in S, either a --> b or b --> a.
G | T is graph G restricted to subset T; that is, the graph with nodes
      in T and with a --> b in G | T iff a --> b in G.
G* is the transitive closure of graph G.
'u' is set union

p(e) is the processor which executed event e,
E(p) is the set of events executed by processor p.
W(x) is the set of all write events on variable x
W    is the set of all write events (across all variables)
w(S) is set S together with all elements of W with an edge into or out of S.

RULES:

An execution is a graph G=(S, -->) over a set of events S satisfying:

[Write Serialization]
   For each variable x, G | W(x) is a total order, with minimal
   element w0 setting x to 0.

[Freshness]
   For every read r of x, min(W(x)) --> r and G | (W(x) u {r}) is a
   total order.

[Causal Propagation]
   e --> e'' if there is an e' in E(p(e)) u E(p(e'')) such that
   e --> e' --> e''.

[Local Consistency]
   For all processors i, (G | w(E(i)))* is a partial order (i.e., is acyclic).
   Additionally, volatiles obey local program order:
     For all processors p, G | E(p) is a total order.


EXAMPLES:

1. Dekker (forbidden outcome)

Given

T1             T2
[1] x = 1      [3] y = 1
[2] r1 = y(0)  [4] r2 = x(0)

Program Order gives
   [1] --> [2], [3] --> [4],
Write Serialization and Freshness give
   [2] --> [3], [4] --> [1].
Propagation adds
   [3] --> [1], [1] --> [3] (and some other edges).

This graph is not an execution because (G | {[1],[3],[4]})* has a
cycle ([1] --> [3] --> [1]).

======

2. IRIW (Independent reads of independent writes -- allowed outcome)
    Note that this is not SC.


T1          T2         T3                T4

[1] x = 1   [2] y = 1

                       [3] r1 = x (1)    [5] r3 = y (1)
                       [4] r2 = y (0)    [6] r4 = x (0)

Program Order, Write Serialization, and Freshness give:
   [1] --> [3] --> [4] --> [2]
   [2] --> [5] --> [6] --> [1]
Propagation adds
   [2] --> [6], [5] --> [1] and [1] --> [4], [3] --> [2].

This graph satisfies all these conditions. (Note that [1] --> [2] is
in (G | W u E(T2))* and [2] --> [1] is in (G | W u E(T1))*, but both
these graphs are separately acyclic.)

=======

3. CC (Causal consistency -- forbidden outcome)

T1           T2               T3
[1] x = 1
             [2] r1 = x (1)   [4] r2 = y (1)
             [3] y = 1        [5] r3 = x (0)

Program Order, Write Serialization, and Freshness give:
  [1] --> [2] --> [3] --> [4] --> [5] --> [1].
This is not an execution because
  (G | W u E(T3))* = (G | {[1],[3],[4],[5]})*
has cycle [1] --> [3] --> [4] --> [5] --> [1].

=======

4. DC (Another causal consistency example -- forbidden outcome)

T1           T2               T3
[1] y = 1
             [2] x = 1        [4] r3 = y (1)
             [3] r2 = y (0)   [5] r4 = x (0)

Program Order, Write Serialization, and Freshness give:
   [1] --> [4] --> [5] --> [2] --> [3] --> [1]
Propagation adds
   [2] --> [1] (among others).
This is not an execution because
   G | W u E(T3) = G | {[1],[2],[4],[5]}
has the cycle [1] --> [4] --> [5] --> [2] --> [1].
