[cpp-threads] Review comments on N2177
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Wed Apr 4 00:44:31 BST 2007
First, I want to thank Hans for his careful exposition of his position
on several atomic-variable topics.
Here are some responses, section by section.
Thanx, Paul
------------------------------------------------------------------------
Simple rules for programmers
o The lack of any real-world code that depends on sequential
consistency should give us pause. We cannot predict what
tradeoffs hardware manufacturers will face 10 years from
now, so it seems foolish to impose restrictions that are not
firmly based on current practice. However, this topic cannot
be fully addressed in a single email. More on this later.
o The assertion that atomic variables should be used by
a significant number of programmers assumes (I believe
incorrectly) that most uses cannot be subsumed into
higher-level APIs. The productivity and understandability
benefits of such higher-level APIs cannot be overstated.
Use of atomics
o Simple counters implemented as split counters (either per-task
or per-CPU, depending on the environment) can be implemented
safely without any kind of synchronization, as noted in N2153.
Global simple counters fall into the first class of definable
data races in my earlier email. In other words, I believe that
we can freely choose whether or not this use is defined.
o I agree that garbage collection is helpful when using RCU-like
atomic pointer updates. The Linux-kernel implementation of
RCU in fact supplies functionality that can be (very loosely)
thought of as a garbage collector. This functionality can
be implemented at user level, and has in fact been prototyped
at user level for benchmarking purposes.
That said, as far as I know, there are no user-level
implementations of RCU currently used in production.
Please note that RCU is an example of a higher-level API. Developers
using this API need not use explicit memory barriers, because
the needed barriers are subsumed into the primitives making
up the RCU API.
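As a rough illustration of the split-counter point in the first bullet
above, here is a minimal sketch. It assumes the std::atomic interface
that was later standardized in C++11, not anything specified in N2177
or N2153, and the names Slot, SplitCounter, and run_split_counter are
hypothetical. Each thread updates only its own slot using relaxed loads
and stores (no read-modify-write instruction and no fence), and a reader
sums the slots.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread slot. The trailing pad keeps the counters of
// different threads at least 64 bytes apart to limit false sharing
// (the 64-byte figure is an assumption about cache-line size).
struct Slot {
    std::atomic<unsigned long> value{0};
    char pad[64];
};

// Hypothetical split counter: one slot per thread.
class SplitCounter {
public:
    explicit SplitCounter(std::size_t nthreads) : slots_(nthreads) {}

    // Must be called only by the thread that owns slot `id`.
    // A relaxed load followed by a relaxed store is sufficient because
    // no other thread ever writes this slot.
    void increment(std::size_t id) {
        std::atomic<unsigned long>& v = slots_[id].value;
        v.store(v.load(std::memory_order_relaxed) + 1,
                std::memory_order_relaxed);
    }

    // May be called by any thread. While updaters are running, the sum
    // is only an approximate snapshot.
    unsigned long read() const {
        unsigned long sum = 0;
        for (const Slot& s : slots_)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }

private:
    std::vector<Slot> slots_;
};

// Hypothetical driver: each of `nthreads` threads increments its own
// slot `per_thread` times, and the total is read after all have joined.
unsigned long run_split_counter(std::size_t nthreads, unsigned long per_thread) {
    SplitCounter counter(nthreads);
    std::vector<std::thread> workers;
    for (std::size_t id = 0; id < nthreads; ++id) {
        workers.emplace_back([&counter, id, per_thread] {
            for (unsigned long i = 0; i < per_thread; ++i)
                counter.increment(id);
        });
    }
    for (std::thread& t : workers)
        t.join();
    return counter.read();
}
```

Once the updating threads have been joined, read() returns the exact
total. A read() that runs concurrently with updates may miss increments
that are still in flight, which is the usual tradeoff for this style of
counter.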
Performance of Sequentially Consistent Atomics
o Acquire/release semantics for atomics is OK as long as there
is a load_relaxed()/store_relaxed() mechanism that can be used
when needed.
o I most certainly agree that the IRIW property is by far the most
controversial. ;-)
I am not yet convinced of the perceived teachability benefits
of IRIW. My experience has been that most people intuitively
understand that events that take place at about the same time but
at different locations (e.g., different CPUs) might be difficult
to sort into a globally agreed-on order.
People who have dealt with time synchronization, either in
clusters of computers or in real life, have an especially easy
time grasping this point. Yes, atomic clocks have existed for
some decades, but most people don't have access to them. As I
type the period at the end of this sentence, some insomniac in
Istanbul is typing the period at the end of some other sentence.
Who is to say which period was typed first, and why should
anyone care?
To reiterate, given that there appear to be no real-world examples
relying on IRIW, what business do we have imposing SC on future
generations that are likely to face engineering tradeoffs that
we cannot imagine, let alone plan for?
One can argue that some concurrency tooling works better given
SC, but the concurrency tooling used to validate hardware stands
as a counter-example to this line of argument.
o I agree that many papers in the literature present only
a memory-fence-free exposition of the algorithm. However, this
is not the same as a sequentially consistent exposition. In many
(perhaps all) cases, the algorithms would function on Strong CCCC
(and perhaps also Weak CCCC) just as well as they would on SC.
o It is important to note that there are locking primitives
that avoid modifying any shared cache lines in the read-only case,
for example, the case where modifications are being made only
to data "owned" by the entity doing the modifications. It is
not clear that sequentially consistent atomics provide performance
benefits in this case. A similar observation applies to the
strongly read-mostly case.
For an example of such a locking primitive, see:
http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TR-521.pdf
A very similar primitive was called "brlock" in the 2.4
Linux kernel, but was replaced by RCU in the 2.5 development
process.
o When you say that some machines provide a documented memory
model that provides the IRIW guarantee for atomics at no
additional cost, what base case are you comparing to?
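For readers who have not seen the IRIW (independent reads of independent
writes) pattern discussed above written out, here is a minimal sketch of
the litmus test. It assumes the std::atomic interface that was later
standardized in C++11, not the interface proposed in N2177, and the
names IriwResult, run_iriw_once, readers_disagree, and
count_iriw_violations are hypothetical. Two threads each write one
variable, and two other threads read both variables in opposite orders.
The outcome of interest is the one in which the two readers disagree
about which write happened first.

```cpp
#include <atomic>
#include <thread>

// Values observed by the two reader threads in one run.
struct IriwResult {
    int r1, r2;  // first reader:  x, then y
    int r3, r4;  // second reader: y, then x
};

// Runs the four-thread IRIW pattern once using sequentially consistent
// atomics (the default ordering in C++11).
IriwResult run_iriw_once() {
    std::atomic<int> x{0}, y{0};
    IriwResult r{};

    std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread rd1([&] {
        r.r1 = x.load(std::memory_order_seq_cst);
        r.r2 = y.load(std::memory_order_seq_cst);
    });
    std::thread rd2([&] {
        r.r3 = y.load(std::memory_order_seq_cst);
        r.r4 = x.load(std::memory_order_seq_cst);
    });

    w1.join();
    w2.join();
    rd1.join();
    rd2.join();
    return r;
}

// True if the first reader saw x written before y while the second
// reader saw y written before x.
bool readers_disagree(const IriwResult& r) {
    return r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0;
}

// Counts how many of `iterations` runs produced the disagreeing outcome.
int count_iriw_violations(int iterations) {
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        if (readers_disagree(run_iriw_once()))
            ++violations;
    }
    return violations;
}
```

With sequentially consistent operations the disagreeing outcome is
forbidden, so count_iriw_violations() should return zero. If the loads
and stores were weakened to acquire/release, the C++11 model would
permit the outcome, although whether it is ever observed depends on the
hardware, and failing to observe it in a test run proves nothing.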