[cpp-threads] Review comments on N2177

Wed Apr 4 00:44:31 BST 2007

First, I want to thank Hans for his careful exposition of his position
on several atomic-variable topics.

Here are some responses, section by section.

						Thanx, Paul

------------------------------------------------------------------------

Simple rules for programmers

o	The lack of any real-world code that depends on sequential
	consistency should give us pause.  We cannot predict what
	tradeoffs hardware manufacturers will face 10 years from
	now, so it seems foolish to impose restrictions that are not
	firmly based on current practice.  However, this topic cannot
	be fully addressed in a single email.  More on this later.

o	The assertion that atomic variables should be used by
	a significant number of programmers assumes (I believe
	incorrectly) that most uses cannot be subsumed into
	higher-level APIs.  The productivity and understandability
	benefits of such higher-level APIs cannot be overstated.

Use of atomics

o	Simple counters can be implemented safely as split counters
	(either per-task or per-CPU, depending on the environment)
	without any kind of synchronization, as noted in N2153.

	Global simple counters fall into the first class of definable
	data races in my earlier email.  In other words, I believe that
	we can freely choose whether or not this use is defined.
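
	For concreteness, a minimal sketch of a per-thread split counter
	follows.  The names (SplitCounter, Slot, kMaxThreads,
	run_split_counter) and the 64-byte cache-line size are assumptions
	made for illustration, not taken from N2153.  The sketch uses the
	std::atomic spelling with relaxed ordering so that a concurrent
	reader would not formally race with the owning thread; no fences
	or atomic read-modify-write operations are involved.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kMaxThreads = 8;  // hypothetical fixed upper bound

// One counter per thread, padded to an assumed 64-byte cache line
// so that threads do not falsely share a line.
struct alignas(64) Slot {
    std::atomic<unsigned long> value{0};
};

struct SplitCounter {
    std::array<Slot, kMaxThreads> slots;

    // Only the owning thread ever writes slots[id], so a relaxed
    // load followed by a relaxed store is enough: no atomic RMW, no fence.
    void inc(std::size_t id) {
        assert(id < kMaxThreads);
        auto& v = slots[id].value;
        v.store(v.load(std::memory_order_relaxed) + 1,
                std::memory_order_relaxed);
    }

    // Sums all slots.  While writers are running this is only an
    // approximate snapshot; it is exact once they have been joined.
    unsigned long read() const {
        unsigned long sum = 0;
        for (const auto& s : slots)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }
};

// Runs nthreads writers, each incrementing its own slot per_thread times,
// then returns the total after all writers have been joined.
unsigned long run_split_counter(std::size_t nthreads, unsigned long per_thread) {
    assert(nthreads <= kMaxThreads);
    SplitCounter c;
    std::vector<std::thread> ts;
    for (std::size_t id = 0; id < nthreads; ++id)
        ts.emplace_back([&c, id, per_thread] {
            for (unsigned long i = 0; i < per_thread; ++i) c.inc(id);
        });
    for (auto& t : ts) t.join();
    return c.read();
}
```

	For example, run_split_counter(4, 100000) returns 400000, because
	the final read happens after every writer has been joined.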

o	I agree that garbage collection is helpful when using RCU-like
	atomic pointer updates.  The Linux-kernel implementation of
	RCU in fact supplies functionality that can be (very loosely)
	thought of as a garbage collector.  This functionality can
	be implemented at user level, and has in fact been prototyped
	at user level for benchmarking purposes.

	That said, as far as I know, there are no user-level
	implementations of RCU currently used in production.

	Please note that RCU is an example of a higher-level API.  Developers
	using this API need not use explicit memory barriers, because
	the needed barriers are subsumed into the primitives making
	up the RCU API.
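
	The point about barriers being subsumed into the API can be
	illustrated with a very rough RCU-like sketch.  This is not the
	Linux-kernel RCU API; Published, Config, update(), read(), and
	run_rcu_like() are hypothetical names, a single updater is assumed,
	and reclamation is replaced by the crudest possible stand-in for
	a garbage collector: every old version is kept alive until the
	whole object is destroyed.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

struct Config { int a; int b; };  // hypothetical payload; invariant: b == 2 * a

class Published {
    std::atomic<const Config*> current_;
    // Touched only by the single updater.  Holding every version here
    // stands in for deferred reclamation (no version is freed early).
    std::vector<std::unique_ptr<const Config>> versions_;
public:
    Published() : current_(nullptr) { update(0); }

    // Updater side: build a complete new version, then publish it.
    // The release store is the only ordering the updater needs.
    void update(int a) {
        versions_.emplace_back(new Config{a, 2 * a});
        current_.store(versions_.back().get(), std::memory_order_release);
    }

    // Reader side: the acquire load is hidden inside the primitive,
    // so callers write no explicit barriers.
    const Config* read() const {
        return current_.load(std::memory_order_acquire);
    }
};

// One reader checks the invariant while the updater publishes new versions.
// Returns true if the reader never saw a partially initialized Config.
bool run_rcu_like(int updates) {
    Published p;
    std::atomic<bool> ok{true};
    std::atomic<bool> done{false};
    std::thread reader([&] {
        while (!done.load(std::memory_order_acquire)) {
            const Config* c = p.read();
            if (c->b != 2 * c->a) ok.store(false, std::memory_order_relaxed);
        }
    });
    for (int i = 1; i <= updates; ++i) p.update(i);
    done.store(true, std::memory_order_release);
    reader.join();
    return ok.load();
}
```

	The reader loop in run_rcu_like() contains no memory barriers of
	its own; whatever ordering is needed lives inside read() and
	update().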

Performance of Sequentially Consistent Atomics

o	Acquire/release semantics for atomics is OK as long as there
	is a load_relaxed()/store_relaxed() mechanism that can be used
	when needed.
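
	As a sketch of how the two kinds of access can coexist, the
	following uses the std::atomic spelling, in which the relaxed
	accesses are written with std::memory_order_relaxed rather than
	load_relaxed()/store_relaxed().  The function name and the
	payload value are illustrative assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing: the flag carries the acquire/release ordering,
// while the payload itself is accessed with relaxed operations.
int run_message_passing() {
    std::atomic<int> payload{0};
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread consumer([&] {
        // Acquire load pairs with the producer's release store below.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        // Relaxed is sufficient here: the acquire above already orders
        // this load after the producer's store to payload.
        seen = payload.load(std::memory_order_relaxed);
    });

    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);

    consumer.join();
    return seen;
}
```

	The consumer is guaranteed to read 42, even though neither access
	to payload carries any ordering of its own.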

o	I most certainly agree that the IRIW property is by far the most
	controversial.  ;-)

	I am not yet convinced of the perceived teachability benefits
	of IRIW.  My experience has been that most people intuitively
	understand that events that take place at about the same time but
	at different locations (e.g., different CPUs) might be difficult
	to sort into a globally agreed-on order.

	People who have dealt with time synchronization, either in
	clusters of computers or in real life, have an especially easy
	time grasping this point.  Yes, atomic clocks have existed for
	some decades, but most people don't have access to them.  As I
	type the period at the end of this sentence, some insomniac in
	Istanbul is typing the period at the end of some other sentence.
	Who is to say which period was typed first, and why should
	anyone care?

	To reiterate, given that there appear to be no real-world examples
	relying on IRIW, what business do we have imposing SC on future
	generations that are likely to face engineering tradeoffs that
	we cannot imagine, let alone plan for?

	One can argue that some concurrency tooling works better given
	SC, but the concurrency tooling used to validate hardware stands
	as a counter-example to this line of argument.
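
	For readers who have not seen it, the IRIW (independent reads of
	independent writes) test can be sketched as below.  The function
	names and the choice to pass the orderings as parameters are
	assumptions made for illustration.  With sequentially consistent
	operations the "readers disagree" outcome is forbidden; with
	release stores and acquire loads the language permits it, though
	whether it is ever observed depends on the hardware, and
	thread-startup skew makes it rare in a harness this simple.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One IRIW trial.  Two writers store to x and y independently; two
// readers load them in opposite orders.  Returns true if the readers
// disagree about which write happened first.
// st must be a valid store ordering and ld a valid load ordering.
bool iriw_once(std::memory_order st, std::memory_order ld) {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

    std::thread w1([&] { x.store(1, st); });
    std::thread w2([&] { y.store(1, st); });
    std::thread rd1([&] { r1 = x.load(ld); r2 = y.load(ld); });
    std::thread rd2([&] { r3 = y.load(ld); r4 = x.load(ld); });

    w1.join(); w2.join(); rd1.join(); rd2.join();

    // rd1 saw x written but not y; rd2 saw y written but not x.
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}

// Repeats the trial and counts how often the readers disagreed.
int count_iriw_disagreements(int trials, std::memory_order st,
                             std::memory_order ld) {
    int n = 0;
    for (int i = 0; i < trials; ++i)
        n += iriw_once(st, ld) ? 1 : 0;
    return n;
}
```

	Calling count_iriw_disagreements() with std::memory_order_seq_cst
	for both orderings must return zero.  A zero count with
	std::memory_order_release and std::memory_order_acquire proves
	nothing either way.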

o	I agree that many papers in the literature present only
	a memory-fence-free exposition of the algorithm.  However, this
	is not the same as a sequentially consistent exposition.  In many
	(perhaps all) cases, the algorithms would function on Strong CCCC
	(and perhaps also Weak CCCC) just as well as they would on SC.

o	It is important to note that there are locking primitives
	that avoid modifying any shared cache lines in the read-only case,
	for example, the case where modifications are being made only
	to data "owned" by the entity doing the modifications.  It is
	not clear that sequentially consistent atomics provide performance
	benefits in this case.  A similar observation applies to the
	strongly read-mostly case.

	For an example of such a locking primitive, see:

	http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TR-521.pdf

	A very similar primitive was called "brlock" in the 2.4
	Linux kernel, but was replaced by RCU in the 2.5 development
	process.
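
	To give a flavor of this kind of primitive, here is a simplified
	sketch.  It is neither the algorithm from the technical report
	above nor the Linux brlock code; BigReaderLock, PaddedMutex,
	kReaders, and run_brlock() are hypothetical names, and the 64-byte
	padding is an assumed cache-line size.  Each reader acquires only
	its own lock, so readers never write a cache line that another
	reader touches; a writer must acquire every reader's lock.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t kReaders = 4;  // hypothetical fixed reader count

// One mutex per reader, padded so each sits on its own cache line.
struct alignas(64) PaddedMutex { std::mutex m; };

class BigReaderLock {
    std::array<PaddedMutex, kReaders> locks_;
public:
    // Readers touch only their own slot.
    void read_lock(std::size_t id)   { assert(id < kReaders); locks_[id].m.lock(); }
    void read_unlock(std::size_t id) { assert(id < kReaders); locks_[id].m.unlock(); }
    // Writers take every slot, always in the same order to avoid deadlock
    // between concurrent writers.
    void write_lock()   { for (auto& l : locks_) l.m.lock(); }
    void write_unlock() { for (auto& l : locks_) l.m.unlock(); }
};

// The writer keeps a == b under the write lock; each reader checks that
// invariant under its own read lock.  Returns true if no reader saw a != b
// and the final value matches the number of writes.
bool run_brlock(int writes, int reads_per_reader) {
    BigReaderLock lk;
    int a = 0, b = 0;                    // plain data protected by lk
    std::array<bool, kReaders> ok{};     // one result per reader
    std::vector<std::thread> readers;
    for (std::size_t id = 0; id < kReaders; ++id)
        readers.emplace_back([&, id] {
            bool good = true;
            for (int i = 0; i < reads_per_reader; ++i) {
                lk.read_lock(id);
                if (a != b) good = false;
                lk.read_unlock(id);
            }
            ok[id] = good;               // each reader writes only its own element
        });
    for (int i = 0; i < writes; ++i) {
        lk.write_lock();
        ++a;
        ++b;
        lk.write_unlock();
    }
    for (auto& t : readers) t.join();
    bool all_ok = (a == writes) && (b == writes);
    for (bool g : ok) all_ok = all_ok && g;
    return all_ok;
}
```

	The cost of this design is that writes become expensive in
	proportion to the number of readers, which is acceptable only in
	the read-mostly cases discussed above.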

o	When you say that some machines provide a documented memory
	model that provides the IRIW guarantee for atomics at no
	additional cost, what base case are you comparing to?