[cpp-threads] Review comments on N2176
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Tue Apr 3 03:46:25 BST 2007
Hello!
First, I want to thank Hans for his careful exposition of his position
on several memory-model topics.
Here are some responses, section by section. I will provide feedback
on dependency ordering in a separate communication.
Thanx, Paul
------------------------------------------------------------------------
Why Undefined Semantics for C++ Data Races?
o Undefined semantics for C++ data races is not the status quo
for many existing compilers, as much code can and does work
on numerous architectures in the face of data races. An example
of such code would be the split counters described in N2153.
"Undefined" seems a bit strong for the status quo.
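For concreteness, a minimal sketch of the split-counter technique, assuming the usual per-thread-slot design; the names and layout here are illustrative, not taken from N2153:

```c
#define NTHREADS 4

/* One slot per thread: each thread increments only its own slot,
 * so updates never conflict with one another.  The read side sums
 * the slots without synchronization; each racy load is an aligned,
 * word-sized access and thus sees either the old or the new value. */
static unsigned long counter[NTHREADS];

static void inc_count(int tid)
{
        counter[tid]++;         /* plain store; only thread tid writes here */
}

static unsigned long read_count(void)
{
        unsigned long sum = 0;

        for (int i = 0; i < NTHREADS; i++)
                sum += counter[i];      /* benignly racy load */
        return sum;
}
```

The sum is approximate while updates are in flight, which is exactly the sort of well-understood behavior that blanket "undefined" semantics would forbid.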
o While I agree that providing explicit atomic variables is a
good and valuable thing, data races are not necessarily bugs.
Backwards-compatibility considerations will require a number
of compilers to support data races in any case.
o Error diagnostics from races would be good, but it is not
necessary to leave data-race semantics undefined to achieve
this. In many C/C++ compilers it is possible to elicit
warnings for code that is well defined, for example, in
cases where the particular code sequence has been found
to have high probability of indicating a bug.
o Annotating races via atomic variables is indeed a good thing
from the viewpoint of reducing false positives, but it is
not necessary to leave data races undefined to obtain this
benefit. Programmers who choose not to use atomic variables
simply don't gain the benefit of reductions in false positives,
while programmers who do choose to use atomics do enjoy the
possible reduction in false positives.
o I agree that some data races must indeed result in undefined
behavior, and I agree that proliferating fences is undesirable.
However, this does not mean that all data races must remain
undefined.
o I agree that general multithreaded access can result in complex
interactions that we do not want to fully describe in a standard.
o The following example is easily resolved by requiring that
variables be spilled into the stack frame.
        unsigned i = x;
        if (i < 2) {
        foo:    ...
                switch (i) {
                case 0:
                        ...;
                        break;
                case 1:
                        ...;
                        break;
                default:
                        ...;
                }
        }
Had the programmer wanted to permit refetches of variable x,
s/he could have instead written:
        if (x < 2) {
        foo:    ...
                switch (x) {
                case 0:
                        ...;
                        break;
                case 1:
                        ...;
                        break;
                default:
                        ...;
                }
        }
The variables i and x are not the same, so it seems erroneous
for the compiler to treat them as if they were in fact the
same.
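The hazard can be made concrete by modeling the refetch as a compiler transformation; the helper names below are hypothetical, for illustration only:

```c
/* classify() receives a copy of x, so a concurrent store to x
 * cannot make the default: arm reachable.  classify_refetch()
 * models the transformation a compiler would perform if allowed to
 * refetch x; the race callback stands in for a concurrent store
 * landing between the range check and the switch. */
static int classify(unsigned i)
{
        if (i < 2) {
                switch (i) {
                case 0:  return 0;
                case 1:  return 1;
                default: return -1;     /* unreachable: i read once */
                }
        }
        return 2;
}

static int classify_refetch(unsigned *xp, void (*race)(unsigned *))
{
        if (*xp < 2) {                  /* first fetch of x */
                race(xp);               /* concurrent store lands here */
                switch (*xp) {          /* refetch of x */
                case 0:  return 0;
                case 1:  return 1;
                default: return -1;     /* now reachable! */
                }
        }
        return 2;
}

static void hostile_store(unsigned *xp)
{
        *xp = 5;                        /* models another thread's store */
}
```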
o Perhaps it would be better to say that although data races can
result in undefined behavior, existing compilers should offer
backwards compatibility. The safe data races I am aware of
include:
1. Processes storing to a given aligned variable of
basic type that fits into the implementation's machine
word size, while other processes concurrently load
from such a variable. It is permissible for the set of
processes loading and the set of processes storing to
intersect, or even to coincide. Each load must return
either the initial value or the value supplied by one
of the stores.
In some cases, this will require registers to be
spilled to a temporary.
2. More complex interactions involving use of non-standard
extensions. In all of these cases, there is a
corresponding extension being proposed, so it makes
sense to leave these cases implementation-defined.
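Case 1 can be illustrated with a racy flag, sketched here using POSIX threads; that such plain accesses are single-copy atomic is exactly what rule 1 above would guarantee:

```c
#include <pthread.h>

/* An aligned variable of basic type that fits in a machine word.
 * Rule 1 says any racy load returns either the initial value or a
 * value some store actually wrote -- never a torn mixture. */
static unsigned flag;                   /* initially 0 */

static void *writer(void *arg)
{
        flag = 42;                      /* plain racy store */
        return NULL;
}

static unsigned reader_snapshot(void)
{
        return flag;                    /* plain racy load */
}
```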
Why atomics have integrated ordering constraints
o It would be more accurate to say that the atomic-operations
implementation in the Linux kernel -permits- use of explicit
memory fences.
In particular, it is important to note that the Linux-kernel
Documentation/atomic_ops.txt file is written not for the
-user- of the Linux kernel's atomic operations, but rather
for developers -implementing- the Linux kernel's atomic
operations on a new hardware platform. For example, when
this document says that the atomic_inc_return() and
atomic_dec_return() primitives "require[d] that explicit
memory barriers are performed before and after the atomic
operation calls", it does not mean that the -user- of
these primitives must insert explicit memory barriers,
but rather that the -implementors- must include such
barriers inside these primitives' implementations. In other
words, in the following code fragment:
        x = 1;
        y = atomic_inc_return(&a);
        z = 1;
the implementor of atomic_inc_return() must include whatever
memory barriers might be required by the hardware platform in
question to make it appear to other CPUs that the assignment
to x occurred first, followed by the atomic increment, followed
by the assignment to z.
To reiterate, the -user- of atomic_inc_return() need -not-
supply explicit memory barrier primitives.
In contrast, the user of atomic_inc() and atomic_dec() might
or might not need to supply explicit memory barriers, depending
on that user's algorithm.
So the Linux kernel's atomic-operation implementation is
not necessarily inconsistent with an atomic_ops package.
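A toy sketch of the implementor's side of the bargain, using GCC builtins as stand-ins for the platform's real instructions (the kernel's actual implementations are per-architecture assembly):

```c
/* Toy atomic_t and atomic_inc_return(): the *implementation*
 * supplies a memory barrier before and after the atomic update,
 * so callers need no explicit fences.  Note that
 * __sync_add_and_fetch() is itself a full barrier on most targets;
 * the explicit fences are kept only to make the before/after
 * structure visible. */
typedef struct { int counter; } atomic_t;

static int atomic_inc_return(atomic_t *v)
{
        int ret;

        __sync_synchronize();                      /* barrier before */
        ret = __sync_add_and_fetch(&v->counter, 1);
        __sync_synchronize();                      /* barrier after  */
        return ret;
}
```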
o That aside, I agree that providing variables that imply
ordering can be useful, at least when load_relaxed() and
store_relaxed() are present. My only argument (as in N2153)
is that explicit fences be provided as well.
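To make the argument concrete, here is a hypothetical pairing of N2153-style relaxed accessors with an explicit fence; the definitions are illustrative stand-ins built on GCC primitives, not a proposed API:

```c
/* Hypothetical stand-ins: load_relaxed()/store_relaxed() impose no
 * ordering of their own; fence() is the explicit barrier that pairs
 * with them.  The classic message-passing idiom then reads: */
#define fence()                 __sync_synchronize()
#define load_relaxed(p)         (*(volatile int *)(p))
#define store_relaxed(p, v)     (*(volatile int *)(p) = (v))

static int data, ready;

static void producer(int val)
{
        store_relaxed(&data, val);
        fence();                        /* order data before ready */
        store_relaxed(&ready, 1);
}

static int consumer(void)
{
        if (load_relaxed(&ready)) {
                fence();                /* order ready before data */
                return load_relaxed(&data);
        }
        return -1;                      /* flag not yet set */
}
```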
o Note that explicit fences can be optimized away in the
single-threaded case, at least outside of device-driver
and architecture-dependent code. The Linux kernel in
fact provides the smp_mb(), smp_rmb(), and smp_wmb()
primitives for this exact purpose. In this case, the
developer has signalled that the barriers are needed only
in multithreaded (or SMP in Linux-kernel parlance) builds
of the kernel.
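A sketch of that mechanism, with CONFIG_SMP playing its usual Kconfig role; __sync_synchronize() stands in for the architecture's real (and often lighter-weight) barrier instructions:

```c
/* In SMP builds the smp_*() barriers emit real ordering
 * instructions; in uniprocessor builds they collapse to mere
 * compiler barriers, since a single CPU always observes its own
 * program order. */
#ifdef CONFIG_SMP
#define smp_mb()        __sync_synchronize()
#define smp_rmb()       __sync_synchronize()
#define smp_wmb()       __sync_synchronize()
#else
#define smp_mb()        __asm__ __volatile__("" ::: "memory")
#define smp_rmb()       __asm__ __volatile__("" ::: "memory")
#define smp_wmb()       __asm__ __volatile__("" ::: "memory")
#endif
```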
o I disagree strongly with the implication that naive
programmers do not expect dependency-based ordering.
More on this in a separate communication.
o So you are arguing for sequential consistency, for which there
are no known real-world use cases, while arguing against
dependency-based ordering, for which there are a very large
number. This seems to me to be an extremely strange juxtaposition
of arguments.
I understand that Java has specified SC. However:
o Some of the people involved in that discussion have
publicly argued that it was ill-advised.
o A difference between the models seems harmless if
there are no real-world programs that care.
Should atomic operations be ordered based on dependencies?
o This section needs a more-detailed response, which I will
provide in a separate communication.
In the meantime, I very much thank the author for carefully
documenting his reasoning. This was extremely helpful!
Why ordering constraints are never limited to loads or stores
o The analysis in the "Recipient writes to object" section
implicitly assumes that writes may be performed speculatively.
This is not the case on POWER, and I was under the
impression that it is also not the case on Itanium. (And I believe that
all architectures that permit speculative writes have stronger
memory models that would prohibit the load-store reordering,
thus avoiding the posited problem.)
If I am mistaken, please let me know so that I can move the
rcu_assign_pointer() and/or the rcu_dereference() primitive from
core kernel code to architecture-dependent code. Doing this
would allow Itanium (and other architectures) to use stronger
memory ordering primitives.
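For reference, the pattern at stake, with generic stand-ins for the two primitives; the real kernel versions use smp_wmb() and per-architecture dependency-ordering guarantees:

```c
/* Publish/subscribe: the sender initializes the structure and only
 * then publishes the pointer, with a write barrier in between; the
 * recipient loads the pointer and relies on data-dependency
 * ordering to observe the initialized fields through it. */
struct foo { int a; };

#define rcu_assign_pointer(p, v) \
        do { __sync_synchronize(); (p) = (v); } while (0)

/* On most CPUs the data dependency alone orders the two loads. */
#define rcu_dereference(p)      (p)

static struct foo *gp;                  /* initially NULL */

static void publish(struct foo *newp)
{
        newp->a = 1;                    /* initialize first */
        rcu_assign_pointer(gp, newp);   /* then publish     */
}

static int subscribe(void)
{
        struct foo *p = rcu_dereference(gp);

        return p ? p->a : -1;           /* dependent load   */
}
```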
o Another important case omitted from this analysis is when
the recipient performs an atomic operation on x.a. In this
case a store-store barrier on the part of the sender suffices.
Treatment of out-of-thin-air values
o You argue that infeasibility of dependency ordering motivates
permitting out-of-thin-air results. Suppose that there were
a reasonable way of permitting dependency ordering. Would
that affect your view on out-of-thin-air results?