[cpp-threads] Review comments on N2176 WRT dependency ordering

Sun Apr 8 13:18:11 BST 2007

Hello again!

I once again thank Hans for his careful description of a number of
interesting optimization situations relating to dependency-based
ordering.  The following text discusses some possible approaches
to resolving these situations.

Thoughts?

						Thanx, Paul

------------------------------------------------------------------------

Dependency Ordering

In N2176, it is argued that the compiler should be free to violate
ordering implied by data and control dependencies, presenting a
number of examples of such dependencies that are held to result in
problematic restrictions on optimizations.  We disagree, as the highly
counter-intuitive nature of these optimizations warrant extreme caution.
We should instead look for ways of supporting them that do not unduly
restrict the set of permissible optimizations.  That said, adding
something as profound as multi-threading should be expected to impose
at least a few restrictions on compiler developers.  After all, doing
so has placed restrictions on both software and hardware developers.

The purpose of this note is to present a wider range of alternatives
than have thus far been considered for preservation of dependency-based
ordering.  It is hoped that this will foster discussion that will
generate an acceptable proposal.

Dependency-Based-Ordering Alternatives

Here are a few alternatives for handling dependency-based ordering:

1.	Allow compiler optimizations to break dependency-based ordering
	willy-nilly, with no reasonable means to disable this behavior.
	This alternative appears to be preferred by the authors of N2176.

2.	Allow the compiler to break dependencies when aggressive use of
	static optimizations would naturally do so (“dynamic
	dependencies”, attributed by the authors of N2176 to Peter
	Dimov).

3.	Allow the compiler to break dependency chains that are not headed
	by a load_relaxed() member function.

4.	Prohibit compiler optimizations that break dependency-based
	ordering (either based on static dependency analysis in #1
	above or based on dynamic dependency analysis in #2 above).
	The authors of N2176 appear to believe that either of these two
	approaches would be too restrictive for compiler optimizers.

5.	Create a list of dependency-breaking optimizations that are
	permitted, and provide an “unsafe optimization option”
	that permits additional dependency-breaking optimizations when
	explicitly specified.  There have been a few verbal discussions
	on this topic, but it would also be necessary to correctly
	classify future (perhaps proprietary) optimizations, which might
	be difficult given the current state of the art.

6.	Use an existing compiler flag such as -lpthread to disable all
	dependency-breaking optimizations (again, either based on
	static dependency analysis in #1 above or based on dynamic
	dependency analysis in #2 above).  Multiple compilations of
	library functions could be provided if needed, as in fact is
	done now on some platforms providing pthreads support.

7.	Use C++ name-mangling techniques to provide for each externally
	visible function a variant that breaks dependencies and a
	variant that preserves all dependencies.  This would not be a
	problem on large-memory machines, and most small-memory machines
	are single-processor and hence could break dependencies freely.
	The usual profile-based optimizations could be used to group
	functions in order to avoid instruction-cache pressure, and a
	dynamic-linking step could be added to eject unused variants
	from the address space if need be (of course, subsequent dynamic
	linking might need to pull those variants back into the address
	space).

8.	Allow the optimizer to break dependencies except when explicitly
	told not to.  Potential ways of telling the compiler not to break
	dependencies include:

	a)	Beginning a dependency chain with a load_relaxed() member
		function (as in #3 above).

	b)	Beginning a dependency chain with a load from an atomic
		variable (though this does not appear to be very useful
		given that accesses to atomic variables are ordered
		in any case).

	c)	Using a gcc-style attribute to indicate that dependencies
		must be maintained for a given formal parameter of
		a given function, perhaps similar in style to gcc's
		__nonnull__ attribute (perhaps named something like
		__preserve_dependencies__).  For extern functions
		in programs having multiple compilation units, the
		__preserve_dependencies__ attribute must be present in
		both the declaration and the definition of the function,
		so that this attribute is visible to the compiler in
		all relevant compilation units.  For example:

		In a header file:

		extern long
		lookup(struct pl base,
		       long key __attribute((__preserve_dependencies__ (1, 2)));

		In the .c file:

		extern long
		lookup(struct pl base,
		       long key __attribute((__preserve_dependencies__ (1, 2)))
		{
		       return base[key];
		}

		Alternatively, the __preserves_dependencies__ attribute
		could be applied to the entire function rather than
		argument by argument:

		extern long
		lookup(struct pl base, long key)
			__attribute((__preserve_dependencies__));

		A similar syntax would be applied to the function definition.

	d)	Define a new storage class to indicate that dependencies
		be preserved.  This might be more controversial than an
		attribute, but might have the advantage of permitting
		the C/C++ type system to check proper usage.

	e)	Define template classes that provide compiler-specific
		directives indicating that ordering be preserved.  If
		this approach turns out to be feasible, it would permit
		type-checking without the intrusiveness that adding a
		storage class would represent.

	f)	Selected library functions will need dependency-preservation
		attributes, for example, on machines where 64-bit multiply
		and divide are performed by intrinsic library functions,
		these library functions would need variants so marked.

	Tools could then easily detect when dependencies had been broken.
	Cast-like structures or special intrinsic functions could mark a
	point in a dependency chain beyond which breakage is permissible.
	Marking the end of a dependency chain will be desirable should
	compiler writers substitute memory barriers in cases where they
	really badly wish to break the dependency chain (though such
	optimizations should be easily disabled, as they would be likely
	to do more harm than good).

9.	Like #8 above,  but require that the head of a dependency chain
	that was to be protected from the optimizer be marked with either
	an intrinsic, a cast-like structure, or a gcc-like attribute.

	r1 = preserve_dependencies__(x.load_relaxed);
	r2 = *r1;

	Doug Lea has proposed a perObjectPostLoadFence() intrinsic that
	could used as well:

	r1 = x;
	Fences.perObjectPostLoadFence(r1);
	r2 = *r1;

Of these, #8 and #9 seem to be the most promising.  #8 is used on the
examples in the following section so that the code fragments most closely
resembles their counterparts in N2176.

Dependency-Ordering Examples From N2176

N2176's first example presents simple ordering of a dereference operation:

r1 = x.load_relaxed();
r2 = *r1;

Here the dependency chain begins with a load_relaxed() member function and
is entirely contained within a single compilation unit, so is maintained.

N2176's second example generates an artificial data dependency in order
to force ordering:

r1 = x.load_relaxed();
r3 = &a + r1 - r1;
r2 = *r3;

N2176 notes that standard optimizations would result in the following,
which would violate the dependency:

r1 = x.load_relaxed();
r3 = &a;
r2 = *r3;

However, the fact that this dependency chain begins with a load_relaxed()
member function prohibits this optimization in this case.  Contrast this
with the following, where “a” is a non-atomic variable:

r1 = x.load_relaxed();
r3 = &a + r1 – r1 + a - a;
r2 = *r3;

The compiler would be permitted to optimize away the “a-a”, but
not the “r1-r1”.  If the programmer wishes the “r1-r1” to be
optimized away, the aforementioned intrinsic that marks the end of a
dependency chain could be provided.

N2176's third example shows that an innocent-seeming transformation
might convert a dependency chain that would be recognized by a given
system into a form that might not be:

r1 = x.load_relaxed();
if (r1 == 0)
	r2 = *r1;
else
	r2 = *(r1 + 1);

The innocent transformation might result in the following:

r1 = x.load_relaxed();
if (r1 == 0)
	r3 = r1;
else
	r3 = r1 + 1;
r2 = *r3;

The authors of N2176 do not specify what hardware recognizes the first
but not the second dependency chain, but assuming that there is such
hardware, either the optimization must be prohibited or the backend for
the offending machine must emit explicit memory-barrier instructions
to enforce the needed ordering.  This decision is left in the hands
of the compiler writers, who would presumably consider how common the
offending hardware was and how useful the optimization in question was.
If desired, the manufacturer of the offending system could provide tools
or documentation to help locate sections of code that exceed the system's
ability to track dependencies.

N2176's fourth example concerns duplicate code, as might be produced by
inline functions or C-preprocessor macros:

r1 = x.load_relaxed();
if (r1) {
	r2 = y.a;
} else {
	r2 = y.a;
}

In absence of the load_relaxed() member function, compiler might
reasonably collapse the above as follows, losing the dependency:

r1 = x.load_relaxed();
r2 = y.a;

The load_relaxed() member function prohibits this transformation, or,
alternatively, causes the backend to emit an explicit memory barrier in
order to enforce the ordering.

N2176's fifth example also concerns duplicate code, but where the
dependency chain spans compilation-unit boundaries:

r1 = x.load_relaxed();
if (r1) {
	f(&y);
} else {
	g(&y);
}

In this case, the dependency chain will be preserved only if
the declarations and definitions for f() and/or g() specify a
dependency-preserving attribute, in which case the compiler will know to
preserve dependencies when compiling and when invoking f() and/or g().
This same analysis applies to the examples given in the “why this
seriously breaks optimizations” section in N2176.

N2176's sixth example concerns optimizations explicitly for the purpose
of breaking dependency chains:

r2 = x.load_relaxed();
r3 = r2 -> a;

However, the fact that the dependency chain begins with a load_relaxed()
member function prohibits the problematic optimization to the following:

r2 = x.load_relaxed();
r3 = r1 -> a;
if (r1 != r2) r3 = r2 -> a;

As before, tagging functions declarations and definitions with
dependency-preserving attributes provides the compiler the information
that it needs to correctly compile dependency chains that span
compilation-unit boundaries.

N2176's seventh example concerns accesses to single-element arrays:

r1 = x.load_relaxed();
r2 = a[r1 -> index % a_size];

If a_size is known to the compiler to be zero, this could be optimized
to the following, destroying the dependency chain:

r1 = x.load_relaxed();
r2 = a[0];

However, the load_relaxed() member function would prohibit this
optimization (or, alternatively, require that an explicit memory barrier
be supplied between the two optimized statements).