[cpp-threads] Review comments on N2176 WRT dependency ordering
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Sun Apr 8 13:18:11 BST 2007
Hello again!
I once again thank Hans for his careful description of a number of
interesting optimization situations relating to dependency-based
ordering. The following text discusses some possible approaches
to resolving these situations.
Thoughts?
Thanx, Paul
------------------------------------------------------------------------
Dependency Ordering
In N2176, it is argued that the compiler should be free to violate
ordering implied by data and control dependencies, presenting a
number of examples of such dependencies that are held to result in
problematic restrictions on optimizations. We disagree, as the highly
counter-intuitive nature of these optimizations warrant extreme caution.
We should instead look for ways of supporting them that do not unduly
restrict the set of permissible optimizations. That said, adding
something as profound as multi-threading should be expected to impose
at least a few restrictions on compiler developers. After all, doing
so has placed restrictions on both software and hardware developers.
The purpose of this note is to present a wider range of alternatives
than have thus far been considered for preservation of dependency-based
ordering. It is hoped that this will foster discussion that will
generate an acceptable proposal.
Dependency-Based-Ordering Alternatives
Here are a few alternatives for handling dependency-based ordering:
1. Allow compiler optimizations to break dependency-based ordering
willy-nilly, with no reasonable means to disable this behavior.
This alternative appears to be preferred by the authors of N2176.
2. Allow the compiler to break dependencies when aggressive use of
static optimizations would naturally do so (“dynamic
dependencies”, attributed by the authors of N2176 to Peter
Dimov).
3. Allow the compiler to break dependency chains that are not headed
by a load_relaxed() member function.
4. Prohibit compiler optimizations that break dependency-based
ordering (either based on static dependency analysis in #1
above or based on dynamic dependency analysis in #2 above).
The authors of N2176 appear to believe that either of these two
approaches would be too restrictive for compiler optimizers.
5. Create a list of dependency-breaking optimizations that are
permitted, and provide an “unsafe optimization option”
that permits additional dependency-breaking optimizations when
explicitly specified. There have been a few verbal discussions
on this topic, but it would also be necessary to correctly
classify future (perhaps proprietary) optimizations, which might
be difficult given the current state of the art.
6. Use an existing compiler flag such as -lpthread to disable all
dependency-breaking optimizations (again, either based on
static dependency analysis in #1 above or based on dynamic
dependency analysis in #2 above). Multiple compilations of
library functions could be provided if needed, as in fact is
done now on some platforms providing pthreads support.
7. Use C++ name-mangling techniques to provide for each externally
visible function a variant that breaks dependencies and a
variant that preserves all dependencies. This would not be a
problem on large-memory machines, and most small-memory machines
are single-processor and hence could break dependencies freely.
The usual profile-based optimizations could be used to group
functions in order to avoid instruction-cache pressure, and a
dynamic-linking step could be added to eject unused variants
from the address space if need be (of course, subsequent dynamic
linking might need to pull those variants back into the address
space).
8. Allow the optimizer to break dependencies except when explicitly
told not to. Potential ways of telling the compiler not to break
dependencies include:
a) Beginning a dependency chain with a load_relaxed() member
function (as in #3 above).
b) Beginning a dependency chain with a load from an atomic
variable (though this does not appear to be very useful
given that accesses to atomic variables are ordered
in any case).
c) Using a gcc-style attribute to indicate that dependencies
must be maintained for a given formal parameter of
a given function, perhaps similar in style to gcc's
__nonnull__ attribute (perhaps named something like
__preserve_dependencies__). For extern functions
in programs having multiple compilation units, the
__preserve_dependencies__ attribute must be present in
both the declaration and the definition of the function,
so that this attribute is visible to the compiler in
all relevant compilation units. For example:
In a header file:
extern long
lookup(struct pl base,
long key __attribute((__preserve_dependencies__ (1, 2)));
In the .c file:
extern long
lookup(struct pl base,
long key __attribute((__preserve_dependencies__ (1, 2)))
{
return base[key];
}
Alternatively, the __preserves_dependencies__ attribute
could be applied to the entire function rather than
argument by argument:
extern long
lookup(struct pl base, long key)
__attribute((__preserve_dependencies__));
A similar syntax would be applied to the function definition.
d) Define a new storage class to indicate that dependencies
be preserved. This might be more controversial than an
attribute, but might have the advantage of permitting
the C/C++ type system to check proper usage.
e) Define template classes that provide compiler-specific
directives indicating that ordering be preserved. If
this approach turns out to be feasible, it would permit
type-checking without the intrusiveness that adding a
storage class would represent.
f) Selected library functions will need dependency-preservation
attributes, for example, on machines where 64-bit multiply
and divide are performed by intrinsic library functions,
these library functions would need variants so marked.
Tools could then easily detect when dependencies had been broken.
Cast-like structures or special intrinsic functions could mark a
point in a dependency chain beyond which breakage is permissible.
Marking the end of a dependency chain will be desirable should
compiler writers substitute memory barriers in cases where they
really badly wish to break the dependency chain (though such
optimizations should be easily disabled, as they would be likely
to do more harm than good).
9. Like #8 above, but require that the head of a dependency chain
that was to be protected from the optimizer be marked with either
an intrinsic, a cast-like structure, or a gcc-like attribute.
r1 = preserve_dependencies__(x.load_relaxed);
r2 = *r1;
Doug Lea has proposed a perObjectPostLoadFence() intrinsic that
could used as well:
r1 = x;
Fences.perObjectPostLoadFence(r1);
r2 = *r1;
Of these, #8 and #9 seem to be the most promising. #8 is used on the
examples in the following section so that the code fragments most closely
resembles their counterparts in N2176.
Dependency-Ordering Examples From N2176
N2176's first example presents simple ordering of a dereference operation:
r1 = x.load_relaxed();
r2 = *r1;
Here the dependency chain begins with a load_relaxed() member function and
is entirely contained within a single compilation unit, so is maintained.
N2176's second example generates an artificial data dependency in order
to force ordering:
r1 = x.load_relaxed();
r3 = &a + r1 - r1;
r2 = *r3;
N2176 notes that standard optimizations would result in the following,
which would violate the dependency:
r1 = x.load_relaxed();
r3 = &a;
r2 = *r3;
However, the fact that this dependency chain begins with a load_relaxed()
member function prohibits this optimization in this case. Contrast this
with the following, where “a” is a non-atomic variable:
r1 = x.load_relaxed();
r3 = &a + r1 – r1 + a - a;
r2 = *r3;
The compiler would be permitted to optimize away the “a-a”, but
not the “r1-r1”. If the programmer wishes the “r1-r1” to be
optimized away, the aforementioned intrinsic that marks the end of a
dependency chain could be provided.
N2176's third example shows that an innocent-seeming transformation
might convert a dependency chain that would be recognized by a given
system into a form that might not be:
r1 = x.load_relaxed();
if (r1 == 0)
r2 = *r1;
else
r2 = *(r1 + 1);
The innocent transformation might result in the following:
r1 = x.load_relaxed();
if (r1 == 0)
r3 = r1;
else
r3 = r1 + 1;
r2 = *r3;
The authors of N2176 do not specify what hardware recognizes the first
but not the second dependency chain, but assuming that there is such
hardware, either the optimization must be prohibited or the backend for
the offending machine must emit explicit memory-barrier instructions
to enforce the needed ordering. This decision is left in the hands
of the compiler writers, who would presumably consider how common the
offending hardware was and how useful the optimization in question was.
If desired, the manufacturer of the offending system could provide tools
or documentation to help locate sections of code that exceed the system's
ability to track dependencies.
N2176's fourth example concerns duplicate code, as might be produced by
inline functions or C-preprocessor macros:
r1 = x.load_relaxed();
if (r1) {
r2 = y.a;
} else {
r2 = y.a;
}
In absence of the load_relaxed() member function, compiler might
reasonably collapse the above as follows, losing the dependency:
r1 = x.load_relaxed();
r2 = y.a;
The load_relaxed() member function prohibits this transformation, or,
alternatively, causes the backend to emit an explicit memory barrier in
order to enforce the ordering.
N2176's fifth example also concerns duplicate code, but where the
dependency chain spans compilation-unit boundaries:
r1 = x.load_relaxed();
if (r1) {
f(&y);
} else {
g(&y);
}
In this case, the dependency chain will be preserved only if
the declarations and definitions for f() and/or g() specify a
dependency-preserving attribute, in which case the compiler will know to
preserve dependencies when compiling and when invoking f() and/or g().
This same analysis applies to the examples given in the “why this
seriously breaks optimizations” section in N2176.
N2176's sixth example concerns optimizations explicitly for the purpose
of breaking dependency chains:
r2 = x.load_relaxed();
r3 = r2 -> a;
However, the fact that the dependency chain begins with a load_relaxed()
member function prohibits the problematic optimization to the following:
r2 = x.load_relaxed();
r3 = r1 -> a;
if (r1 != r2) r3 = r2 -> a;
As before, tagging functions declarations and definitions with
dependency-preserving attributes provides the compiler the information
that it needs to correctly compile dependency chains that span
compilation-unit boundaries.
N2176's seventh example concerns accesses to single-element arrays:
r1 = x.load_relaxed();
r2 = a[r1 -> index % a_size];
If a_size is known to the compiler to be zero, this could be optimized
to the following, destroying the dependency chain:
r1 = x.load_relaxed();
r2 = a[0];
However, the load_relaxed() member function would prohibit this
optimization (or, alternatively, require that an explicit memory barrier
be supplied between the two optimized statements).
More information about the cpp-threads
mailing list