[cpp-threads] Belated comments on dependency-based ordering proposal

Paul E. McKenney paulmck at linux.vnet.ibm.com
Wed Sep 19 18:31:49 BST 2007


On Tue, Sep 18, 2007 at 09:46:10PM -0000, Boehm, Hans wrote:
> > From: Paul E. McKenney
> > Sent: Friday, September 14, 2007 10:19 PM
> > On Fri, Sep 14, 2007 at 09:02:24PM -0700, Hans Boehm wrote:
> > > You're describing what I thought we were going to do.  But N2360
> > > states:
> > > 
> > > "Next, we define "dependency tree" (or, perhaps more accurately,
> > > attempt to do so):
> > > 
> > > An evaluation H that performs a dependency operation on an object M
> > > heads a dependency tree, and is also a member of that dependency tree.
> > > Any subsequent evaluation that consumes the result of a prior
> > > evaluation that is a member of H's dependency tree is also a member of
> > > H's dependency tree, unless that prior evaluation is an unannotated
> > > formal parameter or is a return value from a function whose function
> > > prototype specifies an unannotated return value.  [Note: dependency
> > > annotations are described in N2361.]  [Note: a given evaluation can
> > > be a member of multiple dependency trees.]"
> > > 
> > > I read this as saying that dependencies don't flow into unannotated 
> > > function calls.  Perhaps Paul can explain the intent?
> > 
> > I was indeed drawing the line at function calls in this document.
> > You are suggesting that the line instead be drawn at function 
> > calls crossing compilation units?  It might be possible to 
> > talk me into that -- my main concern was that this could get 
> > sloppy if compilers get more aggressive about whole-program 
> > compilation.
> > 
> > Thoughts?
>
> I think the most fundamental differences here are:
> 
> 1) Do dependency trees extend past the return of the function that
> performed the original atomic load if, for example, a dependent value is
> stored in the heap or returned?  I'd be inclined to say no.

Agreed, at least in the absence of annotations as described in N2361.
Even with N2361, if dependency trees are to extend past the return
of the function that performed the original atomic load, then that
function's return type must be annotated, both in the definition and in
all declarations.
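
As a concrete sketch of what that annotation requirement might look
like, using the [[dependence_propagate]] spelling from N2361 (this is
hypothetical proposed syntax, and find_node, table, and hash are
invented names; no shipped compiler accepts any of this):

```cpp
// Declaration visible to all callers, annotated:
[[dependence_propagate]] node *find_node(int key);

// Definition, annotated identically:
[[dependence_propagate]] node *find_node(int key)
{
	// The dependency chain headed by this load may extend past the
	// return only because the return value is annotated above.
	return table[hash(key)].load(memory_order_dependency);
}
```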

> 2) What happens if the dependency tree extends into code regions the
> compiler can't see?  If you assume the answer to the previous question
> is no, then I think there are two plausible answers:
> 
> 	A) The dependency tree is truncated there.  I think this is the
> N2360 approach.

Agreed, N2361 annotations would be required to extend the dependency
tree into some other compilation unit, and these annotations would
have to be visible from all relevant compilation units.

> 	B) The compiler generates an acquire load to preserve the
> ordering anyway.   I think Lawrence and I are arguing for this version.

Ah!  I was still thinking of this as a trivial implementation as opposed
to a component of an alternative implementation strategy.  The idea is
as follows, correct?

o	If there are no N2361 annotations, the compiler must emit
	appropriate memory fences when control leaves the compilation
	unit.  The code in the other compilation unit will therefore
	work correctly even in the presence of local dependency-breaking
	optimizations.

o	If there are N2361 annotations, then the compiler can allow
	the dependency chain to cross compilation-unit boundaries
	via the annotated function arguments and return values.

o	There would then need to be some way to kill a dependency chain
	in order to avoid gratuitous memory fences.  Your
	ignore_dependency() template below would be one approach.
	Another approach would be another set of annotations for
	function arguments and return values.  The ignore_dependency()
	approach seems to cover all the possibilities, so is where I
	believe that we should start.

> Approach (B) has the huge advantage that you don't have to know whether
> the operator[] you're using is built-in or a library call.  They still
> behave the same.  The semantics are moderately clear.  The only
> surprise factor is the possible introduction of unexpected fences.  But
> that could be handled with warnings on platforms for which it matters.

Makes sense.

> I think approach (B) doesn't work if the answer to the first question is
> yes, since a lot of function results will depend on a load performed in
> the function.  But that's probably OK.  We may eventually need an
> annotation stating that a function call should be treated as though it
> were a dependency-ordered load, so that it becomes possible to do such
> loads in function calls, rather than directly or in macros.

Yep, in that case, we would use N2361 annotations, which would allow
the compiler to keep these cases separate.

> In either case, we may want
> 
> template<class T> T ignore_dependency(T x) { return (x); }
> 
> which behaves like the identity function, but is guaranteed not to
> propagate dependencies.

And this would need explicit compiler support -- the kill_dependency_chain()
called out in N2361 relied on lack of annotations to make this work.

So, is it better to have a special template function that has special
semantics, or should we instead define a [[dependence_ignore]] attribute
from which ignore_dependency() can be constructed:

template<class T> [[dependence_ignore]] T ignore_dependency(T x) { return (x); }

Then we have the following cases:

o	An unannotated argument that is a member of a dependency chain
	causes the compiler to emit a memory fence.

o	An argument annotated with N2361 [[dependence_propagate]]
	causes the compiler to extend the dependency chain across
	a compilation-unit boundary.

o	An argument annotated with a new N2361 [[dependence_ignore]]
	would neither emit a memory fence nor extend the dependency
	chain across a compilation-unit boundary.

o	An unannotated return value that is a member of a dependency
	chain would neither emit a memory fence nor extend the dependency
	chain across a compilation-unit boundary.  Note that the default
	differs from that for arguments -- in the default case discussed
	above, the naively-written library function is protected by the
	caller, so the function itself need do nothing upon return.

	However, the other naive case is where the naive function calls
	another function that explicitly pushes the dependency chain
	out through its return value.  This could happen in cases where
	there are wrapper functions that are invoked due to implicit
	conversions (these still can happen in C++, right?).  In this
	case, the called function would have annotated its return value.
	It seems to me that the compiler would have to distinguish
	between these two cases: terminating the dependency chain if the
	head of the chain is an atomic load that is lexically within the
	function, and propagating it (via an explicit memory fence if
	need be) if the local head of the chain is instead an annotated
	return value from a called function.

o	A return value annotated with N2361 [[dependence_propagate]]
	causes the compiler to extend the dependency chain across
	a compilation-unit boundary.

o	A return value annotated with a new N2361 [[dependence_ignore]]
	would neither emit a memory fence nor extend the dependency
	chain across a compilation-unit boundary.  This is the same
	as the default.

o	If the head of the dependency chain is an atomic load that is
	lexically contained within the function in question, and if
	the programmer wants to maintain ordering, but not to propagate
	the dependency chain, then the programmer should insert an
	explicit memory fence before the return.  (Or do we want
	another annotation that causes the compiler to implicitly
	place the memory fences, perhaps omitting them on code paths
	that have other memory fences already in place?)

Does this approach seem reasonable?

> > > >> I thought we had tentatively concluded in Toronto that all "data
> > > >> dependent" loads occurring as a result of the execution of f()
> > > >> should be in the dependency tree anyway, no matter whether they
> > > >> occurred as a result of a nested function call or not.  If the
> > > >> value flows into a separately compiled function, the compiler has
> > > >> to back off and treat memory_order_dependency as
> > > >> memory_order_acquire.  This may add unexpected overhead in some
> > > >> cases.  But examples like the above will be guaranteed to work as
> > > >> expected.
> > > >>
> > > >> But I think this also isn't reflected here?
> > > >>
> > > >> My guess is that the N2360 approach (if I understood it correctly)
> > > >> might be viable for C, where function calls are obvious in the
> > > >> source.  But too many things in C++ are a function call behind the
> > > >> scenes.  I'm not very comfortable with an approach that treats
> > > >> v[i] differently depending on whether v is a C int[10] or a
> > > >> C++ std::array<int, 10>.
> > > >
> > > > If the std::array index is inlined, the compiler has full
> > > > knowledge of the dependences, and one gets a performance win.  If
> > > > the index is not inlined, the performance drops.  But that is true
> > > > anyway.  I am willing to live with a bit of performance
> > > > uncertainty in C++ codes, in part because we already have it.
> > > > More importantly, I think, is that programmers are unlikely to go
> > > > to dependence-based ordering without serious experimental evidence
> > > > and lots of other tuning, which may very well include careful
> > > > management of the inlining process.
> > > Again, I thought that was the plan, but I don't read the current 
> > > proposal that way.
> > > 
> > > >
> > > >> I think we'll encounter many more problems along these lines,
> > > >> e.g. also with loads in destructors.
> > > >
> > > > Can you elaborate?
> > > >
> > > >> I suspect we need more work defining what it means to "consume
> > > >> the result of a prior evaluation".  But someone from the C++
> > > >> standards committee core group will have to help there.
> > > >
> > > > Hm.  The pragmatic meaning is clear, which is that the C++ lowered
> > > > to C uses the value.  The standard definition is probably a bit
> > > > fuzzier.
> > >
> > > There are no doubt borderline cases, especially with library
> > > functions.  Does strlen propagate a dependency from one of the
> > > characters?  Does a long long comparison depend on all parts of the
> > > numbers even if the high halves are different?
> > 
> > My approach on strlen() would be to say "No, because strlen 
> > has neither its arguments nor its return value annotated."  
> > If one open-coded strlen() inline, then yes, the dependency 
> > would propagate through the characters
> > -- but for this to make any sense, there would have to have 
> > been a dependency chain (with its head explicitly marked) 
> > that propagated to one of the characters, for example (taking 
> > care to avoid the Vector
> > class):
> > 
> > 	*(cp + 1) = '\0';
> > 	i = 0;
> > 	r1 = x.load(memory_order_dependency);
> > 	*cp = r1;
> > 	for (cp1 = cp; *cp1 != '\0'; cp1++, i++)
> > 		continue;
> > 	r2 = *(y + i);
> > 
> > The load into r2 will be ordered after the load into r1 by 
> > dependency ordering.
>
> But the dependency chain would still involve a control dependency,
> right?  Thus under the current proposal, there would be no ordering?

There is a data dependency independent of the path that the for-loop
traverses, because the for-loop accesses *cp unconditionally.  Thus,
ordering does not depend on control flow in this case.

> We clearly have to revisit this with approach (B) above.  If we stick
> with the data dependency-only approach, my take on it would be to say
> that only explicitly listed library functions (incl. most or all of the
> operators) propagate dependencies.  In most other cases, we really have
> no idea whether the library implementation uses data or control
> dependencies.

The above example does not support your point, but there might well
be examples that do.  For example, consider a slight modification of
the above:

	*(cp + 2) = '\0';
	i = 0;
	r1 = x.load(memory_order_dependency);
	*(cp + 1) = r1;
	for (cp1 = cp; *cp1 != '\0'; cp1++, i++)
		continue;
	r2 = *(y + i);

If *cp == '\0', there is no data dependency; otherwise there is.
My kneejerk reaction is that a programmer who gets ordering only
sometimes in such a case is simply getting what he or she deserves,
but there might be better examples.  Thoughts?

> > In the long-long case, the most interesting situation is on a 
> > 32-bit machine.  For this to be interesting, the earlier code 
> > must have had a marked dependency head that propagated to the 
> > upper half of the long long, but not to the lower half.  So 
> > you are thinking of something like the following, where ll is 
> > a long long that has been promoted to a register or something?
> > 
> > 	r1 = x.load(memory_order_dependency);
> > 	ll += ((long long)r1) << 32;
> > 	r2 = *(y + (ll & ((1LL << 32) - 1)));
> > 
> > If the instances of "32" were instead "31", there would 
> > clearly be a dependency.  For the sake of the programmer's 
> > sanity, I would argue for the above also carrying a 
> > dependency, so that the load into
> > r2 would be dependency-ordered after the load into r1.
> > 
> > Seem reasonable?
>
> Maybe.  We can change the example to
> 
>  	r1 = x.load(memory_order_dependency);
>  	ll += ((0x1 << 32) & r1);
>  	r2 = (ll != 0);  // Comparison checks lowest half first;
>  	                 // lowest half is nonzero
> 	r3 = y[r2];
> 
> Now I suspect that some compilers would generate code that in fact has
> no data dependency from r1 to r2, though I haven't verified that.

In this case, I would argue that the compiler needs to either emit an
explicit memory fence or suppress the optimization that avoids writing
the low half of ll, due to the presence of the dependency chain.

							Thanx, Paul

> Hans
> > 
> > 						Thanx, Paul
> > 
> > --
> > cpp-threads mailing list
> > cpp-threads at decadentplace.org.uk
> > http://www.decadentplace.org.uk/cgi-bin/mailman/listinfo/cpp-threads
> > 
