[cpp-threads] Review comments on N2176 WRT dependency ordering

Wed Apr 18 06:50:44 BST 2007

On Wed, Apr 18, 2007 at 02:17:15AM +0300, Peter Dimov wrote:
> Paul E. McKenney wrote:
> >On Wed, Apr 18, 2007 at 12:32:02AM +0300, Peter Dimov wrote:
> >>Paul E. McKenney wrote:
> >>>On Tue, Apr 17, 2007 at 02:38:29PM +0300, Peter Dimov wrote:
> >>
> >>>>In the dependency discussion paper, you write that a single level
> >>>>of indirection is not enough for the Linux kernel, but I don't see
> >>>>how you could make a multi-level primitive work on an Alpha.
> >>>
> >>>This depends on how the multi-level list was created. [...]
> >>
> >>Yes, you are right. atomic_load_address will probably always work in
> >>practice for such dependencies, but I'm not sure that I can specify
> >>it so that it will also work in theory.
> >
> >I would like to help here, but need to understand what is broken in
> >theory.  The dependency chains seem to me to be pretty well defined,
> >particularly given that their heads are marked and that any passage of
> >the chain to a different compilation unit is also marked in both the
> >function prototype and the function definition.
> >
> >So, what am I missing here?
> 
> Actually, on third reading, it will work if the dependencies are tagged and 
> tracked as you suggest in your paper. I've no idea how hard this would be 
> to implement. :-)

;-)

> I also noticed that you don't seem to distinguish between data and code 
> dependencies... is that on purpose? My understanding is that PowerPC 
> enforces data dependencies for both loads and stores but does not enforce 
> code dependencies for loads?

I am personally much more concerned about data dependencies than code
dependencies, but others are worried about code dependencies as well.
PowerPC can indeed perform loads speculatively (as long as they are
not data-dependent on some earlier value).  From what I hear, ARM does
not enforce code dependencies at all, and Itanium enforces at least some
code dependencies.  PowerPC provides isync to enforce code dependencies
relatively cheaply; I am not sure about ARM.  Also, in many cases it is
quite easy to manufacture a data dependence where needed.

It is also easy to manually terminate a dependency chain -- just
invoke a function that does not preserve dependencies.
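
To make the data-versus-code distinction concrete, here is a minimal
sketch.  It assumes the std::atomic interface from later C++ standards,
which postdates this thread; the names Node, head, ready, payload,
publish(), read_via_pointer(), and read_via_flag() are hypothetical.
In the first reader the address of the second load is computed from the
first load's value (a data dependency).  In the second reader only a
branch separates the two loads (a code dependency), so the sketch falls
back to an acquire load there.

```cpp
#include <atomic>
#include <cassert>

struct Node { int value; };

std::atomic<Node*> head{nullptr};  // hypothetical published pointer
std::atomic<bool>  ready{false};   // hypothetical "data is valid" flag
int payload = 0;                   // plain data guarded by `ready`

// Writer: initialize the data, then publish it with release stores.
void publish(Node* n, int v) {
    assert(n != nullptr);
    n->value = v;
    head.store(n, std::memory_order_release);
    payload = v;
    ready.store(true, std::memory_order_release);
}

// Data dependency: the address of p->value is derived from the loaded
// pointer, so the dependent load cannot be issued before p is known.
// memory_order_consume is the later-standardized way to mark the head
// of such a chain; current compilers typically treat it as acquire.
int read_via_pointer() {
    Node* p = head.load(std::memory_order_consume);
    return p ? p->value : -1;
}

// Code (control) dependency only: the address of `payload` does not
// depend on the value of `ready`, so hardware may load it speculatively
// ahead of the branch.  An acquire load is used to order the two loads.
int read_via_flag() {
    if (ready.load(std::memory_order_acquire))
        return payload;
    return -1;
}
```

Both readers return -1 until publish() has run and the published value
afterwards.  The point is only where the ordering comes from in each
case, not that this is how a dependency-tracking compiler would do it.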

> >>The explicit per-object fence can handle these scenarios - in
> >>principle - since the user specifies the dependent object explicitly.
> >>
> >>i = atomic_load_relaxed( myindex );
> >>
> >>dependency_fence( &myarray[ i ] );
> >>dependency_fence( &myarray2[ i * 2 ] );
> >
> >OK, I'll bite -- what are the above dependency_fence() calls doing,
> >either in theory or in practice?
> 
> The hypothetical dependency_fence( p ) calls mark *p as a dependent object. 
> I'm not sure that they simplify the optimizer implementation compared to 
> your suggestion, though. Maybe not.

OK.

> >>r1 = myarray[ i ].foo;
> >>r2 = myarray[ i ].bar;
> >>r3 = myarray[ i*2 ].baz;
> 
> There's also the option of just providing dependency_fence() which maps to 
> atomic_compiler_fence( __acquire ) on everything except Alpha, where it 
> inserts an additional rmb as well. 

And for strongly ordered machines (e.g., C4 or TSO), this makes sense.
The required acquire fence is free (aside from preventing some compiler
optimizations).  However, on weakly ordered machines that enforce data
dependencies, the acquire fence would be excessively expensive.

So, yes, backends for strongly ordered machines and for Alpha can
ignore this issue entirely by simply emitting an acquire fence for
the atomic_load_relaxed().  But this gets quite expensive for weakly
ordered machines.
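
As a rough sketch of that fallback, assuming the std::atomic interface
from later C++ standards (it postdates this thread): dependency_fence()
below is a hypothetical helper that simply issues an acquire fence
after a relaxed load of the index, following the myindex/myarray
example quoted above.  Item, N, kNone, publish_item(), and read_sum()
are likewise made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct Item { int foo; int bar; };

constexpr std::size_t N = 8;
constexpr std::size_t kNone = N;          // sentinel: nothing published yet

Item myarray[N];
std::atomic<std::size_t> myindex{kNone};

// Hypothetical fallback: treat the dependency fence as an acquire fence.
// This gives up the cheaper dependency ordering that weakly ordered
// machines could otherwise provide.
inline void dependency_fence() {
    std::atomic_thread_fence(std::memory_order_acquire);
}

// Writer: fill in the element, then publish its index with release.
void publish_item(std::size_t i, int foo, int bar) {
    assert(i < N);
    myarray[i].foo = foo;
    myarray[i].bar = bar;
    myindex.store(i, std::memory_order_release);
}

// Reader: relaxed load of the index, then the fence, then the
// dependent accesses.  Returns -1 if nothing has been published.
int read_sum() {
    std::size_t i = myindex.load(std::memory_order_relaxed);
    if (i == kNone)
        return -1;
    dependency_fence();
    return myarray[i].foo + myarray[i].bar;
}
```

After publish_item(3, 10, 32), read_sum() returns 42.  The fence makes
the reader correct everywhere, at the cost described above on machines
where an acquire fence is a real instruction.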

						Thanx, Paul