[cpp-threads] Fences and atomic-variable-based coherence for large machines

Tue Apr 17 23:22:37 BST 2007

This is an attempt to come up with ways of reconciling the need for
explicit standalone fences with the desire for atomic-variable-based
coherence for posited large machines.  Since I do not have a complete
picture of what these posited large machines might look like, this list
is necessarily rather speculative in nature.  But it is at least a start.

1.	When generating code for machines using atomic-variable-based
	coherence, invent a global variable and attach each standalone
	fence to an access to it.  Upgrade acquire and release fences
	to acq_rel fences if available, or to fully ordered fences
	if not.  The compiler is free to choose what sort of access
	to make to the invented variable, be it load, store, atomic
	read-modify-write, or whatever.  This should function on such
	a machine, though performance might be unattractive, depending
	on how the posited large machine operates.
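
A minimal sketch of item 1 in ISO C++, under stated assumptions.  The
names invented_var, invented_fence_publish(), invented_fence_observe(),
and run_invented_fence_demo() are hypothetical, and the choice of an
atomic read-modify-write is only one of the accesses the compiler could
pick.  Note that standard C++ does not promise that an access to an
unrelated atomic variable orders relaxed accesses to other variables
the way a real fence does, so to stay well-defined the reader below
polls the invented variable itself.  On the posited machine the
compiler would be responsible for making the substitution valid.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the global variable the compiler would invent.
std::atomic<unsigned> invented_var{0};

int payload = 0;  // ordinary, non-atomic data

// Writer-side replacement for a standalone release fence, upgraded to
// acq_rel and expressed as a read-modify-write on the invented variable.
inline void invented_fence_publish() {
    invented_var.fetch_add(1, std::memory_order_acq_rel);
}

// Reader-side replacement for a standalone acquire fence: an acq_rel
// read-modify-write that leaves the value unchanged and returns it.
inline unsigned invented_fence_observe() {
    return invented_var.fetch_add(0, std::memory_order_acq_rel);
}

// Runs one writer and one reader; returns the payload the reader saw.
int run_invented_fence_demo() {
    invented_var.store(0);
    payload = 0;
    int seen = -1;

    std::thread writer([] {
        payload = 42;              // plain store
        invented_fence_publish();  // orders the store before the bump
    });
    std::thread reader([&seen] {
        // Wait until the writer's access to the invented variable is visible.
        while (invented_fence_observe() == 0)
            std::this_thread::yield();
        seen = payload;            // ordered after the writer's plain store
    });

    writer.join();
    reader.join();
    return seen;
}
```

Calling run_invented_fence_demo() should return the writer's value,
because the reader's read-modify-write reads the value written by the
writer's read-modify-write and therefore synchronizes with it.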

2.	As above, but invent multiple global variables for programs
	with partitionable data sets.  For example, a compilation
	unit with entirely private data accessed by multiple threads
	could have its own invented global variable.  The exact set of
	feasible partitioning strategies might well depend on details
	of the posited large machine's implementation.
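
The partitioned variant can be sketched the same way, again only as an
illustration.  Partition, publish(), consume(), and
run_partition_demo() are hypothetical names, and the assumption is that
each partition's data is touched only through that partition, so that
fences for one partition never access the other partition's invented
variable.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical partition: private data plus its own invented variable.
struct Partition {
    std::atomic<unsigned> invented{0};  // invented per partition
    int data = 0;                       // ordinary data private to the partition

    // Writer side: plain store, then the fence mapped onto this
    // partition's invented variable.
    void publish(int v) {
        data = v;
        invented.fetch_add(1, std::memory_order_acq_rel);
    }

    // Reader side: poll this partition's invented variable, then read.
    int consume() {
        while (invented.fetch_add(0, std::memory_order_acq_rel) == 0)
            std::this_thread::yield();
        return data;
    }
};

// Runs one writer/reader pair per partition.  Returns true if each reader
// saw its own partition's value and each invented variable was bumped
// exactly once, that is, no fence touched the other partition.
bool run_partition_demo() {
    Partition a, b;
    int seen_a = 0, seen_b = 0;

    std::thread wa([&] { a.publish(1); });
    std::thread wb([&] { b.publish(2); });
    std::thread ra([&] { seen_a = a.consume(); });
    std::thread rb([&] { seen_b = b.consume(); });
    wa.join();
    wb.join();
    ra.join();
    rb.join();

    return seen_a == 1 && seen_b == 2 &&
           a.invented.load() == 1 && b.invented.load() == 1;
}
```

Whether the posited machine would benefit from this split depends on
how it routes traffic for distinct atomic variables, which this sketch
does not model.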

3.	Let us assume that each unit of the posited large machine tracks
	modified variables and exports them upon ordered store and store
	release, tagging this export with the variable being stored to.
	Let us further assume that ordered load and load acquire result
	in a query of recent exports for that same variable, and in
	importing any corresponding exported variables.  Then a bare
	fence can result in exporting prior stores tagged by any following
	stores, and also in importing any exports tagged by any preceding
	loads.

	Whatever strategy is used to limit the size of export and
	import lists can presumably also be applied in this case --
	presumably there is such a strategy, particularly for
	data-intensive workloads.  In addition, note that the normal
	loads and stores need not result in any querying -- the query
	can be compiler-generated and associated with the fence itself.
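
The export/import scheme of item 3 can be sketched as a toy,
single-threaded simulation.  Everything here is an assumption made for
illustration: the Machine and Unit types, the idea that atomic
variables live in globally coherent storage while ordinary variables
travel only in exports, and the fence() signature, whose tag list
stands in for whatever set of variables the compiler associates with
the fence.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Values = std::map<std::string, int>;

// One export: modified ordinary variables, tagged with an atomic variable.
struct Export {
    std::string tag;
    Values values;
};

// Toy machine: atomics are globally coherent, ordinary variables are not.
struct Machine {
    Values atomics;
    std::vector<Export> exports;
};

struct Unit {
    Machine& m;
    Values local;  // this unit's view of ordinary variables
    Values dirty;  // ordinary variables modified since the last export

    explicit Unit(Machine& machine) : m(machine) {}

    // Normal accesses never query the export list.
    void store(const std::string& var, int v) { local[var] = v; dirty[var] = v; }
    int load(const std::string& var) { return local[var]; }

    void store_relaxed(const std::string& a, int v) { m.atomics[a] = v; }
    int load_relaxed(const std::string& a) { return m.atomics[a]; }

    void export_tagged(const std::string& tag) {
        if (!dirty.empty()) m.exports.push_back(Export{tag, dirty});
    }
    void import_tagged(const std::string& tag) {
        for (const Export& e : m.exports)  // log order: later exports win
            if (e.tag == tag)
                for (const auto& kv : e.values) local[kv.first] = kv.second;
    }

    // Store release: export modified variables tagged with the target.
    void store_release(const std::string& a, int v) {
        export_tagged(a);
        dirty.clear();
        m.atomics[a] = v;
    }
    // Load acquire: query exports tagged with the variable being loaded.
    int load_acquire(const std::string& a) {
        int v = m.atomics[a];
        import_tagged(a);
        return v;
    }
    // Bare fence: the export and the query are attached to the fence itself.
    void fence(const std::vector<std::string>& tags) {
        for (const std::string& t : tags) export_tagged(t);
        dirty.clear();
        for (const std::string& t : tags) import_tagged(t);
    }
};

// Message passing through bare fences and relaxed accesses to "flag".
bool run_export_import_demo() {
    Machine m;
    Unit a(m), b(m);
    a.store("data", 42);
    a.fence({"flag"});
    a.store_relaxed("flag", 1);

    bool stale_before = (b.load("data") == 0);  // nothing imported yet
    int flag = b.load_relaxed("flag");
    b.fence({"flag"});
    return stale_before && flag == 1 && b.load("data") == 42;
}
```

The sketch makes no attempt to bound the export list, which is the
part the text above leaves to whatever strategy the machine already
has.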

4.	The posited large machine might well use an invalidation
	strategy, so that imports receive not new values, but rather
	lists of memory regions to be invalidated.  Such a strategy
	might work well in cases where data items are relatively
	large and where a given exported variable is unlikely to
	actually be read.  The approach of #3 could be adapted to
	this case as well.
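
A toy sketch of the invalidation variant, under the same kind of
assumptions: InvMachine, InvUnit, the "home" backing store, and the
fetches counter are all hypothetical.  An export here carries only the
names of modified regions, an import drops matching cached copies, and
a value crosses the machine only if it is later read.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One export: names of modified regions, tagged with an atomic variable.
struct Invalidation {
    std::string tag;
    std::set<std::string> regions;
};

struct InvMachine {
    std::map<std::string, int> home;     // backing store for ordinary data
    std::map<std::string, int> atomics;  // globally coherent atomics
    std::vector<Invalidation> exports;
};

struct InvUnit {
    InvMachine& m;
    std::map<std::string, int> cache;  // locally cached ordinary data
    std::set<std::string> dirty;       // regions modified since last export
    int fetches = 0;                   // values actually pulled from home

    explicit InvUnit(InvMachine& machine) : m(machine) {}

    void store(const std::string& var, int v) { cache[var] = v; dirty.insert(var); }

    int load(const std::string& var) {
        auto it = cache.find(var);
        if (it == cache.end()) {  // miss: fetch the value on demand
            ++fetches;
            it = cache.emplace(var, m.home[var]).first;
        }
        return it->second;
    }

    // Store release: write back modified regions, export only their names.
    void store_release(const std::string& a, int v) {
        for (const std::string& var : dirty) m.home[var] = cache[var];
        m.exports.push_back(Invalidation{a, dirty});
        dirty.clear();
        m.atomics[a] = v;
    }

    // Load acquire: import invalidations instead of new values.  Locally
    // modified regions are left alone; conflicting writes are out of scope.
    int load_acquire(const std::string& a) {
        int v = m.atomics[a];
        for (const Invalidation& e : m.exports)
            if (e.tag == a)
                for (const std::string& r : e.regions)
                    if (dirty.count(r) == 0) cache.erase(r);
        return v;
    }
};

bool run_invalidation_demo() {
    InvMachine m;
    InvUnit a(m), b(m);
    b.load("big");                         // b caches the old value (fetch 1)
    a.store("big", 7);
    a.store("other", 9);
    a.store_release("flag", 1);

    int stale = b.load("big");             // still the cached old value
    int flag = b.load_acquire("flag");     // drops b's copy of "big"
    int fresh = b.load("big");             // refetched from home (fetch 2)
    // "other" was exported but never read by b, so it was never fetched.
    return stale == 0 && flag == 1 && fresh == 7 && b.fetches == 2;
}
```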

5.	The posited large machine would need to correctly handle the
	case where each CPU performed partially conflicting writes to a
	large group of non-atomic variables, then did a store-release to a
	single global variable.  In particular, the posited large machine
	would need to globally determine which store-release "won" for
	each group of conflicting writes to normal variables.  It should
	be possible to leverage whatever mechanism is implemented to
	perform this global determination in order to similarly resolve
	explicit fences.

As noted earlier, a start...

						Thanx, Paul