[cpp-threads] Fences and atomic-variable-based coherence for large
machines
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Tue Apr 17 23:22:37 BST 2007
This is an attempt to come up with ways of reconciling the need for
explicit standalone fences with the desire for atomic-variable-based
coherence for posited large machines. Since I do not have a complete
picture of what these posited large machines might look like, this list
is necessarily rather speculative in nature. But it is at least a start.
1. When generating code for machines using atomic-variable-based
coherence, invent a global variable and map each standalone fence
onto an access to it. Upgrade acquire and release fences to acq_rel
fences if available, or to fully ordered fences if not. The compiler
is free to determine what sort of access to the invented variable to
use, be it load, store, atomic read-modify-write, or whatever. This
should function
on such a machine, though performance might be unattractive,
depending on how the posited large machine operates.
2. As above, but invent multiple global variables for programs
with partitionable data sets. For example, a compilation
unit with entirely private data accessed by multiple threads
could have its own invented global variable. The exact set of
feasible partitioning strategies might well depend on details
of the posited large machine's implementation.
3. Let us assume that each unit of the posited large machine tracks
modified variables and exports them upon ordered store and store
release, tagging this export with the variable being stored to.
Let us further assume that ordered load and load acquire result
in a query of recent exports for that same variable, and in the
importing of any corresponding exported variables. Then a bare
fence can result in exporting prior stores tagged by any following
stores, and also in importing any exports tagged by any prior
loads.
Whatever strategy is used to limit the size of export and
import lists can presumably also be applied in this case --
presumably there is such a strategy, particularly for
data-intensive workloads. In addition, note that the normal
loads and stores need not result in any querying -- the query
can be compiler-generated and associated with the fence itself.
4. The posited large machine might well use an invalidation
strategy, so that imports receive not new values, but rather
lists of memory regions to be invalidated. Such a strategy
might work well in cases where data items are relatively
large and where there is a low probability that a given
exported variable is actually read. The approach of
#3 could be adapted to this case as well.
5. The posited large machine would need to correctly handle the
case where each CPU performed partially conflicting writes to a
large group of non-atomic variables, then did a store-release to a
single global variable. In particular, the posited large machine
would need to globally determine which store-release "won" for
each group of conflicting writes to normal variables. It should
be possible to leverage whatever mechanism is implemented to
perform this global determination in order to similarly resolve
explicit fences.
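To make #1 a bit more concrete, here is a minimal C++ sketch under stated assumptions. The names invented_fence_var and emulated_fence() are hypothetical stand-ins for whatever the compiler would generate, and an acq_rel atomic read-modify-write is only one of the access choices that #1 leaves open. The message-passing harness around it is purely illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical global variable "invented" by the compiler (assumption).
std::atomic<unsigned> invented_fence_var{0};

// Stand-in for the code emitted in place of a standalone fence:
// an acq_rel read-modify-write on the invented variable.
inline void emulated_fence() {
    invented_fence_var.fetch_add(0, std::memory_order_acq_rel);
}

// Classic message passing, with both fences replaced by emulated_fence().
int run_message_passing() {
    int payload = 0;              // ordinary, non-atomic data
    std::atomic<int> flag{0};     // accessed with relaxed ordering only
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        emulated_fence();         // would have been a release fence
        flag.store(1, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (flag.load(std::memory_order_relaxed) == 0) {
            std::this_thread::yield();
        }
        emulated_fence();         // would have been an acquire fence
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Calling run_message_passing() should return 42. The two read-modify-write operations are ordered in the invented variable's modification order, and the consumer's can only come second once it has seen the flag, so the consumer's access synchronizes with the producer's. Whether a plain load or store would suffice instead depends on the posited machine.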
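The export/import scheme of #3 can be played out in a toy, single-threaded C++ model. Everything in it is an assumption made for illustration: the Unit, Interconnect, and Export types, the unbounded export list, and the choice to treat atomic variables as always coherent. It also assumes the usual C++ pairing, in which a release fence takes effect through later stores and an acquire fence through earlier loads.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of #3. Every name here is hypothetical.
struct Export {
    std::string tag;                    // atomic variable the export is tagged with
    std::map<std::string, int> values;  // modified ordinary variables being shipped
};

struct Interconnect {
    std::map<std::string, int> atomics; // atomic variables, assumed always coherent
    std::vector<Export> exports;        // "recent exports", unbounded in this toy
};

class Unit {
public:
    explicit Unit(Interconnect& net) : net_(net) {}

    // Normal accesses are purely local and trigger no querying.
    void store(const std::string& var, int v) { local_[var] = v; dirty_.insert(var); }
    int load(const std::string& var) const {
        auto it = local_.find(var);
        return it == local_.end() ? 0 : it->second;
    }

    // Store release: export tracked variables, tagged with the target variable.
    void store_release(const std::string& tag, int v) {
        export_vars(tag, dirty_);
        net_.atomics[tag] = v;
    }
    // Load acquire: query exports for that same variable and import them.
    int load_acquire(const std::string& tag) {
        int v = net_.atomics[tag];
        import_tagged(tag);
        return v;
    }

    // Bare release fence: remember which prior stores later stores must export.
    void release_fence() { fenced_ = dirty_; }
    void store_relaxed(const std::string& tag, int v) {
        if (!fenced_.empty()) export_vars(tag, fenced_);
        net_.atomics[tag] = v;
    }
    // Relaxed load: no import yet, but remember the tag for a later fence.
    int load_relaxed(const std::string& tag) {
        loaded_tags_.insert(tag);
        return net_.atomics[tag];
    }
    // Bare acquire fence: import exports tagged by any prior load.
    void acquire_fence() {
        for (const auto& tag : loaded_tags_) import_tagged(tag);
        loaded_tags_.clear();
    }

private:
    void export_vars(const std::string& tag, const std::set<std::string>& vars) {
        Export e{tag, {}};
        for (const auto& var : vars) e.values[var] = load(var);
        net_.exports.push_back(e);
    }
    void import_tagged(const std::string& tag) {
        for (std::size_t i = 0; i < net_.exports.size(); ++i) {
            // Skip exports with other tags and exports already imported.
            if (net_.exports[i].tag != tag || !seen_.insert(i).second) continue;
            for (const auto& kv : net_.exports[i].values) local_[kv.first] = kv.second;
        }
    }

    Interconnect& net_;
    std::map<std::string, int> local_;   // this unit's view of ordinary variables
    std::set<std::string> dirty_;        // modified variables, never pruned here
    std::set<std::string> fenced_;       // stores covered by a release fence
    std::set<std::string> loaded_tags_;  // tags of loads since the last acquire fence
    std::set<std::size_t> seen_;         // indices of exports already imported
};

// Producer and consumer steps interleaved by hand in a single thread.
int demo_bare_fences() {
    Interconnect net;
    Unit producer(net), consumer(net);

    producer.store("payload", 42);
    producer.release_fence();
    producer.store_relaxed("flag", 1);

    int flag = consumer.load_relaxed("flag");
    int before = consumer.load("payload");  // nothing imported yet
    consumer.acquire_fence();
    int after = consumer.load("payload");
    return (flag == 1 && before == 0) ? after : -1;
}
```

Here demo_bare_fences() returns 42: the consumer sees the payload only after its acquire fence imports the export tagged with "flag". The model never prunes its dirty set or export list, so any strategy for limiting those lists would have to be layered on top.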
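The invalidation variant of #4 can be sketched along the same lines. This is again a toy, single-threaded C++ model built on assumptions: the CachingUnit, Shared, and Invalidation names are hypothetical, and the idea that exported data is written back to a shared "home" store, to be fetched on demand after invalidation, is one possible design that the item itself does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model of #4. Every name here is hypothetical.
struct Invalidation {
    std::string tag;                // variable stored to by the store release
    std::set<std::string> regions;  // memory regions the importer must invalidate
};

struct Shared {
    std::map<std::string, int> home;    // assumed backing store for exported data
    std::vector<Invalidation> notices;  // exported invalidation lists
};

class CachingUnit {
public:
    explicit CachingUnit(Shared& s) : shared_(s) {}

    void store(const std::string& region, int v) {
        cache_[region] = v;
        dirty_.insert(region);
    }
    // A miss fetches from the home store; hits are served locally.
    int load(const std::string& region) {
        auto it = cache_.find(region);
        if (it != cache_.end()) return it->second;
        ++fetches_;
        return cache_[region] = shared_.home[region];
    }
    // Export: write data back, then publish only the list of modified regions.
    void store_release(const std::string& tag) {
        for (const auto& r : dirty_) shared_.home[r] = cache_[r];
        shared_.notices.push_back({tag, dirty_});
        dirty_.clear();
    }
    // Import: receive no values, just drop the listed regions from the cache.
    void load_acquire(const std::string& tag) {
        for (std::size_t i = 0; i < shared_.notices.size(); ++i) {
            if (shared_.notices[i].tag != tag || !seen_.insert(i).second) continue;
            for (const auto& r : shared_.notices[i].regions) cache_.erase(r);
        }
    }
    int fetches() const { return fetches_; }

private:
    Shared& shared_;
    std::map<std::string, int> cache_;
    std::set<std::string> dirty_;
    std::set<std::size_t> seen_;  // indices of notices already processed
    int fetches_ = 0;
};

// Returns {value of "a" after the import, number of fetches by the consumer}.
std::pair<int, int> demo_invalidation() {
    Shared shared;
    CachingUnit producer(shared), consumer(shared);

    consumer.load("a");               // caches a stale copy of "a"
    producer.store("a", 1);
    producer.store("b", 2);
    producer.store("c", 3);
    producer.store_release("ready");

    consumer.load_acquire("ready");   // invalidates "a", "b", and "c"
    int a = consumer.load("a");       // refetched on demand
    return std::make_pair(a, consumer.fetches());
}
```

In this run the consumer ends up with a == 1 after two fetches in total, and never transfers "b" or "c" because it never reads them, which is the case where invalidation might pay off.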
As noted earlier, a start...
Thanx, Paul