[cpp-threads] C++ memory-fence API proposal
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Sun Apr 15 09:22:23 BST 2007
The following fence API proposal is loosely based on one from Doug Lea.
This differs from Doug's proposal in (1) defining compiler fences,
(2) providing template-based function definitions in order to bury the
assignments into the fence primitive, and (3) being composed of naked
function definitions rather than a class. This latter is required
due to the way templates work in C++: a class template seems to need
the specified type to be passed in to all of its members, which is not
possible given that several of these definitions have neither arguments
nor return value. Another option would be to include the template-based
member functions in a non-template class. This might be desireable from
a namespace standpoint, but does not seem compelling at the moment.
(Feel free to educate me on this latter point.)
void compilerFence(void);
void postLoadFence(void);
void preStoreFence(void);
void postStorePreLoadFence(void);
template<typename T> T perObjectCompilerPostLoadFence(T *base);
template<typename T> void perObjectCompilerPreStoreFence(T *base, T newval);
template<typename T> T perObjectPostLoadFence(T *base);
template<typename T> void perObjectPreStoreFence(T *base, T newval);
Each of these functions is described in the following sections.
void compilerFence(void);
Prevents the compiler from applying code-motion optimizations
that would permit memory references to be reordered across
the compilerFence() invocation. This is useful when using
variables local to a given thread that are used to communicate
with signal/interrupt handlers. An example use is as
follows in mainline code:
	critical_section = 1;
	compilerFence();
	do_something_critical();
	compilerFence();
	critical_section = 0;
	if (need_deferred) {
		block_handlers(); /* e.g., block signals. */
		non_critical_processing();
		need_deferred = 0;
		unblock_handlers();
	}
This might be combined with the following invoked from an
interrupt or signal handler:
	if (critical_section)
		need_deferred = 1;
	else
		non_critical_processing();
void postLoadFence(void);
This has also been called an "acquire fence". It guarantees
that any subsequent loads and stores are ordered after any
prior load. This primitive thus can be said to have "subscribe"
semantics. An example use is as follows:
	if (globalflag) {
		postLoadFence();
		do_something_with(globaldata);
	}
void preStoreFence(void);
This has also been called a "release fence". It guarantees that
any prior loads and stores are ordered before any subsequent
stores. This primitive can be said to have "publish" semantics.
An example use is as follows:
	globaldata = define_global_data();
	preStoreFence();
	globalflag = 1;
void postStorePreLoadFence(void);
This has also been called a "full fence". It guarantees that
any prior loads and stores are ordered before any subsequent
loads and stores. On many hardware implementations, this
primitive is significantly more expensive than postLoadFence()
or preStoreFence(). One classic use is in Dekker's algorithm,
loosely as follows:
	T1                              T2
	y = 1;                          x = 1;
	postStorePreLoadFence();        postStorePreLoadFence();
	r1 = x;                         r2 = y;
template<typename T> T perObjectCompilerPostLoadFence(T *base);
template<typename T> void perObjectCompilerPreStoreFence(T *base, T newval);
These two primitives can be used in signal/interrupt handlers in
a manner similar to compilerFence() in some cases, but they
permit the compiler more freedom. Where compilerFence()
prohibits all code motion, perObjectCompilerPostLoadFence() and
perObjectCompilerPreStoreFence() prohibit motion only of code
referring to the same variable or of other fence operations.
Examples where these primitives would be useful may be found
in some of the realtime RCU implementations presented in:
http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf
template<typename T> T perObjectPostLoadFence(T *base);
This primitive can be thought of as a light-weight subscribe
operation where there is a data dependency between the
data-availability indication and the data itself, for example,
when the indication is the pointer to the data. An example is
as follows:
	struct pubdata {
		int a;
		int b;
	};
	struct pubdata *pd = NULL;
	int x;
	int y;

	/* ... */

	struct pubdata *localpd;

	localpd = perObjectPostLoadFence(&pd);
	r1 = localpd->a;  /* ordered. */
	localpd->b = 2;   /* ordered. */
	x = 3;            /* not ordered: no dependency. */
	r2 = y;           /* not ordered: no dependency. */
This intrinsic could be used to implement the rcu_dereference()
primitive in the Linux kernel.
Please note that reliance on data dependencies has been quite
controversial throughout the C/C++ memory-model standardization
process. That said, compilers for architectures preferring
not to rely on dependencies can simply emit appropriate fence
instructions.
Note that it is often desirable for data dependencies to pass
through functions in separate compilation units. This is
handled by decorating the function declaration and definition
as described elsewhere. Please note that dependency chains
are -not- guaranteed to propagate through undecorated functions.
template<typename T> void perObjectPreStoreFence(T *base, T newval);
This intrinsic orders prior stores to a structure with
subsequent storing of a pointer to that structure. It can
be thought of as a data-dependency publish corresponding to
the perObjectPostLoadFence() data-dependency subscribe.
An example is as follows, continuing with the previous data
structures:
	struct pubdata *p = new pubdata;

	p->a = 1;  /* ordered. */
	p->b = 1;  /* ordered. */
	x = 1;     /* not ordered: no dependency. */
	y = 1;     /* not ordered: no dependency. */
	perObjectPreStoreFence(&pd, p);
This intrinsic could be used to implement the rcu_assign_pointer()
primitive in the Linux kernel.
Implementation
The following is an implementation of these fence primitives for
recent x86 CPUs, assuming a rumored tighter memory-order
definition for such CPUs. Note that this implementation
assumes that the compiler refrains from emitting weakly
ordered SSE/MMX instructions.
static inline void
compilerFence(void) {
	__asm__ __volatile__("":::"memory");
}

static inline void
postLoadFence(void) {
	compilerFence();
}

static inline void
preStoreFence(void) {
	compilerFence();
}

static inline void
postStorePreLoadFence(void) {
	__asm__ __volatile__("mfence":::"memory");
}

template<typename T> static inline T
perObjectCompilerPostLoadFence(T *base) {
	T retval = *(volatile T *)base;
	return retval;
}

template<typename T> static inline void
perObjectCompilerPreStoreFence(T *base, T newval) {
	*(volatile T *)base = newval;
}

template<typename T> static inline T
perObjectPostLoadFence(T *base) {
	T retval = *(volatile T *)base;
	return retval;
}

template<typename T> static inline void
perObjectPreStoreFence(T *base, T newval) {
	*(volatile T *)base = newval;
}
In contrast, the currently documented memory model for this CPU
family, which allows loads to be reordered with other loads and
with prior stores, would require something like the following.
static inline void
compilerFence(void) {
	__asm__ __volatile__("":::"memory");
}

static inline void
postLoadFence(void) {
	__asm__ __volatile__("mfence":::"memory");
}

static inline void
preStoreFence(void) {
	compilerFence();
}

static inline void
postStorePreLoadFence(void) {
	__asm__ __volatile__("mfence":::"memory");
}

template<typename T> static inline T
perObjectCompilerPostLoadFence(T *base) {
	T retval = *(volatile T *)base;
	return retval;
}

template<typename T> static inline void
perObjectCompilerPreStoreFence(T *base, T newval) {
	*(volatile T *)base = newval;
}

template<typename T> static inline T
perObjectPostLoadFence(T *base) {
	T retval = *(volatile T *)base;
	return retval;
}

template<typename T> static inline void
perObjectPreStoreFence(T *base, T newval) {
	*(volatile T *)base = newval;
}
A weakly ordered system such as POWER might be implemented
as follows (but please note that this has not yet been vetted
by POWER architects):
static inline void
compilerFence(void) {
	__asm__ __volatile__("":::"memory");
}

static inline void
postLoadFence(void) {
	__asm__ __volatile__("lwsync":::"memory");
}

static inline void
preStoreFence(void) {
	__asm__ __volatile__("lwsync":::"memory");
}

static inline void
postStorePreLoadFence(void) {
	__asm__ __volatile__("sync":::"memory");
}

template<typename T> static inline T
perObjectCompilerPostLoadFence(T *base) {
	T retval = *(volatile T *)base;
	return retval;
}

template<typename T> static inline void
perObjectCompilerPreStoreFence(T *base, T newval) {
	*(volatile T *)base = newval;
}

template<typename T> static inline T
perObjectPostLoadFence(T *base) {
	T retval = *(volatile T *)base;
	return retval;
}

template<typename T> static inline void
perObjectPreStoreFence(T *base, T newval) {
	__asm__ __volatile__("eieio":::"memory");
	*(volatile T *)base = newval;
}