[cpp-threads] C++ memory-fence API proposal

Paul E. McKenney paulmck at linux.vnet.ibm.com
Sun Apr 15 09:22:23 BST 2007


The following fence API proposal is loosely based on one from Doug Lea.
This differs from Doug's proposal in (1) defining compiler fences,
(2) providing template-based function definitions in order to bury the
assignments into the fence primitive, and (3) being composed of naked
function definitions rather than a class.  The latter is required
due to the way templates work in C++: a class template seems to need
the specified type to be passed in to all of its members, which is not
possible given that several of these definitions have neither arguments
nor return value.  Another option would be to include the template-based
member functions in a non-template class.  This might be desirable from
a namespace standpoint, but does not seem compelling at the moment.
(Feel free to educate me on this latter point.)


void compilerFence(void);
void postLoadFence(void);
void preStoreFence(void);
void postStorePreLoadFence(void);
template<typename T> T perObjectCompilerPostLoadFence(T *base);
template<typename T> void perObjectCompilerPreStoreFence(T *base, T newval);
template<typename T> T perObjectPostLoadFence(T *base);
template<typename T> void perObjectPreStoreFence(T *base, T newval);


Each of these functions is described in the following sections.


void compilerFence(void);

	Prevents the compiler from applying code-motion optimizations
	that would permit memory references to be reordered across
	the compilerFence() invocation.  This is useful for variables
	local to a given thread that are used to communicate with
	signal/interrupt handlers.  An example use in mainline code
	is as follows:

		critical_section = 1;
		compilerFence();
		do_something_critical();
		compilerFence();
		critical_section = 0;
		if (need_deferred) {
			block_handlers();  /* e.g., block signals. */
			non_critical_processing();
			need_deferred = 0;
			unblock_handlers();
		}

	This might be combined with the following invoked from an
	interrupt or signal handler:

		if (critical_section)
			need_deferred = 1;
		else
			non_critical_processing();
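	The mainline and handler fragments above can be fleshed out into a
	compilable sketch.  The in-line handler() call below simulates a
	signal arriving during the critical section; handler(), mainline(),
	and processing_count are illustrative names, not part of the
	proposal:

		#include <cassert>

		/* Stand-in for the proposed primitive (GCC/Clang asm syntax). */
		static inline void compilerFence(void) {
			__asm__ __volatile__("":::"memory");
		}

		/* Variables shared between mainline code and the handler. */
		static volatile int critical_section = 0;
		static volatile int need_deferred = 0;
		static int processing_count = 0;

		static void non_critical_processing(void) {
			++processing_count;
		}

		/* Body of the signal/interrupt handler from the text. */
		static void handler(void) {
			if (critical_section)
				need_deferred = 1;
			else
				non_critical_processing();
		}

		static void mainline(void) {
			critical_section = 1;
			compilerFence();
			handler();	/* simulated asynchronous arrival. */
			compilerFence();
			critical_section = 0;
			if (need_deferred) {
				/* block_handlers() would go here. */
				non_critical_processing();
				need_deferred = 0;
				/* unblock_handlers() would go here. */
			}
		}

	Without the fences, the compiler would be free to move the
	critical_section assignments relative to the critical work,
	defeating the handshake with the handler.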


void postLoadFence(void);

	This has also been called an "acquire fence".  It guarantees
	that any subsequent loads and stores are ordered after any
	prior load.  This primitive thus can be said to have "subscribe"
	semantics.  An example use is as follows:

		if (globalflag) {
			postLoadFence();
			do_something_with(globaldata);
		}


void preStoreFence(void);

	This has also been called a "release fence".  It guarantees that
	any prior loads and stores are ordered before any subsequent
	stores.  This primitive can be said to have "publish" semantics.
	An example use is as follows:

		globaldata = define_global_data();
		preStoreFence();
		globalflag = 1;
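	The two primitives pair naturally: preStoreFence() publishes and
	postLoadFence() subscribes.  The following single-threaded sketch
	(publish(), subscribe(), and the value 42 are illustrative) uses
	compiler-fence implementations, which the implementation section
	argues suffice on x86:

		#include <cassert>

		static inline void postLoadFence(void) {
			__asm__ __volatile__("":::"memory");
		}
		static inline void preStoreFence(void) {
			__asm__ __volatile__("":::"memory");
		}

		static int globaldata = 0;
		static volatile int globalflag = 0;

		/* Publisher: initialize the data, then set the flag. */
		static void publish(void) {
			globaldata = 42;
			preStoreFence();  /* prior stores before the flag store. */
			globalflag = 1;
		}

		/* Subscriber: test the flag, then read the data. */
		static int subscribe(void) {
			if (globalflag) {
				postLoadFence();  /* later accesses after the flag load. */
				return globaldata;
			}
			return -1;	/* data not yet published. */
		}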


void postStorePreLoadFence(void);

	This has also been called a "full fence".  It guarantees that
	any prior loads and stores are ordered before any subsequent
	loads and stores.  On many hardware implementations, this
	primitive is significantly more expensive than postLoadFence()
	or preStoreFence().  One classic use is in Dekker's algorithm,
	loosely as follows:

		T1				T2

		y = 1;				x = 1;
		postStorePreLoadFence();	postStorePreLoadFence(); 
		r1 = x;				r2 = y;
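	The two thread bodies can be sketched as functions.  The builtin
	__sync_synchronize() stands in for postStorePreLoadFence() so the
	sketch is not tied to x86; t1() and t2() are illustrative names.
	With the full fence, at least one thread must observe the other's
	store, so the outcome r1 == 0 && r2 == 0 is forbidden; without it,
	both loads could return zero:

		#include <cassert>

		/* Portable stand-in for the full fence; on x86 this would
		 * be "mfence", per the implementation section. */
		static inline void postStorePreLoadFence(void) {
			__sync_synchronize();  /* GCC/Clang full-barrier builtin. */
		}

		static volatile int x = 0;
		static volatile int y = 0;

		/* Thread-1 body of the pattern above. */
		static int t1(void) {
			y = 1;
			postStorePreLoadFence();
			return x;	/* T1's r1. */
		}

		/* Thread-2 body. */
		static int t2(void) {
			x = 1;
			postStorePreLoadFence();
			return y;	/* T2's r2. */
		}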


template<typename T> T perObjectCompilerPostLoadFence(T *base);
template<typename T> void perObjectCompilerPreStoreFence(T *base, T newval);

	These two primitives can be used in signal/interrupt handlers in
	a manner similar to compilerFence() in some cases, but they
	permit the compiler more freedom.  Where compilerFence()
	prohibits all code motion, perObjectCompilerPostLoadFence() and
	perObjectCompilerPreStoreFence() prohibit motion only of code
	referring to the same variable or of other fence operations.

	Examples where these primitives would be useful may be found
	in some of the realtime RCU implementations presented in:

	http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf
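	A minimal sketch of the per-object compiler fences, using the
	volatile-access implementation shown later in this proposal
	(handler_flag and the accessor names are illustrative): only
	accesses to the named object, and other fences, are pinned in
	place, so the compiler may still move unrelated code.

		#include <cassert>

		template<typename T> static inline T
		perObjectCompilerPostLoadFence(T *base) {
			return *(volatile T *)base;
		}

		template<typename T> static inline void
		perObjectCompilerPreStoreFence(T *base, T newval) {
			*(volatile T *)base = newval;
		}

		/* A flag shared with a signal/interrupt handler. */
		static int handler_flag = 0;

		static void set_handler_flag(void) {
			perObjectCompilerPreStoreFence(&handler_flag, 1);
		}

		static int read_handler_flag(void) {
			return perObjectCompilerPostLoadFence(&handler_flag);
		}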


template<typename T> T perObjectPostLoadFence(T *base);

	This primitive can be thought of as a light-weight subscribe
	operation for use where there is a data dependency between the
	data-availability indication and the data itself, for example,
	when the indication is the pointer to the data.  An example is
	as follows:

		struct pubdata {
			int a;
			int b;
		};

		struct pubdata *pd = NULL;
		struct pubdata *localpd;
		int x, y, r1, r2;

		/* ... */

		localpd = perObjectPostLoadFence(&pd);
		r1 = localpd->a;  /* ordered. */
		localpd->b = 2;   /* ordered. */
		x = 3;		  /* not ordered: no dependency. */
		r2 = y;		  /* not ordered: no dependency. */

	This intrinsic could be used to implement the rcu_dereference()
	primitive in the Linux kernel.

	Please note that reliance on data dependencies has been quite
	controversial throughout the C/C++ memory-model standardization
	process.  That said, compilers for architectures preferring
	not to rely on dependencies can simply emit appropriate fence
	instructions.

	Note that it is often desirable for data dependencies to pass
	through functions in separate compilation units.  This is
	handled by decorating the function declaration and definition
	as described elsewhere.  Please note that dependency chains
	are -not- guaranteed to propagate through undecorated functions.


template<typename T> void perObjectPreStoreFence(T *base, T newval);

	This intrinsic orders prior stores to a structure with
	subsequent storing of a pointer to that structure.  It can
	be thought of as a data-dependency publish corresponding to
	the perObjectPostLoadFence() data-dependency subscribe.
	An example is as follows, continuing with the previous data
	structures:

		struct pubdata *p = new pubdata;

		p->a = 1;	/* ordered. */
		p->b = 1;	/* ordered. */
		x = 1;		/* not ordered: no dependency. */
		y = 1;		/* not ordered: no dependency. */
		perObjectPreStoreFence(&pd, p);

	This intrinsic could be used to implement the rcu_assign_pointer()
	primitive in the Linux kernel.
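	The publish and subscribe sides fit together as follows, using the
	volatile-access implementation from the x86 section below (weakly
	ordered CPUs would add fence instructions); publisher() and
	subscriber() are illustrative names:

		#include <cassert>
		#include <cstddef>

		template<typename T> static inline T
		perObjectPostLoadFence(T *base) {
			return *(volatile T *)base;
		}

		template<typename T> static inline void
		perObjectPreStoreFence(T *base, T newval) {
			*(volatile T *)base = newval;
		}

		struct pubdata {
			int a;
			int b;
		};

		static struct pubdata *pd = NULL;

		/* Publisher: initialize the structure, then publish the
		 * pointer (rcu_assign_pointer() analogue). */
		static void publisher(struct pubdata *p) {
			p->a = 1;
			p->b = 2;
			perObjectPreStoreFence(&pd, p);
		}

		/* Subscriber: fetch the pointer; accesses through it carry
		 * the data dependency (rcu_dereference() analogue). */
		static int subscriber(void) {
			struct pubdata *localpd = perObjectPostLoadFence(&pd);
			return localpd ? localpd->a + localpd->b : -1;
		}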


Implementation

	The following is an implementation of these fence primitives
	for recent x86 CPUs, assuming a rumored tighter memory-order
	definition for such CPUs.  Note that this implementation
	assumes that the compiler refrains from emitting weakly
	ordered SSE/MMX instructions.

		static inline void
		compilerFence(void) {
			__asm__ __volatile__("":::"memory");
		}

		static inline void
		postLoadFence(void) {
			compilerFence();
		}

		static inline void
		preStoreFence(void) {
			compilerFence();
		}

		static inline void
		postStorePreLoadFence(void) {
			__asm__ __volatile__("mfence":::"memory");
		}

		template<typename T> static inline T
		perObjectCompilerPostLoadFence(T *base) {
			T retval = *(volatile T *)base;

			return retval;
		}

		template<typename T> static inline void
		perObjectCompilerPreStoreFence(T *base, T newval) {
			*(volatile T *)base = newval;
		}

		template<typename T> static inline T
		perObjectPostLoadFence(T *base) {
			T retval = *(volatile T *)base;

			return retval;
		}

		template<typename T> static inline void
		perObjectPreStoreFence(T *base, T newval) {
			*(volatile T *)base = newval;
		}


	In contrast, the currently documented memory model for this CPU
	family, which allows loads to be reordered with other loads and
	with prior stores, would require something like the following.

		static inline void
		compilerFence(void) {
			__asm__ __volatile__("":::"memory");
		}

		static inline void
		postLoadFence(void) {
			__asm__ __volatile__("mfence":::"memory");
		}

		static inline void
		preStoreFence(void) {
			compilerFence();
		}

		static inline void
		postStorePreLoadFence(void) {
			__asm__ __volatile__("mfence":::"memory");
		}

		template<typename T> static inline T
		perObjectCompilerPostLoadFence(T *base) {
			T retval = *(volatile T *)base;

			return retval;
		}

		template<typename T> static inline void
		perObjectCompilerPreStoreFence(T *base, T newval) {
			*(volatile T *)base = newval;
		}

		template<typename T> static inline T
		perObjectPostLoadFence(T *base) {
			T retval = *(volatile T *)base;

			return retval;
		}

		template<typename T> static inline void
		perObjectPreStoreFence(T *base, T newval) {
			*(volatile T *)base = newval;
		}


	A weakly ordered system such as POWER might be implemented
	as follows (but please note that this has not yet been vetted
	by POWER architects):

		static inline void
		compilerFence(void) {
			__asm__ __volatile__("":::"memory");
		}

		static inline void
		postLoadFence(void) {
			__asm__ __volatile__("lwsync":::"memory");
		}

		static inline void
		preStoreFence(void) {
			__asm__ __volatile__("lwsync":::"memory");
		}

		static inline void
		postStorePreLoadFence(void) {
			__asm__ __volatile__("sync":::"memory");
		}

		template<typename T> static inline T
		perObjectCompilerPostLoadFence(T *base) {
			T retval = *(volatile T *)base;

			return retval;
		}

		template<typename T> static inline void
		perObjectCompilerPreStoreFence(T *base, T newval) {
			*(volatile T *)base = newval;
		}

		template<typename T> static inline T
		perObjectPostLoadFence(T *base) {
			T retval = *(volatile T *)base;

			return retval;
		}

		template<typename T> static inline void
		perObjectPreStoreFence(T *base, T newval) {
			__asm__ __volatile__("eieio":::"memory");
			*(volatile T *)base = newval;
		}


