[cpp-threads] A question about N2153

Paul E. McKenney paulmck at linux.vnet.ibm.com
Wed Jan 17 23:29:19 GMT 2007


On Wed, Jan 17, 2007 at 12:53:15PM -0800, Chris Thomasson wrote:
> 
> >On Wed, Jan 17, 2007 at 11:56:29AM -0800, Chris Thomasson wrote:
> >>----- Original Message -----
> >>From: "Paul E. McKenney" <paulmck at linux.vnet.ibm.com>
> >>To: "C++ threads standardisation" <cpp-threads at decadentplace.org.uk>
> >>Sent: Wednesday, January 17, 2007 11:20 AM
> >>Subject: Re: [cpp-threads] A question about N2153
> >>
> >>
> >>>On Wed, Jan 17, 2007 at 08:44:00PM +0200, Peter Dimov wrote:
> >>>>What is the reference implementation of acquire_fence for x86, SPARC
> >>>>RMO,
> >>>>PowerPC, IA-64? I'm guessing (no op), #LoadLoad | #LoadStore, lwsync,
> >>>>mf.
> >>>
> >>>Stupid side question...  Why acquire_fence and release_fence?  This is
> >>>the only group I have come across that uses these.  Where I come from,
> >>>we use the following types:
> >>
> >>Well, an acquire_"fence" is like this:
> >>
> >>// do your atomic operation here
> >>#StoreLoad | #StoreStore
> >>
> >>And a release_"fence" is like this:
> >>
> >>#LoadStore | #StoreStore
> >>// do your atomic operation here
> >
> >OK.  But I already understood what you guys meant by these two primitives.
> >My question is why choose these particular definitions?  From my
> >viewpoint, the definitions are quite strange, and they don't do what I
> >want in many cases.  I often need a StoreStore without the StoreLoad,
> 
> Yeah... Same here. The acquire/release thing is closely related to the
> barriers that you need when implementing a POSIX mutex.
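
(For concreteness -- a minimal spinlock sketch in C using GCC's __sync
builtins, whose documented acquire semantics on test-and-set and release
semantics on lock-release match this POSIX-mutex sense of the terms; an
illustration, not production code:)

static volatile int lock_word = 0;

static void lock(void)
{
	/* __sync_lock_test_and_set is an acquire barrier: the critical
	   section cannot be reordered above it. */
	while (__sync_lock_test_and_set(&lock_word, 1))
		continue;	/* spin until the lock is free */
}

static void unlock(void)
{
	/* __sync_lock_release is a release barrier: the critical
	   section cannot be reordered below it. */
	__sync_lock_release(&lock_word);
}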

Even that seems a bit off -- the Itanium's definition of acquire and
release would be closer, right?
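
(For reference, Itanium's acquire and release attach to the access itself
rather than standing alone as fences.  In later C11 terms -- a sketch, and
an anachronism relative to this discussion -- the model looks like this;
on IA-64 these map to the ld.acq and st.rel instructions:)

#include <stdatomic.h>

_Atomic int flag = 0;
int data;

void producer(void)
{
	data = 42;
	/* Release store (st.rel): no earlier access may be reordered
	   after this store. */
	atomic_store_explicit(&flag, 1, memory_order_release);
}

int consumer(void)
{
	/* Acquire load (ld.acq): no later access may be reordered
	   before this load. */
	if (atomic_load_explicit(&flag, memory_order_acquire))
		return data;	/* guaranteed to be 42 once flag is set */
	return -1;
}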

> >and
> >I often need to enforce data dependencies.  This latter especially is not
> >supported well by the cpp-threads definitions, as most CPUs don't need any
> >barriers in that case (just disable compiler code-motion optimizations).
> >
> >Thoughts?
> 
> Something like this might be better:
> 
> load_acquire == #LoadStore | #LoadLoad
> load_depends == *#LoadDepends | #LoadLoad
> load_ordered == #LoadStore
> load_relaxed == #LoadLoad
> store_acquire == #StoreLoad | #StoreStore
> store_release == #LoadStore | #StoreStore
> store_ordered == #StoreLoad
> store_relaxed == #StoreStore
> full_prolog == store_release
> full_epilog == store_ordered
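
(As SPARC V9 membar wrappers, a few of the above might look as follows --
a sketch assuming GCC inline asm; the names come from the proposal above,
not from any existing API, and #LoadDepends has no membar encoding:)

#define load_acquire_fence() \
	__asm__ __volatile__ ("membar #LoadLoad | #LoadStore" : : : "memory")
#define store_release_fence() \
	__asm__ __volatile__ ("membar #LoadStore | #StoreStore" : : : "memory")
#define store_ordered_fence() \
	__asm__ __volatile__ ("membar #StoreLoad" : : : "memory")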

Ummm...  On all CPUs other than Alpha, you don't need -any- fencing,
barriers, or special instructions to cause the CPU to respect
ordering on data dependencies.

> *- I would like to see this barrier added to the SPARC membar instruction
> operands.
> 
> Nah! Perhaps something like this instead -- the following language
> additions would be nice:
> 
> Okay, a normal load from a shared variable in C:
> 
> static void *x = 0;
> 
> int main(void) {
>  void *l = x; /* naked load */
>  return 0;
> }
> 
> Well, how about adding support for the following logic:
> 
> static void *x = 0;
> 
> int main(void) {
>  void *l:(#LoadStore | #LoadLoad) = x; /* acquire load */
> 
>  l:(#LoadDepends | #LoadLoad) = x; /* dependent load */
> 
>  /* here is what a store could look like: */
>  x:(#LoadStore | #StoreStore) = l; /* store release */
>  return 0;
> }
> 
> Humm...
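
(Short of new syntax, much the same annotations can be approximated with
macros -- a sketch with hypothetical names, assuming GCC statement
expressions and SPARC V9 membar:)

#define LOAD_ACQUIRE(var) ({ \
	__typeof__(var) _v = (var);	/* load first... */ \
	__asm__ __volatile__ ("membar #LoadLoad | #LoadStore" : : : "memory"); \
	_v;				/* ...then fence, then yield the value */ \
})

#define STORE_RELEASE(var, val) do { \
	__asm__ __volatile__ ("membar #LoadStore | #StoreStore" : : : "memory"); \
	(var) = (val);			/* fence first, then store */ \
} while (0)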

I am looking for something -considerably- lighter weight for read-side
access.  This is simplified, and assumes that (1) the compiler does
not invent writes, (2) the normal loads and stores atomically access
properly aligned basic data such as pointers, and (3) the CPU respects
data-dependency ordering.  The first two hold on all 25 CPU families that
Linux runs on, given Linux's use of gcc, and the last one holds on all CPU
families aside from Alpha, where an smp_rmb() barrier (AKA LoadLoad)
is required -- except that Alpha doesn't have a LoadLoad, so a full
memory barrier is used.  (The Linux rcu_dereference() primitive does
the right thing for whatever CPU is in use.)
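
(Roughly, rcu_dereference() is just the plain load plus a dependency
barrier that compiles to nothing except on Alpha -- a sketch along the
lines of the kernel's definition:)

#define rcu_dereference(p) ({ \
	typeof(p) _p1 = (p); \
	smp_read_barrier_depends();	/* no-op except on Alpha */ \
	(_p1); \
})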

The key point is that the read-side code (the foo_present() function
in this case) uses exactly the same sequence of instructions that
would be used in a single-threaded implementation, even if the
list is subject to change.

This sort of technique is used in production in a number of operating
systems, including Linux.

struct foo {
	struct foo *next;
	int a;
};

struct foo *gfp = NULL;
DEFINE_SPINLOCK(gfp_lock);

/*
 * Reader -- works on non-Alpha.  Note: identical to single-threaded code.
 */

int foo_present(int key)
{
	struct foo *fp;

	for (fp = gfp; fp != NULL; fp = fp->next) {
		if (fp->a == key)
			return 1;
	}
	return 0;
}

/* Writer.  Deletion can be easily handled via RCU. */

int insert_foo(int key)
{
	struct foo *fp;

	spin_lock(&gfp_lock);
	if (foo_present(key)) {
		spin_unlock(&gfp_lock);
		return 0;
	}
	if ((fp = malloc(sizeof(*fp))) == NULL) {
		spin_unlock(&gfp_lock);
		return 0;
	}
	fp->a = key;
	fp->next = gfp;
	smp_wmb();	/* order initialization before publication */
	gfp = fp;	/* publish: readers may now see the new element */
	spin_unlock(&gfp_lock);
	return 1;
}

/* On x86 -- stores are not reordered with other stores, so a compiler
   barrier suffices: */

#define wmb()   __asm__ __volatile__ ("": : :"memory")
#define smp_wmb()       wmb()

/* On POWER -- sync is a full memory barrier; eieio is enough to order
   stores to ordinary cacheable memory: */

#define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
#define smp_wmb()       __asm__ __volatile__ ("eieio" : : : "memory")
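
For the dependency-ordering case mentioned earlier, the same pattern
applies (the Alpha definition below follows the Linux kernel's; everywhere
else the barrier degenerates to nothing):

/* On Alpha: */

#define read_barrier_depends()	__asm__ __volatile__ ("mb" : : : "memory")

/* Everywhere else: */

#define read_barrier_depends()	do { } while (0)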

We might be suffering from differing views of what constitutes low
overhead.  ;-)

							Thanx, Paul


