[cpp-threads] A question about N2153

Chris Thomasson cristom at comcast.net
Thu Jan 18 00:09:01 GMT 2007


[...]

>> load_depends == *#LoadDepends | #LoadLoad

> Ummm...  On all CPUs other than Alpha, you don't need -any- fencing,
> barriers, or special instructions to cause the CPU to respect
> ordering on data dependencies.

I know... Of course load_depends would be a NOP on everything except Alpha.
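
To be concrete, here is a minimal sketch of what such a primitive could
look like. Note that load_depends is a name I am proposing here, not an
existing API, and that Alpha has no LoadLoad-only barrier, so the
fallback there has to be a full mb:

#ifdef __alpha__
/* Alpha does not respect data-dependency ordering, and it has no
   LoadLoad-only barrier, so a full memory barrier is issued. */
#define load_depends()  __asm__ __volatile__ ("mb" : : : "memory")
#else
/* Every other CPU family orders dependent loads automatically. */
#define load_depends()  do { } while (0)
#endif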


>> *- I would like to see this barrier added to the SPARC membar instruction
>> operands...
>>
>> Na! Perhaps something like this:
>>
>> Well, the following language
>> additions would be nice:
>>
>> Okay, a normal load from a shared variable in C:
>>
>> static void *x = 0;
>>
>> int main(void) {
>>  void *l = x; /* naked load */
>>  return 0;
>> }
>>
>> Well, how about adding support for the following logic:
>>
>> static void *x = 0;
>>
>> int main(void) {
>>  void *l:(#LoadStore | #LoadLoad) = x; /* acquire load */
>>
>>  l:(#LoadDepends | #LoadLoad) = x; /* dependent load */
>>
>>  /* here is what a store could look like: */
>>  x:(#LoadStore | #StoreStore) = l; /* store release */
>>  return 0;
>> }
>>
>> Humm...
>
> I am looking for something -considerably- lighter weight for read-side
> access.  This is simplified, and assumes that (1) the compiler does
> not invent writes, (2) the normal loads and stores atomically access
> properly aligned basic data such as pointers, and (3) the CPU respects
> data-dependency ordering.  The first two hold on all 25 CPU families that
> Linux runs on given Linux's use of gcc, and the last one holds on all CPU
> families aside from Alpha, where an smp_rmb() barrier (AKA LoadLoad)
> is required -- except that Alpha doesn't have a LoadLoad, so a full
> memory barrier is used.  (The Linux rcu_dereference() primitive does
> the right thing for whatever CPU is in use.)
>
> The key point is that the read-side code (the foo_present() function
> in this case) uses exactly the same sequence of instructions that
> would be used in a single-threaded implementation, even if the
> list is subject to change.
>
> This sort of technique is used in production in a number of operating
> systems, including Linux.

Right. RCU, RCU+SMR, and vZOOM all depend on implicit #LoadLoad ordering
via data dependencies... As for compiler reordering, section 2 of the
following text describes what I have in mind:

http://appcore.home.comcast.net/vzdoc/atomic/static-init/
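
For reference, the rcu_dereference() primitive Paul mentions boils down
to something like the following simplified sketch (the real definition
lives in the Linux kernel headers; smp_read_barrier_depends() is a full
mb on Alpha and a NOP everywhere else):

#define rcu_dereference(p) ({ \
        typeof(p) _p = (p); \
        smp_read_barrier_depends(); /* mb on Alpha, NOP elsewhere */ \
        (_p); \
})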



> struct foo {
>         struct foo *next;
>         int a;
> };
>
> struct foo *gfp = NULL;
> DEFINE_SPINLOCK(gfp_lock);
>
> /*
>  * Reader -- works on non-Alpha.  Note: identical to single-threaded code.
>  */
>
> int foo_present(int key)
> {
>         struct foo *fp;
>
>         for (fp = gfp; fp != NULL; fp = fp->next) {
>                 if (fp->a == key)
>                         return 1;
>         }
>         return 0;
> }
>
> /* Writer.  Deletion can be easily handled via RCU. */
>
> int insert_foo(int key)
> {
>         struct foo *fp;
>
>         spin_lock(&gfp_lock);
>         if (foo_present(key)) {
>                 spin_unlock(&gfp_lock);
>                 return 0;
>         }
>         if ((fp = malloc(sizeof(*fp))) == NULL) {
>                 spin_unlock(&gfp_lock);
>                 return 0;
>         }
>         fp->a = key;
>         fp->next = gfp;
>         smp_wmb();  /* order node initialization before publication */
>         gfp = fp;   /* publish -- readers may now traverse to fp */
>         spin_unlock(&gfp_lock);
>         return 1;
> }
>
> /* On x86: */
>
> #define wmb()   __asm__ __volatile__ ("": : :"memory")
> #define smp_wmb()       wmb()
>
> /* On POWER: */
>
> #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> #define smp_wmb()       __asm__ __volatile__ ("eieio" : : : "memory")
>
> We might be suffering from differing views of what constitutes low
> overhead.  ;-)

Nope. I know all about read_barrier_depends... I also know that #StoreStore
and data-dependent loads are a must wrt excellent scalability. Here are some
of my thoughts wrt scalability and overall performance:

http://appcore.home.comcast.net/vzoom/round-1.pdf

http://appcore.home.comcast.net/vzoom/round-2.pdf
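
An aside on the SPARC membar footnote quoted up above: SPARC v9 already
spells each ordering constraint out as a membar operand, which is
exactly where a #LoadDepends operand would slot in if it existed. A
sketch of the existing mappings, written as Linux-style macros:

#define smp_rmb()  __asm__ __volatile__ ("membar #LoadLoad"   : : : "memory")
#define smp_wmb()  __asm__ __volatile__ ("membar #StoreStore" : : : "memory")
/* There is no "membar #LoadDepends" today -- hence the wish above. */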


Any thoughts on this? I think we are on the same road wrt synchronization
overheads attributed to expensive memory barriers for the read-side of a
lock-free reader pattern...
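
For what it's worth, here is a sketch of what the reader side would look
like if it also had to run on Alpha -- identical to the single-threaded
code except for the dependent-load barrier after each pointer fetch
(smp_read_barrier_depends is the Linux name for it, a NOP on everything
but Alpha):

int foo_present(int key)
{
        struct foo *fp;

        for (fp = gfp; fp != NULL; fp = fp->next) {
                smp_read_barrier_depends(); /* NOP except on Alpha */
                if (fp->a == key)
                        return 1;
        }
        return 0;
}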


;^)



