[cpp-threads] Yet another visibility question

Thu Jan 4 08:58:04 GMT 2007

"From: cpp-threads-bounces at decadentplace.org.uk
[mailto:cpp-threads-bounces at decadentplace.org.uk] On Behalf Of Hans Boehm
Sent: Saturday, December 30, 2006 6:12 PM"

[...]
>> // x y initially zero
>>
>> // threads 1 to N:
>>
>> assert( x == 0 );

Okay. Seems like your initializing the variables before you create any
threads. Therefore, you can rely on the implicit (#LoadStore | #StoreStore
)memory barrier in the function that actually creates a thread by making a
call into an "os threading api" (pthread, win, unix, ect...) . Of course
this means that a join would logically have an implicit (#StoreLoad |
#StoreStore).  Fine, that's what I would expect anyway. Thread creation API
does not really need to be tuned wrt memory barriers; thread creation in the
abstract should be thought of as a "fairly expensive" operation anyway...
So, the expensive acquire/release barrier functionality is not a bit hit wrt
performance...

[...]
>> if( fetchadd_release( &y, 1 ) == N - 1 )

Well, you use the term "release"... So of course, I go (#LoadStore |
#StoreStore)... Some algorithms might only need to do a simple #StoreStore
with no dependency hints. Now the (#StoreStore) fast-path, gives us the
ability to create a slow-path which we will defer the calls into the
expensive barriers. The sema/barrier logic needs to ensure that certain
loads and stores are visible to any potential thread that may be interested
issuing some loads on it. The dependencies between the loads and the stores
(e.g., SPARC style #LoadStore and #StoreLoad ordering constraints). For
instance... You always have to think about the possibility of programs that
make use of very weak ordering constraints. Think RCU or vZOOM, Paul E.
McKenney knows what up. RCU/vZOOM can exist on NUMA or even the alpha... So,
weak models are a must.

<pseudo-code>
// atomicword_t is assume unsigned and at least sizeof(void*)
typedef unsigned long atomicword_t;
#define ATOMIC_VOLATILE() volatile
typedef atomicword_t semgate_t;
typedef semgate_t ATOMIC_VOLATILE() *semgate_dest_t;

// assume init to zero.  This has full barrier on the slow-path.
// therefore, your acquire logic will be held.
bool semgate_waittest(semgate_dest_t _this, semgate_dest_t px, atomicword_t
target) {
  bool initx = false;
  semgate_t val;

  do {
    val = atomicword_load_naked(_this);

    // test the fast-path
    if (val + 1 != target &&
        atomicword_cas_store(_this, val, val + 1)) {
      return false;

    // only init one time!
    } else if(! initx) {
      // naked x before the barrier
      atomicword_store_naked(px, 1);

      // okay, we did a local update
      initx = true;
    }

  // we can handle you logic like this; slow-path ;^(...
  } while(atomicword_cas_full(_this, val, val + 1));

  return true;
}

//usage 
#define DEPTH() 4 // e.g., N
static semgate_t x = 0, y = 0;

// do a waittest... Good thing it has a fast-path...
if (semgate_waittest(&y, &x, DEPTH())) {
  // well, the "full" barrier was hit...
}

You can create the algorithm this way, or you can do a trick if you happen
to be using an x86:

http://groups.google.com/group/comp.programming.threads/msg/ca2f1af4552233df
(clever trick)

[...]
>> {
>>     acquire_fence(); // prevent subsequent stores to x from moving before
>> any previous loads
>>     x = 1;
>> }
>> And supporting
>> sufficiently fine-grained fencing is clearly messy.
[...]

I disagree; take a look at the SPARC... IMHO,  its member instruction is
fairly flexible, so it can be done. I would add only one single modifier to
it... That would be a (#LoadDepends) barrier... So I could do...

# the depends would renders the loadload a nop on most systems...
membar #LoadDepends | #LoadLoad 

You do atomicword_load_depends(...) with the previous barrier AFTER the
load, then you can support RCU/vZOOM. So I hope C++ is up to the clallenge.
Fine grain memory barrier functionality is KEY... IMHO of course...

:^)

--
http://appcore.home.comcast.net/
(portable lock-free data-structures)