[cpp-threads] Alternatives to SC

Sun Jan 14 16:47:15 GMT 2007

Peter Dimov wrote:

>
> Doug Lea wrote:
>
>> Atomics can be used to create SC constructions like locks.
>> But atomics can also be used in other ways that are not necessarily
>> SC. As it turns out, all usages on cache-coherent causally consistent
>> machines are at least nearly SC.
>
>
> Isn't it true that most usages are PC? The one exception that I know 
> of is hazard pointers. Do we really need the #StoreLoads? :-)
>
> Speaking of hazard pointers... as an aside, the two main problems with 
> turning an academic paper into a real C++ implementations are that (1) 
> most papers assume SC when giving the pseudocode, (2) most papers 
> either assume garbage collection, or don't mention the issue of memory 
> management at all. Even given SC atomics, the second problem remains. 
> Yet the discussion doesn't mention it at all. Do the proponents of "SC 
> atomics for the masses" also assume that they will only be usable in a 
> GC environment? :-) I simply don't see how a non-expert can reasonably 
> handle the memory management of a typical lock-free container, no 
> matter how sequentially consistent the atomics are.
>
Hazard pointers are a form of PDR which is a term I've been trying to 
coin.  It stands for PCOW
(Partial Copy On Write) Deferred Reclamation to refer to the memory 
management schemes
necessary for lock-free algorithms, though some will argue that 
lock-free itself is a debased
concurrency.

But there are multiple levels of usage.  One is the implementation of 
the PDR scheme.  The next level
would be the implementation of a lock-free data structure using PDR.  
And the next one after that
would be the using the api for that lock-free data structure if the 
application programmers are up
to that.  Some programmers aren't even up to using a lock based data 
structure so you could argue
as many do that we shouldn't have multi-threading at all.

There are lock-free containers already in public use, e.g 
java.util.atomic.ConcurrentLinkedList in Java.
Somebody had to implement a thread-safe PDR, i.e GC, and someone else 
has to implement the
ConcurrentLinkedList class.  And presumably there are at least some 
programmers who can deal
with the concurrency semantics of this container.

Speaking of hazard pointers, you can do them without that #StoreLoad 
membar and the release
membar required for the old copied over value of  the hazard pointer if 
it's anything other than
the null object which you shouldn't be trying to access anyway even if 
the null object is never
deallocated.  I was going to bring this up on the subject of composable 
atomics, i.e. whether
you can come up with a set of atomics that will allow you to build any 
higher level synchronization
mechanism that you can think of.

This is the gcc x86 hazard pointer load implementation which uses a 
modified RCU algorithm
to eliminate the need for memory barriers.  It runs effectively at the 
same speed as a normal
pointer load, about 10x faster than a version w/ memory barriers.

__attribute__((always_inline))
static __inline__ void  *  smrload32(void **hptr, void **src) {
    void *    ret;

    __asm__ __volatile__ (
    "mov    0(%2), %%ecx   ;\n"        // load source pointer
"1:\t"
    "mov    %%ecx, %0      ;\n"
    "mov    %%ecx, 0(%1)   ;\n"        // store into hazard pointer
    "mov    0(%2), %%ecx   ;\n"        // reload source pointer
    "cmp    %%ecx, %0      ;\n"
    "jne    1b             ;\n"
    "mov    %%ecx, 4(%1)   ;\n"        // store into hazard pointer[1]
    : "=&r" (ret)
    : "r" (hptr), "r" (src)
    : "cc", "memory", "ecx"
    );
    return ret;
}

There's a second internal hazard pointer to make make data structure 
traversal
safe, i.e. prevent nodes from being deallocated out from under you, as 
it doesn't
allow explicit copying of hazard pointers.  It's in that spot because 
that's the
most efficient place to put it as it's still in a register at that point.

The point here is the hardware is allowed to reorder memory accesses.  
In fact
you want it to do it to improve performance.  But you don't want the 
compiler to
reorder anything across the hazard pointer load logic.  I suppose you 
could do
it with unordered atomic ops with compiler pragmas to prevent the compiler
from reordering things but it's not clear pragmas like that will be in 
the atomics
api.  Can you create an atomics api that will anticipate all possible 
usages of it?
If you can create an atomics api that will anticipate all possible 
future hardware
memory models, it shouldn't be that hard I guess.

--
Joe Seigh