[cpp-threads] A question about N2153

Paul E. McKenney paulmck at linux.vnet.ibm.com
Sat Jan 20 20:12:40 GMT 2007


On Sat, Jan 20, 2007 at 12:22:16AM -0800, Chris Thomasson wrote:
> >On Wed, Jan 17, 2007 at 09:00:02PM -0800, Chris Thomasson wrote:
> >>----- Original Message -----
> >>From: "Chris Thomasson" <cristom at comcast.net>
> >>To: "C++ threads standardisation" <cpp-threads at decadentplace.org.uk>;
> >><paulmck at linux.vnet.ibm.com>
> >>Sent: Wednesday, January 17, 2007 4:09 PM
> >>Subject: Re: [cpp-threads] A question about N2153
> 
> [...]
> >>>>>load_depends == *#LoadDepends | #LoadLoad
> >>>
> >>>>Ummm...  On all CPUs other than Alpha, you don't need -any- fencing,
> 
> [...]
> >>>>ordering on data dependencies.
> >>>I know... Of course load_depends would be a NOP on everything except
> >>>Alpha.
> >>Okay... Let me just sum up how I would like the new and improved version
> >>of C++, or whatever...
> >>To do RCU, well, you can do the barriers like this:
> >><pseudo c++ code>
> 
> [...]
> >So "n:(#StoreStore)->next = gs:(#Naked).front" is the same as
> >"n->next = gs.front; smp_wmb()"?
> 
> Yes. That is:
> 
> n->next = gs.front;
> membar #StoreStore;

OK, good.
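
For reference, a minimal sketch of the full publish sequence in
Linux-kernel style, keeping the 'gs' and 'n' names from your pseudo-code:

	n->next = gs.front;	/* initialize the new element */
	smp_wmb();		/* order initialization before publication */
	gs.front = n;		/* publish the element */

Here smp_wmb() plays the role of the SPARC membar #StoreStore.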

> >In the Linux kernel, we use rcu_assign_pointer(), which is a cpp macro
> >defined in terms of the architecture-dependent smp_wmb().  So, if I
> >understand the above code, in the Linux kernel, one would have the
> >following for the last two assignments:
> >
> >n->next = gs.front;
> >rcu_assign_pointer(gs.front, n);
> 
> Yes, well, as long as 'rcu_assign_pointer(...)' executes a #StoreStore
> barrier BEFORE it touches gs.front... Does rcu_assign_pointer only execute a
> #StoreStore? It does not add a #LoadStore? Well, then IMHO, perhaps you
> should have two variants:
> 
> rcu_assign_pointer_mb_storestore(...)
> 
> And:
> 
> rcu_assign_pointer_mb_loadstore_storestore(...)
> 
> ?

rcu_assign_pointer() does indeed place the memory barrier before the
assignment.  But it is only a StoreStore.
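
In simplified form (the real kernel macro carries a bit more decoration),
it boils down to something like:

	#define rcu_assign_pointer(p, v) \
		({ \
			smp_wmb();	/* StoreStore: prior stores first... */ \
			(p) = (v);	/* ...then make the pointer visible */ \
		})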

Why does the Linux kernel need a LoadStore|StoreStore variant?
As far as I know, all the other CPUs care about is seeing the
pointer assignment -after- the initialization of the pointed-to
data structure.

> >We used to use explicit memory barriers, but found that the above was
> >much easier for people to get right.
> 
> ;^). Well, I have to admit that I also like to wrap up the membar in the
> actual load or store function... Well, take a look at the last 10 or so
> function names/implementations in this code:
> 
> http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html

Even better when one can wrap it into higher-level primitives,
for example, in the list_for_each_entry_rcu() macro that traverses
RCU-protected linked lists.  This approach allows the kernel hacker to get
correct code produced with extreme economy of detailed CPU-architecture
knowledge.  ;-)
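
A reader then looks something like the following sketch; the structure,
list-head, and function names here are made up for illustration:

	struct foo {
		struct list_head list;
		int data;
	};

	struct list_head foo_list;	/* RCU-protected list */
	struct foo *p;

	rcu_read_lock();
	list_for_each_entry_rcu(p, &foo_list, list)
		do_something_with(p->data);	/* no blocking in here */
	rcu_read_unlock();

No explicit memory barrier -- and no explicit rcu_dereference() -- appears
in the caller; the macro handles the dependency ordering internally.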

> x86 has coarse membar granularity, so I can postfix the function names with
> 'fence' or 'naked'... Not too functional; however, I abstract it into a much
> more granular API here, near the bottom of the following include file for
> the above assembler code:
> 
> http://appcore.home.comcast.net/appcore/include/cpu/i686/ac_i686_h.html
> 
> 'fence'
> 'acquire'
> 'release'
> 'depends'
> 'naked'
> 
> So, for my AppCore API, to do an RCU reader, well, you do this:
> 
> 
> * please note that rcu_read_lock/unlock are not needed in a user-space RCU
> implementation... Pre-emption can be addressed several ways... anyway...

For one example, see:

	http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf

for descriptions of variants of rcu_read_lock() and rcu_read_unlock()
that permit preemption.
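
The read-side entry and exit in such schemes remain very cheap.  As a
caricature only (not the actual algorithm from the paper -- the real thing
also has to track which grace period each reader belongs to), think of
something along these lines:

	/* made-up names; compiler barrier only, no hardware fence */
	#define barrier() __asm__ __volatile__("" ::: "memory")

	static __thread int my_rcu_nesting;

	static inline void rcu_read_lock(void)
	{
		my_rcu_nesting++;
		barrier();
	}

	static inline void rcu_read_unlock(void)
	{
		barrier();
		my_rcu_nesting--;
	}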

Believe me, once you end up with a couple hundred RCU read-side critical
sections in your program, you are going to -really- be hurting for
rcu_read_lock() and rcu_read_unlock() -- just to keep track of where
the critical sections are!  I suppose that one might be able to
reverse-engineer them from rcu_dereference(), but this can be quite
painful when you have an RCU read-side critical section spread over
five or ten levels of function call.  ;-)
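
For example (a sketch with invented names), the read-side critical section
below is clearly delimited even though every rcu_dereference() is buried
in the callees:

	struct entry;				/* details irrelevant here */
	struct entry *lookup(int key);		/* does rcu_dereference() inside */
	void print_entry(struct entry *e);	/* may chase further RCU links */

	void show_route(int key)
	{
		struct entry *e;

		rcu_read_lock();
		e = lookup(key);
		if (e)
			print_entry(e);
		rcu_read_unlock();
	}

Without the two markers, you would be grepping for rcu_dereference() through
every function that lookup() and print_entry() might call.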

> void reader_thread(...) {
>  node *n = ac_mb_loadptr_depends(&gs.front);
>  while (n) {
>    node *nx = ac_mb_loadptr_depends(&n->next);
>    n->const_function(...);
>    n = nx;
>  }
> }

I still like rcu_dereference() better -- much better at giving the poor
slob reading the code some idea as to what is going on.  Admittedly,
this criterion applies more to open-source software than to other types,
but it can still be quite valuable should you be required to get back
into some code you wrote a couple decades prior.  ;-)
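
For comparison, a sketch of the same loop using the Linux-kernel
primitives, keeping your 'gs' and 'node' names:

	void reader_thread(...) {
		rcu_read_lock();
		node *n = rcu_dereference(gs.front);
		while (n) {
			node *nx = rcu_dereference(n->next);
			n->const_function(...);
			n = nx;
		}
		rcu_read_unlock();
	}

Same machine code on everything but Alpha, but the intent -- "this is an
RCU-protected pointer fetch" -- is right there in the name.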

> >But see below.
> [...]
> 
> >In the Linux kernel, one would do something like the following:
> [...]
> 
> >The rcu_dereference() macro is defined in terms of the
> >architecture-dependent smp_read_barrier_depends() primitive.
> >Again, we used to use explicit memory barriers, but found that the
> >above was much easier for people to get right -- and much easier
> >to build tools to check for correct usage (see Josh Triplett's
> >RCU additions to Linux's "sparse" checker).
> >
> >So, am I advocating hiding memory barriers completely?  No way!!!
> 
> :^)
> 
> >People building things like RCU infrastructure and many other things
> >need explicit memory barriers in order to get their job done.  However,
> >if such people are wise, they will define a clean API that does not
> >expose explicit memory barriers to their users.
> 
> That's what I did with AppCore:
> 
> http://appcore.home.comcast.net/
> 
> http://appcore.home.comcast.net/ac_src_index.html
> 
> Far from perfect, but at least it does abstract the barriers away 'fairly'
> well...

Sounds good!

> >>so, the reader-side has exactly 0 memory barriers on every current system
> >>out there except the Alpha.
> >
> >Very good!
> 
> Very good indeed!   ;^)

;-)

> >>                            Also, it's weak enough to express just a normal
> >>#StoreStore inside the writer's critical section that is guarded by the
> >>stack object's associated mutex... I would kind of like it if C++ would
> >>copy from the SPARC model... Just my humble opinion, of course...
> >
> >I must confess ignorance of your history, but if you like SPARC, you
> >like SPARC.
> 
> Yeah. I am biased toward the SPARC... Well, its membar instruction is so
> versatile you can realize highly granular memory barrier operations with
> it... That's a plus in my book... Oh well...
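
As a reminder of what that granularity looks like (SPARC V9 syntax, from
memory, so treat it as a sketch):

	membar #LoadLoad			! read barrier
	membar #StoreStore			! write barrier
	membar #LoadLoad | #LoadStore		! roughly an acquire
	membar #LoadStore | #StoreStore		! roughly a release
	membar #StoreLoad			! the expensive one

One instruction, any combination of orderings you care to ask for.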

You have to keep in mind that I distinctly remember the 32-bit CPU wars
between the 68000, the NS32032, and the Z80000.  The 80286 beat all
three, despite its being much reviled in almost all quarters, to say
nothing of its not even being a true 32-bit machine.  ;-)

> >The Linux kernel follows DEC Alpha, but adds smp_rmb(),
> >smp_read_barrier_depends(), and so on.
> 
> So, code that makes use of such primitives on Linux can be considered
> fairly portable, or what? IMHO, I would fully expect the APIs in question to
> be classified as so-called 'system-level', aka subject to possible
> modification? I must admit that when I am on Linux, I don't make direct use
> of what I consider to be system-level APIs... So, raw access to futexes,
> atomic_xxx, and rcu_xxx APIs are something I avoid... Instead, I define a
> target architecture, create the supporting assembly language for my AppCore
> Library, and make use of my own APIs for, let's say, lock-free
> programming... It eases my problem with paranoia... You know, I use a
> system-level API, and crap, a service pack changed something... Now, my apps
> are rendered useless on the 'new' stuff...

The Linux-kernel memory-ordering primitives such as smp_rmb() can indeed
be used in code common to the 20+ CPU families that Linux runs on.
Each CPU type must supply its own definition of these primitives.
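
In simplified form (the real headers carry more decoration), the
per-architecture pieces look roughly like this:

	/* alpha: data-dependent loads really can be reordered */
	#define smp_read_barrier_depends()	mb()

	/* i386, x86_64, powerpc, sparc, ...: dependency ordering is free */
	#define smp_read_barrier_depends()	do { } while (0)

and rcu_dereference() is then the same everywhere:

	#define rcu_dereference(p) \
		({ \
			typeof(p) _p1 = (p); \
			smp_read_barrier_depends(); \
			(_p1); \
		})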

Portability is a key concern in the Linux community!

That said, these primitives are intended for use in the kernel, though
I do sometimes cut-and-paste the definitions into GPL-licensed user-mode
code.  And I doubt that I am the only one...

> :O

						;-), Paul


