[cpp-threads] Update

Mon Jul 10 21:52:50 BST 2006

> -----Original Message-----
> From: cpp-threads-bounces at decadentplace.org.uk 
> [mailto:cpp-threads-bounces at decadentplace.org.uk] On Behalf 
> Of Peter Dimov
> Sent: Monday, July 03, 2006 1:51 AM
> To: C++ threads standardisation
> Subject: Re: [cpp-threads] Update
> 
> Boehm, Hans wrote:
> 
> > I made another pass over the atomic operations proposals that were 
> > previously discussed here, and turned them into N2047
> > (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html).
> 
> I like the new direction the atomic interface is taking. Now 
> we just need to drop the "high-level" part entirely and work 
> out the remaining issues in the low-level part. :-)
The high level interface does seem to buy you some useful properties:

1) Things like double-checked locking work with just a declaration
change, as they do in Java.  My impression is that some people perceive
this as close to a requirement.

2) It can be applied directly to structs of bit-fields and the like.

3) It allows parameterization with respect to ordering constraints,
which can reduce the amount of library code required.
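
To illustrate point 1, here is a minimal sketch of double-checked
locking in which the only difference from the broken plain-pointer
version is the declaration of the shared pointer. It uses std::atomic
from later standard C++ as a stand-in for the high-level interface; the
spelling in N2047 may well differ, and Widget, instance and
get_instance are hypothetical names chosen for the example.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical payload type, for illustration only.
struct Widget {
    int value = 42;
};

// With a plain `Widget* instance` this pattern is a data race. The only
// change made here is the declaration: the default (sequentially
// consistent) load and store then order the construction of the Widget
// before the publication of the pointer.
std::atomic<Widget*> instance{nullptr};
std::mutex instance_mutex;

Widget* get_instance() {
    Widget* p = instance;                  // first check, no lock taken
    if (p == nullptr) {
        std::lock_guard<std::mutex> guard(instance_mutex);
        p = instance;                      // second check, under the lock
        if (p == nullptr) {
            p = new Widget;                // intentionally never freed here
            instance = p;                  // publish the constructed object
        }
    }
    return p;
}
```

Every call to get_instance() after the first returns the same pointer
without taking the lock.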

> 
> * The main problem I see in the low-level interface is that 
> it has too many feature tests that basically support dead or 
> dying platforms. Sure, it grants us the ability to say that 
> we "support" nearly every platform, but what are the 
> practical benefits of doing so? From a practical point of 
> view, a platform either has atomics or it doesn't. One 
> difference that matters in practice is whether it has 
> double-width atomics, but the current interface doesn't 
> provide access to them.
I don't know whether platforms that don't support CAS will still be
important.  I'm concerned about both embedded processors, which I think
are sometimes weak there, and about older processor designs like PA-RISC
and SPARC V8.  Those generally still give you atomic loads and stores,
and those can be quite useful, even without CAS.
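
As one example of what atomic loads and stores alone can buy, here is a
rough sketch of a single-producer, single-consumer queue that never
needs CAS. It is written against std::atomic from later standard C++
rather than the N2047 names, and SpscQueue is a hypothetical class, not
something from the proposal.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical fixed-size queue for exactly one producer thread and one
// consumer thread. It uses only atomic loads and stores, never CAS.
// One slot is kept empty, so it holds at most N - 1 elements.
template <class T, std::size_t N>
class SpscQueue {
public:
    // Called by the producer only. Returns false if the queue is full.
    bool push(const T& v) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    // Called by the consumer only. Returns false if the queue is empty.
    bool pop(T& out) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // written by the consumer only
    std::atomic<std::size_t> tail_{0};  // written by the producer only
};
```

This works only because each index has a single writer; with several
producers or several consumers, something like CAS is needed again.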

The double-width versions were there in the immediately preceding
version, and I can paste them back in.  The main difficulty is that
there are at least three different variants of some importance:

- 2 separate addresses, as in some M68K processors.  Those processors
probably aren't that interesting anymore, but with all the discussion of
transactional memory, and the comparative difficulty of general
implementations, I wonder whether something like this may still be
resurrected.
- double-width CAS.  (Intel X86-64, not in early AMD parts, as I recall.
Also X86-32, but that may be less interesting by 2010, I would hope.)
- single-width compare, double-width swap.  This is the Itanium 64-bit
solution.  It has the nice property that it suffices for most
algorithms, and is presumably simpler to implement than the previous
variant.

And then there are various alignment constraints.

So I think that at the moment any attempt to standardize this would
produce something fairly messy to use and/or generate opposition from
some vendors (and risk guessing wrong).  Thus it's unclear to me whether
we should try to standardize something now.

> 
> * atomic_exchange (or _swap, if you prefer) is missing.
It should probably be added, at least to the low-level interface.
> 
> * I prefer _none, _acq(uire), _rel(ease) and _full for 
> suffixes, but that's not particularly important. Slightly 
> more important is whether the unordered load/store need a 
> suffix (and whether they need _full variants). I think that they do.
I'm neutral about "full" vs "ordered".  The "raw" vs. "none" issue was
discussed in Berlin, and I think there was unanimous agreement on "raw",
or more accurately against "none".  The problem is that "none" sounds
much too benign for something that in most cases yields very surprising
and incorrect results.

The low-level primitives in that paper contained a typo: load and store
should also use the "_raw" suffix for their unordered versions.

Based on a recent discussion with Herb Sutter, there should probably be
an atomic_store_full.  If all variables are atomic, and accessed with
atomic_store_full and atomic_load_acquire, I should get sequentially
consistent semantics, which would be a nice property.

> 
> * This is how I think atomic_cas needs to be specified:
> 
> bool atomic_cas( T * addr, T * oldv, T newv );
> 
> Effects: if *addr contains the value *oldv, atomically 
> updates *addr to contain newv and returns true, otherwise 
> stores *addr in *oldv and returns false. When T is not a 
> built-in type, performs a bitwise comparison.
Clearly adding documentation is good.

I have mixed feelings about making oldv a pointer.  I don't have a good
sense of how well compilers register-allocate variables whose address
has been taken.  I would guess that if this is implemented as in-line
assembly code, they won't.  I suspect that at least initially this
version will result in slower implementations, but I don't currently
have any data.

> 
> This formulation supersedes both strong CAS and weak CAS, as
> it's allowed to fail. A strong CAS is easily built on top of 
> it by using a retry loop.
You're assuming the feature test for whether this is wait-free will
yield false if it can fail spuriously?  Otherwise you can't build
guaranteed wait-free algorithms?  Or do you not care?  (This is a
serious question:  I'm not sure if I do.)

Java JSR166 has both variants, though they're tied to different ordering
constraints.  And I don't think there is much certainty that that was
the right decision.
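
For concreteness, here is a sketch of the construction under
discussion: a CAS that is allowed to fail spuriously, with a strong CAS
layered on top by a retry loop. It assumes std::atomic from later
standard C++, whose compare_exchange_weak has the same "update the
expected value on failure" behaviour as the proposed specification.
atomic_cas and strong_cas are hypothetical wrappers, not names from
N2047.

```cpp
#include <atomic>

// Hypothetical wrapper with the signature proposed above. It may fail
// spuriously; on any failure it stores the observed value in *oldv.
template <class T>
bool atomic_cas(std::atomic<T>* addr, T* oldv, T newv) {
    return addr->compare_exchange_weak(*oldv, newv);
}

// Strong CAS built with a retry loop: fails only on a real mismatch.
// Written for built-in types, where != is a value comparison.
template <class T>
bool strong_cas(std::atomic<T>* addr, T* oldv, T newv) {
    const T expected = *oldv;
    for (;;) {
        if (atomic_cas(addr, oldv, newv)) return true;
        if (*oldv != expected) return false;  // genuine mismatch
        // Spurious failure: *oldv still equals expected, so retry.
    }
}
```

Note that the loop in strong_cas has no bound on the number of
iterations, which is exactly the wait-freedom concern raised above.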

> 
> Example use:
> 
>   {
>       long st = atomic_load_none( &state_ );
> 
>       for( ;; )
>       {
>           if( st & state_w_mask ) break;
>           if( atomic_compare_exchange_acq( &state_, &st, st + 1 ) ) return; // got read lock
>       }
>   }
> 
> * I'd really like to see 
> atomic_decrement_<order1,nonzero>_<order2,zero>, or at least 
> atomic_decrement_release_acquire. This primitive is essential 
> for correct reference counting. The closest the current 
> proposal has is atomic_fetch_add_ordered. It may be argued 
> that it's close enough for all contemporary platforms.
I think the more precise version would be a tough sell.  My impression
is that there is already a lot of concern that the current version is
too complicated.
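
For readers who want to see what is being asked for, here is one way
the release-then-acquire-on-zero decrement might be approximated. This
is only a sketch under assumptions: it uses std::atomic and
std::atomic_thread_fence from later standard C++, not the N2047
primitives, and RefCounted, add_ref and release are hypothetical names.

```cpp
#include <atomic>

// Hypothetical intrusive reference count, for illustration only.
struct RefCounted {
    std::atomic<long> refs{1};
};

inline void add_ref(RefCounted* p) {
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

// Returns true if the caller dropped the last reference and may
// destroy *p.
inline bool release(RefCounted* p) {
    // Release ordering on every decrement publishes this thread's
    // writes to the object.
    if (p->refs.fetch_sub(1, std::memory_order_release) == 1) {
        // Acquire ordering only on the path that reaches zero.
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}
```

Using a fully ordered decrement on every call instead, in the spirit of
atomic_fetch_add_ordered, is simpler to specify and is never weaker
than this.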

> 
> As an aside, does anyone know "the suffix" of the Solaris 
> atomics, in other words, their memory synchronization properties?
> 
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/sys/atomic.h
> http://cvs.opensolaris.org/source/xref/on/usr/src/common/atomic/
> 
> http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr32?a=view
I would guess it's what the hardware gives you, which I think depends on
the processor mode.  For TSO, I think you get ordered semantics.

Hans