[cpp-threads] Re: Increment/decrement operators on atomics package

Mon Apr 30 21:38:06 BST 2007

cpp-threads-bounces at decadentplace.org.uk wrote on 04/27/2007 07:05:37 PM:

> > From:  Raul Silvera
> >
> >
> > Particularly for PPC, SC only requires fences between each
> > pair of memory accesses, while RMW operations require a
> > load-reserve/store-conditional loop.
> >
> My initial mental model was that the cost of RMW operations, even on
> PowerPC, stems almost exclusively from the fences.  I think I was
> corrected at some point, and it was pointed out that this isn't entirely
> true.  But it still seems to me that load-reserve/store-conditional
> should inherently be cheap for the hardware; it's basically just setting
> a processor register on the load-reserve, and checking it on the
> store-conditional, right?

I am more concerned about architectures where RMW operations are
implemented in software by locks than about those that have hardware
support. However, even for hardware implementations, the critical
section being implemented will consume hardware resources, which may be
scarce. Some of these resources are shared between RMW operations that
operate on different memory locations.
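
To make the software case concrete, here is a minimal sketch of what a
lock-based fallback for an RMW might look like. It assumes the
std::atomic and std::mutex interfaces as later standardized in C++11;
LockedCounter and atomic_int_is_lock_free are hypothetical names used
only for illustration, not part of any proposal.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical lock-based emulation of an atomic read-modify-write,
// roughly what a library might fall back to on a target that has no
// hardware RMW support.
class LockedCounter {
public:
    // Adds delta and returns the previous value, all under one lock.
    int fetch_add(int delta) {
        std::lock_guard<std::mutex> guard(lock_);
        int old = value_;
        value_ += delta;
        return old;
    }

    int load() {
        std::lock_guard<std::mutex> guard(lock_);
        return value_;
    }

private:
    std::mutex lock_;
    int value_ = 0;
};

// Reports whether this implementation's atomic<int> avoids locks.
inline bool atomic_int_is_lock_free() {
    std::atomic<int> probe(0);
    return probe.is_lock_free();
}
```

On a target where atomic_int_is_lock_free() returns false, every RMW
would pay for something like the lock acquisition above, which is the
case I am most worried about.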

> Is my model wrong, so that the load-reserve/store-conditional actually
> account for much of the cost?  If so, is there a real reason for that,
> beyond the usual problem that RMW operations are infrequent in standard
> benchmarks?
>
> Presumably the fact that there's a loop involved is uninteresting, since
> it should generally only be executed more than once in the contention
> case, in which the user presumably really meant the update to be atomic.
> And in the other case, the branch in the loop is perfectly predictable.

Running trivial kernel loops on PPC shows an overhead of RMW over regular
updates of 30% (ordered accesses). Measuring relaxed operations, I got an
overhead of ~90%. This is on a very quiet system, though, so it may not
be representative of general usage.
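
For anyone who wants to try a comparison of this kind, the following is
a rough sketch, not the kernel loops measured above. It assumes the
C++11 std::atomic and std::chrono interfaces; plain_updates,
rmw_updates and seconds_for are hypothetical helper names, and only the
relaxed variant is shown.

```cpp
#include <atomic>
#include <chrono>

// "Regular" update: a separate atomic load and atomic store.
// The update as a whole is NOT atomic; another thread could interleave.
inline void plain_updates(std::atomic<long>& x, long n) {
    for (long i = 0; i < n; ++i) {
        long v = x.load(std::memory_order_relaxed);
        x.store(v + 1, std::memory_order_relaxed);
    }
}

// RMW update: a single atomic fetch_add per iteration.
inline void rmw_updates(std::atomic<long>& x, long n) {
    for (long i = 0; i < n; ++i) {
        x.fetch_add(1, std::memory_order_relaxed);
    }
}

// Wall-clock time taken by f(), in seconds.
template <typename F>
double seconds_for(F f) {
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Timing both loops with the same iteration count on a single thread
gives an uncontended comparison; the ratio will depend heavily on the
processor, the compiler and how quiet the machine is.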

In general, my concern is about hiding a lot of behavior inside these
atomics that is not made explicit to the user. I would much rather have
the presence of RMW semantics be made explicit through named member
functions than bury additional semantics inside basic operators.
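
To illustrate the difference in how the two styles read, here is a
small sketch. It assumes the std::atomic interface as later
standardized in C++11, where both forms exist; bump_explicit and
bump_operator are hypothetical names for illustration only.

```cpp
#include <atomic>

// Explicit style: the RMW and its ordering are visible at the call site.
inline int bump_explicit(std::atomic<int>& counter) {
    // fetch_add returns the old value, so add 1 to get the new one.
    return counter.fetch_add(1, std::memory_order_seq_cst) + 1;
}

// Operator style: the same atomic RMW, but it reads like a plain
// increment and gives no hint of its cost or ordering.
inline int bump_operator(std::atomic<int>& counter) {
    return ++counter;
}
```

Both functions perform the same sequentially consistent increment; the
first simply makes that fact visible to whoever reads the code.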

> I'd still like to understand why the RMW is actually appreciably slower.
>
> Hans
>

--
Raúl E. Silvera         IBM Toronto Lab
Team Lead, Toronto Portable Optimizer (TPO)
Tel: 905-413-4188 T/L: 969-4188           Fax: 905-413-4854
D2/KC9/8200/MKM