Possible language changes

Doug Lea dl at cs.oswego.edu
Wed Mar 2 16:31:37 GMT 2005


Maged Michael wrote:
> 
> I am not sure how successful compilers can be in removing unnecessary 
> barriers. Will they be able to generate code with no more barriers than 
> ideal assembly.

Of course not. No optimizer can guarantee to optimize any kind of code
down to ideal assembly. But they can usually do much better than most
programmers. There haven't been any PLDI-ish papers exploring barrier
optimizations yet, but you can figure there will be. For example,
HotSpot only does StoreLoad-squashing within basic blocks (which
gets most of them in practice), because I didn't know how to do this
as a full dataflow analysis. Hopefully some of the people I've
suggested pursue this will do so.

In any case, as I said, if you do need full control, the atomics
classes should give you everything you need. And all of the
responsibilities for getting it right!

> 
> About DCL, I quote the following from the JSR 133 FAQ:
> http://www.cs.umd.edu/users/pugh/java/memoryModel/jsr-133-faq.html#dcl
> 
> "However, for fans of double-checked locking (and we really hope there 
> are none left), the news is still not good. The whole point of 
> double-checked locking was to avoid the performance overhead of 
> synchronization. Not only has brief synchronization gotten a LOT less 
> expensive since the Java 1.0 days, but under the new memory model, the 
> performance cost of using volatile goes up, almost to the level of the 
> cost of synchronization. So there's /still /no good reason to use 
> double-checked-locking."
> 
> I don't know if this is an accurate assessment of the cost of volatile 
> or not after taking compiler optimizations (removing barriers) into 
> account. In any case, my opinion is that C++ should allow at least 
> simple things like DCL to be as efficient as being written in assembly. 
> Do we agree on that?
> 

I probably ought to ask Jeremy and Brian to rewrite that.
Here's what really happens for

class Singleton {
    private static volatile Singleton instance;
    public static Singleton get() {
        Singleton s = instance;
        return (s != null) ? s : init();
    }
    private static Singleton init() {
        // Use either locks or CAS ...
    }
}

Here, the volatile read in get() costs only the compiler reorderings
it rules out, compared to a non-volatile read; and ruling those
reorderings out is essential here no matter how you implement this.

If you do the initialization with a CAS rather than a lock,
then you pay at least one CAS, maybe more if you need to retry.
(The CAS approach in general only works when initialization has
no side effects.)
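To make the CAS variant concrete, here is a minimal sketch (class and
field names are mine, not from the original code above; AtomicReference
stands in for a raw volatile field plus hand-written CAS). It is safe
only because the constructor has no side effects: a thread that loses
the race just discards its instance.

```java
import java.util.concurrent.atomic.AtomicReference;

class CasSingleton {
    // Holder supporting both the volatile read and the CAS.
    private static final AtomicReference<CasSingleton> INSTANCE =
        new AtomicReference<>();

    private CasSingleton() {} // no side effects, so racing is harmless

    public static CasSingleton get() {
        CasSingleton s = INSTANCE.get(); // cheap volatile read, fast path
        return (s != null) ? s : init();
    }

    private static CasSingleton init() {
        CasSingleton created = new CasSingleton();
        // Publish only if still null; if the CAS fails, another
        // thread won the race and we return its instance instead.
        if (INSTANCE.compareAndSet(null, created)) {
            return created;
        }
        return INSTANCE.get();
    }
}
```

Every caller after the first pays only the volatile read; at most the
initializing threads pay a CAS.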

If you do it with a lock, you pay a CAS for the lock acquire, plus
the volatile write (generating a StoreLoad barrier), plus the unlock,
which normally requires another CAS or barrier. But the StoreLoad
associated with a volatile write immediately followed by either a CAS
or a barrier is one of the "easy" cases, so the JVM will always elide it.
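The lock-based slow path can be sketched as follows (again with
illustrative names; this is one plausible body for the init() stub
above, not the exact code being discussed). The second null check
under the lock is what makes this "double-checked", and the volatile
write is what publishes the instance safely under the JMM.

```java
class LockSingleton {
    private static volatile LockSingleton instance;
    private static final Object lock = new Object();

    public static LockSingleton get() {
        LockSingleton s = instance; // fast path: one volatile read
        return (s != null) ? s : init();
    }

    private static LockSingleton init() {
        synchronized (lock) {       // lock acquire (typically a CAS)
            if (instance == null) { // re-check now that we hold the lock
                instance = new LockSingleton(); // volatile write publishes
            }
            return instance;
        }
    }
}
```

Only threads that race through init() ever touch the lock; once
instance is non-null, get() never synchronizes again.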

The net effect is as fast as I know how to do this in assembler.
Compared to full locking, DCL saves you two expensive instructions
(CAS or StoreLoad) per call to get(). So it will be noticeably cheaper
than using locks if this code is exercised much. (Still, there are
many cases in Java where approaches like the dynamically loaded
static trick are faster.)
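If the "dynamically loaded static trick" here is the
initialization-on-demand holder idiom (my reading; the names below are
illustrative), it looks roughly like this: the JVM's own class
initialization locking does all the work, so after the first call the
fast path has no volatile read at all.

```java
class HolderSingleton {
    private HolderSingleton() {}

    // The nested class is not initialized until first use, and class
    // initialization is guaranteed to run at most once, under a lock
    // the JVM manages.
    private static class Holder {
        static final HolderSingleton INSTANCE = new HolderSingleton();
    }

    public static HolderSingleton get() {
        return Holder.INSTANCE; // plain read of a final field
    }
}
```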

All of this is true at least on x86 and sparc.

On IA64 and PPC there are different optimizations that apply here
that I don't think anyone has carried out yet inside JVMs. (I sure
hope the IBM folks are working on this though.)

Anyway, for C++, I think the only issue here is whether it is
worth weakening the semantics of volatile writes compared to Java,
so that IA64 can ALWAYS use st.rel rather than mf, instead of doing
so only as an optimization. As Hans and I discussed (a year
or two ago), the optimizations are hard to apply, yet the actual
need for a full mf is rare. I think there are similar issues for
PPC, but I've stopped pretending I know anything about PPC barriers
any more because people keep telling me different alleged facts
about them.

-Doug

More information about the cpp-threads mailing list