Possible language changes

Boehm, Hans hans.boehm at hp.com
Wed Mar 2 18:51:09 GMT 2005


I agree that the JSR 133 quote is wrong.  (I vaguely recall discussing
this in a JSR 133 context, but the results apparently didn't make it
back.)

On IA64, with pthreads, the overhead for correct DCL in the fast path
is a .acq suffix on the load instruction, plus compiler reordering
restrictions that are unlikely to matter much.  Without DCL, you
need a pthread_mutex_lock()/pthread_mutex_unlock() pair.  In a tight
loop, the .acq costs about 5 cycles, the lock/unlock costs about 100,
which you might be able to reduce, but to nowhere near 5.  (There
currently seem to be extra memory barriers in the pthread
synchronization primitives on Linux/IA64.  Part of the blame for that
no doubt goes to the current confusing spec, which we're trying to fix,
so the performance impact of our work probably won't be all negative.)
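
The cost asymmetry above is easy to see in the shape of the code. The following is a hedged Java sketch of the same comparison (class and method names are mine, not from the thread): the DCL fast path is a single volatile (acquire) load, while the no-DCL alternative pays a lock/unlock pair on every call, roughly the 5-cycle vs. 100-cycle split measured above.

```java
import java.util.concurrent.locks.ReentrantLock;

class Config {
    private static volatile Config instance;     // DCL fast path: one acquire load
    private static final ReentrantLock initLock = new ReentrantLock();

    public static Config get() {
        Config c = instance;                     // cost ~ plain load + acquire
        return (c != null) ? c : init();
    }

    private static Config init() {               // slow path, taken rarely
        initLock.lock();                         // analogous to pthread_mutex_lock
        try {
            if (instance == null)
                instance = new Config();
            return instance;
        } finally {
            initLock.unlock();                   // analogous to pthread_mutex_unlock
        }
    }

    public static Config getAlwaysLocked() {     // the no-DCL alternative:
        initLock.lock();                         // full lock/unlock on every call
        try {
            if (instance == null)
                instance = new Config();
            return instance;
        } finally {
            initLock.unlock();
        }
    }
}
```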

Hans 

> -----Original Message-----
> From: Doug Lea [mailto:dl at cs.oswego.edu] 
> Sent: Wednesday, March 02, 2005 8:32 AM
> To: Maged Michael
> Cc: Doug Lea; Andrei Alexandrescu; asharji at plg.uwaterloo.ca; 
> Ben Hutchings; Boehm, Hans; Jim Rogers; Kevlin Henney; Bill 
> Pugh; Richard Bilson; Douglas C. Schmidt
> Subject: Re: Possible language changes
> 
> 
> Maged Michael wrote:
> > 
> > I am not sure how successful compilers can be at removing
> > unnecessary barriers. Will they be able to generate code with no
> > more barriers than ideal assembly?
> 
> Of course not. No optimizer can guarantee to optimize any 
> kind of code to ideal assembly. But they can usually do much 
> better than most programmers. There haven't been PLDI-ish 
> papers exploring barrier optimizations yet, but you'd figure 
> there will be. For example, HotSpot only does 
> StoreLoad-squashing within basic blocks (which gets most of 
> them in practice), because I didn't know how to do this as a 
> full dataflow analysis. Hopefully some of the people I've 
> suggested pursue this will do so.
> 
> In any case, as I said, if you do need full control, the 
> atomics classes should give you everything you need. And all 
> of the responsibilities for getting it right!
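
As a concrete (and purely illustrative) example of that trade-off: the atomics classes are enough to build your own synchronization, but every correctness obligation then falls on the author. A minimal test-and-test-and-set spin lock, assuming java.util.concurrent.atomic is available:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// A test-and-test-and-set spin lock built directly on the atomics
// classes. Memory ordering comes from AtomicBoolean's volatile
// semantics; fairness and non-blocking progress are now entirely
// the author's responsibility.
class SpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (held.get()) { }                 // spin on the cheap read
            if (held.compareAndSet(false, true))   // CAS only when it may succeed
                return;
        }
    }

    public void unlock() {
        held.set(false);                           // volatile write releases the lock
    }
}
```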
> 
> > 
> > About DCL, I quote the following from the JSR 133 FAQ: 
> > 
> http://www.cs.umd.edu/users/pugh/java/memoryModel/jsr-133-faq.html#dcl
> > 
> > "However, for fans of double-checked locking (and we really hope
> > there are none left), the news is still not good. The whole point of
> > double-checked locking was to avoid the performance overhead of
> > synchronization. Not only has brief synchronization gotten a LOT
> > less expensive since the Java 1.0 days, but under the new memory
> > model, the performance cost of using volatile goes up, almost to the
> > level of the cost of synchronization. So there's *still* no good
> > reason to use double-checked-locking."
> > 
> > I don't know whether this is an accurate assessment of the cost of
> > volatile once compiler optimizations (removing barriers) are taken
> > into account. In any case, my opinion is that C++ should allow at
> > least simple things like DCL to be as efficient as hand-written
> > assembly. Do we agree on that?
> > 
> 
> I probably ought to ask Jeremy and Brian to rewrite that.
> Here's what really happens for
> 
> class Singleton {
>     private static volatile Singleton instance;
>     public static Singleton get() {
>         Singleton s = instance;       // one volatile read on the fast path
>         return (s != null) ? s : init();
>     }
>     private static synchronized Singleton init() {
>         // Use either locks or CAS; the lock version is shown.
>         Singleton s = instance;       // re-check under the lock
>         if (s == null)
>             instance = s = new Singleton();
>         return s;
>     }
> }
> 
> Here, the volatile read in get() costs only the lost compiler
> reorderings relative to a non-volatile read, and that restriction
> is essential here no matter how you do it.
> 
> If you do the initialization with a CAS rather than a lock,
> then you pay at least one CAS, maybe more if you need to retry.
> (CAS approach in general only works when initialization has
> no side effects).
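
A minimal sketch of that CAS-based variant, assuming java.util.concurrent.atomic.AtomicReference (the class name is illustrative). Note why construction must be side-effect-free: a thread that loses the race simply discards the object it built.

```java
import java.util.concurrent.atomic.AtomicReference;

class CasSingleton {
    private static final AtomicReference<CasSingleton> ref =
        new AtomicReference<CasSingleton>();

    public static CasSingleton get() {
        CasSingleton s = ref.get();              // volatile-strength read, fast path
        if (s != null)
            return s;
        CasSingleton fresh = new CasSingleton(); // may be built and then discarded,
        return ref.compareAndSet(null, fresh)    // hence: no side effects allowed
                ? fresh
                : ref.get();                     // lost the race: use the winner's object
    }
}
```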
> 
> If you do it with a lock, you need a CAS for the lock, plus
> the volatile write (generating a StoreLoad barrier), plus the unlock,
> which normally requires another CAS or barrier. The StoreLoad
> associated with a volatile write followed by either a CAS or a barrier
> is one of the "easy" cases, so the JVM will always elide it.
> 
> The net effect is as fast as I know how to do this in assembler.
> Compared to full locking, DCL saves you two expensive instructions
> (CAS or StoreLoad) per call to get(), so it will be noticeably cheaper
> than using locks if this code is exercised much. (Still, there are
> many cases in Java where approaches like the dynamically loaded
> static trick are faster.)
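
The "dynamically loaded static trick" mentioned above is the initialization-on-demand holder idiom: the JVM's lazy, thread-safe class initialization does the synchronization once, so the fast path needs no volatile read at all. A sketch (names illustrative):

```java
class HolderSingleton {
    private HolderSingleton() { }

    private static class Holder {
        // Class initialization is lazy and happens-before any use of
        // INSTANCE, so the JVM provides the synchronization for us.
        static final HolderSingleton INSTANCE = new HolderSingleton();
    }

    public static HolderSingleton get() {
        return Holder.INSTANCE;  // plain final-field read; no barrier on the fast path
    }
}
```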
> 
> All of this is true at least on x86 and sparc.
> 
> On IA64 and PPC there are different optimizations that apply here
> that I don't think anyone has carried out yet inside JVMs. (I sure
> hope the IBM folks are working on this, though.)
> 
> Anyway, for C++, I think the only issue here is whether it is
> worth weakening the semantics of volatile writes compared to Java
> so that IA64 can ALWAYS use st.rel rather than mf, instead of doing
> so only as an optimization. As Hans and I discussed (a year
> or two ago), the optimizations are hard to apply, yet the actual
> need for a full mf is rare. I think there are similar issues for
> PPC, but I've stopped pretending I know anything about PPC barriers
> any more because people keep telling me different alleged facts
> about them.
> 
> -Doug

More information about the cpp-threads mailing list