[cpp-threads] Tremblant mailing

Hans Boehm Hans.Boehm at hp.com
Fri Aug 26 05:40:41 BST 2005


I attempted to send out the draft of our submission for the
pre-Mont-Tremblant mailing.  Unfortunately the mailing list robot
complained about the size of the pdf file.

Here are the LaTeX and bib files again.  I will see if I can make the pdf
available someplace tomorrow.

The author list is currently unchanged from the last time, which is wrong.
I would like to add all contributors to the discussion who do not object
to the content.  Please let me know if you would like to be added or
removed.

Please also send any comments, corrections, or additions.  Patches
(diff output) for the tex file are preferred but not mandatory.

The official deadline is tomorrow (Friday).  We can stretch that
until Sunday, if necessary.

Thanks

Hans
-------------- next part --------------
\documentclass{article}
\bibliographystyle{plain} 

%Check if we are compiling under latex or pdflatex
\ifx\pdftexversion\undefined
  \usepackage[dvips]{graphics}
\else
  \usepackage[pdftex]{graphics}
\fi
\usepackage{url, listings, color}

\sloppy

 \definecolor{CadetBlue}{cmyk}{0.62,0.57,0.23,0}
\DeclareFontShape{OT1}{cmtt}{bx}{n} {
  <5><6><7><8><9><10><10.95><12><14.4><17.28><20.74><24.88>cmttb10
}{}

\lstset{
  language=C++,
  columns=flexible,
  keepspaces=true,
  basicstyle=\tt,
  keywordstyle=\bfseries,
  escapechar=`,
}

\let\verb=\lstinline

\newcommand{\Rplus}{%
  \protect\nolinebreak\hspace{-.07em}%
  \protect\raisebox{.25ex}{\small\textbf{+}}%
}
\newcommand{\Cpp}{C\Rplus\Rplus}

%\renewcommand{\Cpp}{C\nolinebreak\hspace{-.05em}\raisebox{.4ex}{\tiny\bf +}\nolinebreak\hspace{-.10em}\raisebox{.4ex}{\tiny\bf +}}
 % \renewcommand{\Cpp}{{C\nolinebreak[4]\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}
 
\begin{document}

%\title{
\begin{flushright}
\begin{tabular}{rl}
Document Number: & WG21/N1876=J16/05-0136 \\
Date: & 2005-08-26 \\
Reply to: & Hans Boehm \\
 & \url{Hans.Boehm at hp.com} \\
 & 1501 Page Mill Rd., MS 1138 \\
 & Palo Alto CA 94304 USA \\
\end{tabular}

\end{flushright}

\begin{center}
\addvspace{20pt}

{\LARGE Memory model for multithreaded C++: August 2005 status update}

\addvspace{20pt}

{\large \hfill Andrei Alexandrescu \hfill Hans Boehm \hfill Kevlin Henney \hspace*\fill

\addvspace{5pt}

\hfill Ben Hutchings \hfill Doug Lea \hfill Bill Pugh \hspace*\fill}

\end{center}

%\date{August 26, 2005}

%\maketitle

\begin{abstract} 
The C++ Standard defines single-threaded program execution. Fundamentally,
multithreaded execution requires a much more refined memory and execution
model. C++ threading libraries are in the awkward situation of specifying
(implicitly or explicitly) an extended memory model for C++ in order to specify
program execution. We have been working on integrating a memory model
suitable for multithreaded execution into the C++ Standard.

This document is a
continuation of N1680=04-0120 and N1777=05-0037.  It describes
some of the issues we have discussed and tried to address since
the Lillehammer meeting and our current direction.  Its main goal is to
ensure that this direction remains consistent with the views of the
committee.

It has turned out that the most challenging problems are
those imposed on us by our desire to go beyond pthread-like primitives
and support a general-purpose atomic operations library, which
allows the construction of lock-free programs that efficiently
utilize the primitives most hardware provides for that purpose.
Hence we will begin with our current thinking about this
library.

\end{abstract}

\section{Atomic Operations Library}

We had assumed from the start that thread support for C++ would
have to include access to low level atomic operations.

These
operations allow the development of lock-free code, a notoriously
difficult task, but one that is often necessary for high
performance implementations of some low level functionality.
These are primitives aimed at expert programmers.  Most of their
uses would be hidden in other libraries.

It has
become increasingly clear that the diversity of operations in this
library has a profound effect on the memory model.  The relatively
simple approaches we were considering earlier do not allow us to
describe some of the primitives we would like to provide here.\footnote{
The simplest approaches also don't allow us to describe certain lock
semantics.  We would like to allow locking primitives that allow memory
operations to be moved into, but not out of, critical sections, as in
Java.  A simple specification along the lines of the current Pthread spec would
impose additional restrictions and cost.}

\subsection{Atomics library approach}

We have been discussing a library design in which primitives atomically
read and/or update a memory location, and may optionally provide some
memory ordering guarantees.

Nearly all lock-free algorithms require both atomic, i.e.
indivisible, operations on certain pieces of data and a
mechanism for specifying ordering constraints.  Allowing the two
to be combined in a single operation (as opposed to providing
separate atomic operations and ``memory barriers'') makes it
possible to cleanly and precisely express the programmer's intent,
avoids some unnecessary constraints, particularly on compiler reordering,
and makes it easy to take advantage of hardware primitives
that often combine them.\footnote{
For example, many architectures provide an atomic exchange or
compare-and-swap primitive that simultaneously enforces memory ordering.
Itanium provides load and store instructions that ensure ordering
similar to what we describe below.}

For example, atomically updated integers would provide operations
such as

\begin{description}

\item[load]
Atomically read the value of the integer.

\item[store]
Atomically replace the value of the integer.

\item[fetch and add]
Atomically add a value to the integer.

\item[store with release ordering semantics]
Atomically replace the value of the integer, and ensure that all
prior memory operations executed by this thread become visible
to other threads before the update.

\item[load with acquire ordering semantics]
Atomically load the value of the integer, and ensure that all later
memory operations performed by this thread become visible to other
threads only after the load.

\end{description}

The last two together can be used to correctly implement a flag denoting,
for example, that an object has been properly initialized.  The ordering
semantics ensure that the initialization operations performed by
one thread before it sets the flag (using release ordering) are visible
to another thread which reads a set flag (using acquire semantics), and
then expects to see an initialized object.
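
For illustration, in deliberately over-simplified placeholder syntax
(neither the names {\tt atomic\_store\_release} and
{\tt atomic\_load\_acquire} nor the helper functions are taken from the
current draft interface), the initializing thread might execute:

\begin{center}
\begin{minipage}{4in}
\begin{verbatim}
obj = create_and_initialize();    // ordinary writes
atomic_store_release(&ready, 1);  // prior writes visible first
\end{verbatim}
\end{minipage}
\end{center}

while a consuming thread executes:

\begin{center}
\begin{minipage}{4in}
\begin{verbatim}
while (!atomic_load_acquire(&ready))
  ;                               // wait for the flag
use(obj);                         // sees the initialized object
\end{verbatim}
\end{minipage}
\end{center}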

On the vast majority of existing X86 processors, the ``load'' and ``load with
acquire ordering semantics'' primitives could use the same hardware
implementation, and would merely impose different compiler constraints.
Other processor architectures are split as to whether they require
separate implementations.  The same applies to ``store'' and
``store with release semantics''.

We expect that in the final version of the library, operations such
as load and store will be parameterized in some form with an ordering
constraint (such as ``release'', ``acquire'', or ``none'').
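
One possible shape for such a parameterization, shown purely as a
sketch (the names, the enumeration, and the default argument here are
illustrative assumptions, not the draft interface):

\begin{center}
\begin{minipage}{4in}
\begin{verbatim}
enum memory_ordering { ordering_none, ordering_acquire,
                       ordering_release, ordering_both };

template <class T>
T atomic_load(const volatile T* p,
              memory_ordering mo = ordering_none);

template <class T>
void atomic_store(volatile T* p, T value,
                  memory_ordering mo = ordering_none);
\end{verbatim}
\end{minipage}
\end{center}

Whether the constraint would be a run-time argument, a template
parameter, or part of the operation's name is left open by this sketch.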

A current, still very rough, draft of the atomics library interface
is now available \cite{atomics_prop}.

\subsection{Ordering constraints}

Perhaps the most contentious aspect of the design of the atomics
library has turned out to be the set of ordering constraints.
This matters for several reasons:

\begin{enumerate}
\item Lock-free algorithms often require very limited and specific
constraints on the order in which memory operations become visible to
other threads.  Even relatively limited constraints such as ``release'',
may be too broad, and impose unneeded constraints on both the
compiler and hardware.

\item Different processors often provide certain very limited constraints
at small or no additional cost, whereas the cost of enforcing something
like an ``acquire'' constraint may be substantially higher.\footnote{Note that
this does not apply to current X86 processors and others, where even an
``acquire'' constraint imposes no additional cost.}
In particular, the vast majority of, but not all, processors ensure
that dependent
operations, e.g. loading a pointer and then loading a value referenced
by the pointer, occur in order, even when observed by other threads.
But different processor architectures rarely, if ever, agree on what
exactly constitutes dependent operations.

\item As discussed briefly below, adding further constraints appears
to significantly complicate the memory model.
\end{enumerate}

At the moment, there is no clear consensus within our group, but we are
leaning towards a small set of ordering constraints (acquire, release,
both, or none), largely because we have been unable to find a way
to precisely specify the rest.  In particular, the other implicit guarantees
provided ``for free'' by many processor architectures are usually such
that they can be easily negated by routine compiler transformations,
sometimes even by transformations made to code in a separately
translated source file.  Thus it appears hard to take advantage
of these without resorting to essentially assembly code.
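
The dependent-load case from item 2 above illustrates the problem.  In
the same over-simplified placeholder syntax as before (the field and
variable names are invented for the example), a publishing thread might
execute

\begin{center}
\begin{minipage}{4in}
\begin{verbatim}
p->data = 42;                    // initialize the payload
atomic_store_release(&head, p);  // publish the pointer
\end{verbatim}
\end{minipage}
\end{center}

while a consuming thread executes

\begin{center}
\begin{minipage}{4in}
\begin{verbatim}
r1 = atomic_load(&head);         // no acquire constraint
if (r1 != 0)
  r2 = r1->data;                 // load depends on r1
\end{verbatim}
\end{minipage}
\end{center}

On most processors the hardware already guarantees that the second load
in the consuming thread is not performed before the first, since its
address depends on the loaded value.  But if, for example, the compiler
can determine that {\tt r1} can hold only one particular non-null value,
it may replace {\tt r1->data} with a load from that known address,
silently removing the dependence; ordinary optimizations of this kind
are what make such guarantees hard to expose without essentially
writing assembly code.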

We do expect to diverge somewhat from most primitives defined
by \cite{JSR166} in allowing atomic operations
themselves to be reordered with each other, if they have no explicit
ordering constraint to the contrary.  There is no assumption that
atomic operations by default enforce sequential consistency among themselves,
as volatiles do in Java.\footnote{Our ``release'' semantics correspond
roughly to the {\tt lazySet} semantics in {\tt java.util.concurrent.atomic}.}

We expect that limiting the ordering constraints to a small set will
result in small performance losses (on the order of a few percent)
for some lock-free code relative to the assembler equivalent.  But
this should remain minor compared to the benefit that may be achieved
by lock-free code in many circumstances.

\section{Implications on the Memory Model}

\label{implications}

Our original plan was to largely preserve what might be termed the
``pthreads memory model''.  Any code that contained a ``data race'',
i.e. an assignment to a shared location that took place concurrently
with another access to that location, would have undefined semantics.
Any other code would have ``sequentially consistent''\cite{Lamport79}
semantics, i.e. could be understood as a simple interleaving of the
actions of the individual threads.

We concentrated on defining precisely what was meant by a ``data race''.
This is currently only very incompletely defined, which
gives rise to the issues discussed in \cite{boehm05}.  This
definition was itself based on sequential consistency.

This approach has the nice property that the programmer only needs
to understand sequential consistency in order to reason
about programs.  At the same time, the actual guarantees that
are provided are {\em much} weaker than sequential consistency, since
concurrent writes are simply disallowed.

This approach unfortunately does not work in the presence of an
atomic operations library, which explicitly allows concurrent accesses
to special atomic variables, while providing few memory ordering
guarantees.  Consider having thread {\it A} execute the following, where
initially {\tt x}, {\tt y} and {\tt z} are all zero:\footnote{
For brevity of explanation,
we use over-simplified syntax, not the one from our current
draft proposal.}

\begin{center}
\begin{minipage}{2in}
\begin{verbatim}
atomic_store(&x, 1);
r1 = atomic_load(&y);
if (!r1) ++z;
\end{verbatim}
\end{minipage}
\end{center}

while thread {\it B} executes:

\begin{center}
\begin{minipage}{2in}
\begin{verbatim}
atomic_store(&y, 1);
r2 = atomic_load(&x);
if (!r2) ++z;
\end{verbatim}
\end{minipage}
\end{center}

Under a sequentially consistent interpretation, one of the
{\tt atomic\_store} operations must execute first.  Hence {\tt r1}
and {\tt r2} cannot both be zero, and hence there is no data race
involving {\tt z}.  There are data races involving {\tt x} and
{\tt y}, but those accesses are made through the atomic operations
library, and hence must be allowed; atomic accesses would serve little
purpose if they could never be involved in such races.

The difficulty is that there are strong reasons to support
variants of {\tt atomic\_store} and {\tt atomic\_load}
that allow them to be reordered, i.e. that allow the {\tt atomic\_load}
to become visible to other threads before the {\tt atomic\_store}.
For example, preventing this reordering on some common X86 processors
incurs a penalty of over 100 processor cycles in each thread.
Both the ordinary load and store operations, as well as the acquire
and release versions from the preceding section, will allow this
reordering.

For
variants that allow reordering, the above program should really invoke
undefined semantics, since {\tt r1} and {\tt r2} can both
be zero, and hence there is a data race on the ordinary variable
accesses to {\tt z}.

This means that we need to use a definition of data race that
incorporates the reordering constraints imposed by atomic operations.

Our current (still very rough) draft proposal \cite{mm_prop}
requires sequential consistency for ordinary variable
accesses, but effectively allows reordering of atomic variable accesses
{\em when determining the existence of a data race}, i.e. it tries
to adapt the original solution.  (The basic approach we use
for atomics in the definition of a data race is similar to the
notion of happens-before consistency\cite{JSR133web}.)

So far, this approach seems to have the right properties.  It has the
disadvantage that we are essentially exploring new ground, and hence
have relatively little confidence that there are no unforeseen glitches.

\section{Alternative: The Java Memory Model}

An alternate approach to many of these issues would be to essentially adopt
the Java memory model.  This option is still discussed regularly, and has
not been completely dismissed, particularly if we run into further technical
difficulties with our current approach.

There are however good reasons that we have not committed to that route,
in spite of the overlap among the participants in the two efforts.
These were largely discussed in an earlier document (N1777=05-0037),
and informally agreed to by the committee.

Both for completeness, and because they have been explored a bit deeper
by our group in the meantime, we again summarize them here:

\begin{itemize}
\item The Java approach is required to ensure type-safety, and more
particularly, to ensure that no load of a shared pointer can ever result in a
pointer to an object that was constructed with a different type.
C++ does not have this property to start with, and hence we are free to
continue to violate it.
\item This would be different from the approach taken by pthreads
(and probably, implicitly, by other thread libraries), which give undefined
semantics to any code involving a data race.
\item The Java approach requires ordinary reads to be atomic for certain
data types, including pointers. We expect this may be problematic on
some embedded hardware with narrow memory buses or weak alignment
constraints.
\item The Java approach requires a complex model to deal with causality.
We are hoping to make that unnecessary for the core language, and to specify
the atomics library less formally. This is discussed in the next section.
\item The current pthreads approach allows some compiler transformations
that are not legal in Java, but probably moderately common in the context
of C++. These produce unexpected results in the presence of data races.
But given the pthreads approach of outlawing data races, they appear benign,
and can improve performance. In particular, assuming we have no
intervening synchronization actions, compilers currently may:
\begin{enumerate}
\item ``Rematerialize'' the contents of a spilled register by reloading its
value from a global, if single-threaded analysis shows that the global may
not have been modified in the interim. This is only detectable by programs
that involve a data race. (This means that {\tt r1 == r1} may evaluate to
false if {\tt r1} is a local integer loaded via a data race. That's OK since
the program has undefined behavior.)
\item Fail to provide explicit guarantees about visibility of vtable
pointer initialization.  Current rules do not allow an object to be
communicated to another thread without explicit synchronization.
We expect that on some architectures, this avoids an expensive memory
barrier.
\item Introduce redundant writes to shared globals. For example, if we are
optimizing a large synchronization-free switch statement for space, all
branches but one contain the assignment {\tt x = 1;}, and the last one contains
{\tt x = 2;}, we can, under suitable additional assumptions, put {\tt x = 1;}
before the switch statement and reassign {\tt x} in the one exceptional case
(a before/after sketch follows this list).
\item Insert code to do the equivalent of
{\tt if (x != x) <{\it start rogue-o-matic}>} at any point for any global x
that is already read without intervening synchronization. That's perhaps
not very interesting. But it does point out that analysis in current compilers
is based on the assumption that there are no concurrent modifications to
shared variables. We really have no idea what the generated code will do
if that assumption is violated. And the behavior cannot necessarily be
understood by modeling reads of globals to return some sort of
nondeterministic value. 
\end{enumerate}
None of these transformations are legal in Java. Variants of all of them
appear potentially profitable and/or useful as part of a debugging facility.
\end{itemize}
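
To make the third transformation above concrete, here is a hypothetical
before/after sketch (the surrounding switch statement is invented for
illustration):

\begin{center}
\begin{minipage}{4in}
\begin{verbatim}
// Original: each branch writes x exactly once.
switch (i) {
  case 0:  x = 1; f0(); break;
  case 1:  x = 1; f1(); break;
  default: x = 2; f2(); break;
}

// After optimizing for space: common store hoisted.
x = 1;
switch (i) {
  case 0:  f0(); break;
  case 1:  f1(); break;
  default: x = 2; f2(); break;
}
\end{verbatim}
\end{minipage}
\end{center}

In the transformed code the default case writes {\tt x} twice, and another
thread could briefly observe the value 1 where the original program would
only have stored 2.  This is harmless only because any code that could
observe the difference necessarily contains a data race.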

\section{Causality issues}

We are hoping to largely avoid the rather complex treatment of
causality which forms the basis of the Java memory model.  It currently
appears that we can do so for the core language, but that this is not
entirely possible in the presence of the atomics library.

Consider a modification of the example from section~\ref{implications}.
Again all variables are initially zero.  Thread {\it A} executes:

\begin{center}
\begin{minipage}{4in}
\begin{verbatim}
r1 = atomic_load(&y);
atomic_store(&x, r1);  // Effectively assigns y to x
if (r1 == 42) ++z;
\end{verbatim}
\end{minipage}
\end{center}

while thread {\it B} executes:

\begin{center}
\begin{minipage}{4in}
\begin{verbatim}
r2 = atomic_load(&x);
atomic_store(&y, r2);  // Effectively assigns x to y
if (r2 == 42) ++z;
\end{verbatim}
\end{minipage}
\end{center}

This contains a data race, and thus has undefined semantics, if and
only if {\tt r1} = {\tt r2} = 42.  Most programmers would be surprised
by this result.  However, since the {\tt atomic\_load} and
{\tt atomic\_store} operations performed by each thread may become
visible to the other thread out of order, in the absence of some
causality requirement this becomes possible.  Each thread's
{\tt atomic\_load} operation sees a value of 42, which it
then stores in the other variable, justifying the result read by the
other thread.

In fact, ignoring the atomicity requirements,
there are sequentially correct compiler transformations that
could produce this outcome.  We might have profiled this code
under different circumstances and discovered that {\it x}
and {\it y} are usually 42.  The compiler may then have transformed
what is effectively the assignment {\tt x = y;} in thread {\it A}
to

\begin{center}
\begin{minipage}{2in}
\begin{verbatim}
x = 42; if (y != 42) x = y;
\end{verbatim}
\end{minipage}
\end{center}

(In isolation, this is clearly silly.  In more complicated contexts, this
might conceivably make sense for instruction scheduling reasons, e.g.
because {\tt y} is being prefetched, and this gives the prefetch
more time to complete.)

We would like to prohibit this kind of transformation for atomics.
(It is not an issue for ordinary variable assignments, since it
can only arise in the presence of a data race, and that is effectively
disallowed.)

Stating precisely what it means to disallow this would require something
functionally similar to the Java treatment of causality\cite{JSR133web},
which we would like to avoid for reasons of simplicity.

Since we expect that this kind of aggressive transformation on calls to
the atomics library is unlikely in any case, we instead propose to address
it with a less precise statement, to the effect that speculative
execution of primitives in the atomics library is disallowed.\footnote{
This would probably not be acceptable for ordinary variables, since it gives
rise to some borderline cases, which really need to be resolved.
But we feel that in the case of atomics, it is perfectly acceptable to
resolve any such questions conservatively, since any performance
penalty should be negligible.  In the very unlikely case that serious
questions arise in this area with atomics, the Java spec could serve as
a guide for resolving them.}

\section{Conclusions}

We believe that we are making progress on the specification of both a
memory model and an atomic operations library for C++.

It has become increasingly clear that any specification will have
an unavoidable impact on compiler optimization.\footnote{We have some new
examples of this.  See slide 14 in
\url{http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/pldi05_threads.pdf}}
Some currently common compiler optimizations need to be adapted to
ensure thread safety.  But this also reinforces the urgency for
thread support in C++:  Current implementations make it much harder
than it should be to write correct multithreaded code.

Our progress has been slowed by both the technical difficulties of
defining a memory model that is compatible with a high performance
atomics library, and by disagreements about the atomics library itself.

We believe that our current approach is based on sound reasoning,
and that there is no easy way to sidestep the technical difficulties.
It does still appear that we are heading for a solution whose complexity
is effectively limited to the atomics library, and hence will not
affect a programmer writing data-race-free programs that use locks
for synchronization.  We expect most programmers to fall into this category.

Feedback on our direction is greatly appreciated.

\bibliography{multithreading}

\end{document}
-------------- next part --------------
@book{pthreads,

    author = "IEEE Standard for Information Technology",

    title = {Portable Operating System Interface (POSIX) --- System Application Program Interface (API) Amendment 2: Threads Extension (C Language)},

    publisher = "ANSI/IEEE 1003.1c-1995",

    year = 1995

    }

@unpublished{JSR133,

  author = "Tim Lindholm et al.",

  title = "{Java Specification Request 133: Memory Model and Thread Specification Revision}",

  note = "Available at \url{http://www.jcp.org/jsr/detail/133.jsp}",

 }



@MISC{JSR133web,

        author = "{JSR 133 Expert Group}",

        title = "JSR-133: Java Memory Model and Thread Specification",

	year = "2004",

	month = "August",

        howpublished ={\url{http://www.cs.umd.edu/~pugh/java/memoryModel/jsr133.pdf}}

}



@unpublished{JSR166,

  author = "Doug Lea et al.",

  title = "{Java Specification Request 166: Concurrency Utilities}",

  note = "Available at \url{http://www.jcp.org/jsr/detail/166.jsp}",

 }

@article{maddj,

  author = "Scott Meyers and Andrei Alexandrescu",

  title = "{C++ and The Perils of Double-Checked Locking}",

  year = 2004,

  month = "Jul",

  journal = "Doctor Dobb's Journal", 

}

@article{herlihy,

 author = {Maurice Herlihy},

 title = {Wait-free synchronization},

 journal = {ACM Trans. Program. Lang. Syst.},

 volume = {13},

 number = {1},

 year = {1991},

 issn = {0164-0925},

 pages = {124--149},

 doi = {http://doi.acm.org/10.1145/114005.102808},

 publisher = {ACM Press},

 }



@Article{Lamport79,

	author = "Leslie Lamport",

	title = "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs",

	journal = "IEEE Tranactions on Computers",

	year = {1979},

	month = "September",

	volume = {C-28},

	number = {9},

	pages = "690-691"

}



@MISC{atomics_prop,

   author = 	"Boehm et al.",

   title = 	"A very Preliminary Proposal for the Atomics Package Interface",

   howpublished = "\url{http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/atomics.h.txt}"

}



@MISC{mm_prop,

   author = 	"C++ Memory Model Working Group",

   title = 	"A Memory Model for C++: Strawman Proposal",

   howpublished = "\url{http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/mm.html}"

}



@unpublished{dclpIsBroken,

    author = "Bacon, David and Bloch, Joshua and Bogda, Jeff and Click,

              Cliff and Hahr, Paul and Lea, Doug and May, Tom and

              Maessen, Jan-Willem and Mitchell, John D. and Nilsen,

              Kelvin and Pugh, Bill and Sirer, Emin Gun",

    title = "The ``{Double-Checked Locking Pattern is Broken}'' {D}eclaration",

    note = "Available at \url{http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html}",

}

@unpublished{improvedOpts,

  title = "{Toward Improved Optimization Opportunities in C++0X}",

  author = "Walter E. Brown et al.",

  year = 2004,

  month = "Jul",

  note = "Document WG21/N1664 = J16/04-0104; available at

  \url{http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1664.pdf}",}



@InProceedings{popl05,

   author =	"Jeremy Manson AND William Pugh AND Sarita Adve",

   title =	"The Java Memory Model",

   booktitle =	"Conference Record of the Thirty-Second Annual ACM Symposium

		on Principles of Programming Languages",

   year =	{2005},

   month =	"January",

}



@MISC{boehm04,

   author = 	"Hans Boehm",

   title = 	"Threads Cannot be Implemented as a Library",

   howpublished = "\url{http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html}"

}



@InProceedings{boehm05,

   author = 	"Hans Boehm",

   title = 	"Threads Cannot be Implemented as a Library",

   booktitle =	"Proceedings of the ACM SIGPLAN 2005 Conference on Programming

		Language Design and Implementation",

   year =	{2005},

   pages = 	{26-37},

}



@InProceedings{SastryJu,

   author =	"A. V. S. Sastry AND Roy D. C. Ju",

   title = 	"A New Algorithm for Scalar Register Promotion Based on SSA

	   	Form",

   booktitle =	"Proceedings of the ACM SIGPLAN 1998 Conference on Programming

		Language Design and Implementation",

   year =	{1998},

   pages = 	{15-25},

}



@InProceedings{SGIrp,

   author =	"Raymond Lo AND Fred Chow AND Robert Kennedy AND Shin-Ming

	   	Liu and Peng Tu",

   title = 	"Register promotion by sparse partial redundancy elimination

	   	of loads and stores",

   booktitle =	"Proceedings of the ACM SIGPLAN 1998 Conference on Programming

		Language Design and Implementation",

   year =	{1998},

   pages = 	{26-37},

}



@InProceedings{Bershad92,

   author =	"Brian N. Bershad AND David D. Redell AND John R. Ellis",

   title =	"Fast Mutual Exclusion for Uniprocessors",

   booktitle =	"ASPLOS-V: Fifth International Conference on Architectural

		Support for Programming Languages and Operating Systems",

   year =	{1992},

   month =	"October",

   pages =	"223-233"

}




