pre-LilleHammer mailing deadline

Doug Lea dl at cs.oswego.edu
Sat Mar 5 13:25:44 GMT 2005


I expanded out the section on volatiles, and did a couple of
other light edits.

See attached file "diffs" for changes. (Sorry to hit M-q so
much to rewrap paragraphs, which makes it hard to see what changed.)

-Doug


-------------- next part --------------
\documentclass{article}
\bibliographystyle{plain} 

%Check if we are compiling under latex or pdflatex
\ifx\pdftexversion\undefined
  \usepackage[dvips]{graphics}
\else
  \usepackage[pdftex]{graphics}
\fi
\usepackage{url, listings, color}

\sloppy

 \definecolor{CadetBlue}{cmyk}{0.62,0.57,0.23,0}
\DeclareFontShape{OT1}{cmtt}{bx}{n} {
  <5><6><7><8><9><10><10.95><12><14.4><17.28><20.74><24.88>cmttb10
}{}

\lstset{
  language=C++,
  columns=flexible,
  keepspaces=true,
  basicstyle=\tt,
  keywordstyle=\bfseries,
  escapechar=`,
}

\let\verb=\lstinline

\newcommand{\Rplus}{%
  \protect\nolinebreak\hspace{-.07em}%
  \protect\raisebox{.25ex}{\small\textbf{+}}%
}
\newcommand{\Cpp}{C\Rplus\Rplus}

%\renewcommand{\Cpp}{C\nolinebreak\hspace{-.05em}\raisebox{.4ex}{\tiny\bf +}\nolinebreak\hspace{-.10em}\raisebox{.4ex}{\tiny\bf +}}
 % \renewcommand{\Cpp}{{C\nolinebreak[4]\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}
 
\begin{document}

%\title{
\begin{flushright}
\begin{tabular}{rl}
Document Number: & WG21/N1770=J16/05-0037 \\
Date: & 4 March 2005 \\
Reply to: & Hans Boehm \\
 & \url{Hans.Boehm at hp.com} \\
 & 1501 Page Mill Rd., MS 1138 \\
 & Palo Alto CA 94304 USA \\
\end{tabular}

\end{flushright}

\begin{center}
\addvspace{20pt}

{\LARGE Memory model for multithreaded C++: Issues}

\addvspace{20pt}

{\large \hfill Andrei Alexandrescu \hfill Hans Boehm \hfill Kevlin Henney \hspace*\fill

\addvspace{5pt}

\hfill Doug Lea \hfill Bill Pugh \hspace*\fill}

\end{center}

%\date{March 4, 2005}

%\maketitle

\begin{abstract} 
The C++ Standard defines single-threaded program execution. Fundamentally,
multithreaded execution requires a much more refined memory and execution
model. C++ threading libraries are in the awkward situation of specifying
(implicitly or explicitly) an extended memory model for C++ in order to specify
program execution. We propose integrating a memory model suitable for
multithreaded execution in the C++ Standard.  This document is a
continuation of N1680=04-0120, and outlines some of the more
fundamental issues we have encountered.  In particular, we desire initial
feedback on whether we should prepare for a future type-safe subset
of the language, or ignore type-safety considerations for now.

\end{abstract}

\section{Introduction} 

Many of today's applications make use of multithreaded execution. We
expect such use to grow as the increased use of hardware
multithreading (a.k.a. ``hyperthreading'') and multi-core processors
will force or entice more and more applications to become
multithreaded.  \Cpp\ is commonly used as part of multithreaded
applications, sometimes with direct calls into an OS-provided
threading library (e.g. POSIX threads (pthreads)~\cite{pthreads} or Win32 threads), and
sometimes with the aid of an intervening layer that provides a
platform-neutral interface (e.g. Boost Threads).

In N1680=04-0120, and in \cite{boehm04}, we point out the difficulties with
this approach, and the need to clearly define a memory model, analogous
to \cite{JSR133}, that gives the programmer sufficient guarantees to
guard against unexpected transformations: transformations that are
benign in the absence of threads, but change the semantics of a
concurrent program.

Here we list some fundamental issues on which we seek preliminary guidance.
The first of these appears sufficiently fundamental that our group
cannot make much progress without resolving it in one direction or the
other.  We suggest resolutions where there appears to be consensus within
our group.

\section{Do we Allow Data Races?}

A {\it data race} arises when one thread in a program can potentially
write an ordinary shared memory location while another concurrently
accesses it.  This results in nondeterminism, and is usually an indication
of a programming error.\footnote{In a few cases, concurrent
accesses to shared variables without locking can lead to much faster
programs.
for such concurrent accesses in a way that is recognizable by the
implementation.  They are excluded from our definition
of a data race, as are volatile accesses in Java.}

Most existing multithreaded C++ applications operate in an environment
in which the semantics of data races are left intentionally undefined.
For example, the POSIX threads (pthreads)~\cite{pthreads} standard takes
this route.  This allows some additional compiler optimizations.  For
example, if we use names like {\tt r1} and {\tt r2} to denote
register-allocated variables, and other names to denote shared variables,
in many cases it becomes acceptable to transform:

\begin{verbatim}
r1 = x; use r1; ...; use r1;
\end{verbatim}

to

\begin{verbatim}
r1 = x; use r1; ...; r2 = x; use r2;
\end{verbatim}

If the compiler runs out of registers in which to store {\tt r1}, it may
reload the variable from its original shared source {\tt x}, provided
{\tt x} cannot be written by any intervening statement.

This transformation preserves the meaning of sequential programs, but clearly
alters the meaning of a multithreaded program in which another thread
concurrently updates {\tt x}.  The pthreads standard continues to allow such
transformations, since the meaning of such programs is undefined.
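
For concreteness, here is a small sketch of how such a reload can change
observable behavior; the function {\tt consume} is hypothetical, and the
comments describe one possible compilation, not a required one:

\begin{verbatim}
int x;               // ordinary shared variable, written by other threads
void consume(int);   // hypothetical consumer of the checked value

void reader() {
  int r1 = x;        // r1 = x;
  if (r1 != 0)       // use r1: a check
    consume(r1);     // use r1: assumed to be the value that passed the check
}
// Under register pressure, a compiler may reload x for the second use
// (r2 = x) instead of spilling r1.  If another thread changes x in the
// interim, consume() receives a value that never passed the check.
\end{verbatim}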

By contrast, the Java programming language carefully defines
the semantics of programs with data races, and thus outlaws this
transformation, forcing {\tt r1} to be spilled to a new location in the
above example.

The Java specification is essentially forced to make this choice, since
it tries to guard against malicious code running in the same address
space.  Suppose the above sequence occurred in trusted code, the first use
of {\tt r1} were a safety check (raising an exception in the unsafe case),
and the second use of {\tt r1} were safe only if the check succeeded.  It
would then be unsafe to reload {\tt r1} from {\tt x} in the interim: malicious
code could repeatedly call this function while changing {\tt x}, eventually
causing the trusted code to perform an unsafe action on a value of {\tt x}
different from the one it tested.

In Java, this transformation could also result in an unchecked 
out-of-bounds array access, and hence violate type-safety if, for
example, the pointer to a shared array is reloaded between the subscript
check and array access.

Some C/C++ compilers appear to spill to new locations anyway, and
others might decide to do so in the future to avoid programmer surprises, so
this particular issue might not be decisive.  But there are
potentially many additional cases along these lines (for example,
caching statics).  The need to provide a full semantics for data races
in all such cases in the Java specification introduces some of the
more challenging technical issues in \cite{popl05}, some of which we
are not confident can be applied to \Cpp, but which can be avoided if we
leave their semantics undefined.

The consensus within our group (grudgingly for some of us)
is that it would be preferable to leave the semantics of data races undefined,
mostly because it is much more consistent
with the current C++ specification, practice, and 
especially implementations.
And in the absence of objections from the committee, we plan to follow
that path.

The disadvantage of this approach is that any future attempt to define
a type-safe subset of \Cpp, especially if it intends to
support ``sand-boxed'' execution of untrusted code, would have to
revisit this issue, and follow a more Java-like model. 


\section{Other Issues}

We list here a number of other issues on which we would appreciate
feedback.  They are probably less time-critical.  We also expect that
we have overlooked some equally fundamental and controversial ones.
We would appreciate additions to this list.

\subsection{Implicit Writes Restricted to Bit-Fields}

Essentially all of the problems pointed to by our prior work are due to
compiler transformations that introduce stores, and hence data races,
that did not exist in the original program.

A store to a bit-field requires reading and rewriting adjacent bit-fields
of the same struct or class, since most architectures do not support
storing, say, a single bit to memory.  Some older architectures had
similar restrictions for {\tt char}-sized values, but we are not aware
of current multiprocessors with such restrictions.

We are leaning towards allowing such compiler-introduced writes only
for contiguous sequences of bit-fields in the same struct or class.
(The detailed formulation would require more care.)  Some restriction
along these lines is required for reliable multithreaded programming.
Weaker restrictions might be possible at some usability cost,
but we currently see no reason to weaken the restriction.
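
For example, a rule along these lines would distinguish the members of the
following sketch (the member names are purely illustrative):

\begin{verbatim}
struct S {
  unsigned f1 : 4;  // f1 and f2 form a contiguous sequence of bit-fields:
  unsigned f2 : 4;  //   a store to f1 may read and rewrite f2
  char c;           // c is a distinct member: a store to f1 or f2 must not
};                  //   be implemented as a wider store that also rewrites
                    //   c, since another thread may be updating c
\end{verbatim}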

This would greatly penalize multiprocessor architectures which do not support
efficient atomic byte stores.  But we are not aware of any such
current architectures.  It would force uniprocessors with such
restrictions to use techniques such as those in \cite{Bershad92}.

This affects not only the way values are stored into a structure, but
also some other compiler transformations, notably fully general
speculative register
promotion \cite{SastryJu, SGIrp}.  But we see no way to simultaneously
support this optimization and reliable concurrent programming.  Further,
we believe that not all high performance C++ compilers currently
perform this transformation, so the cost of disallowing it is not huge.
Java disallows it.
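
A rough sketch of the kind of transformation we mean (the code and names
are illustrative, not drawn from any particular compiler):

\begin{verbatim}
int x;   // shared: other threads may update x whenever this thread does not

void f(bool updates_x, int n) {
  for (int i = 0; i < n; ++i)
    if (updates_x)
      ++x;                    // the only store to x in the source
}
// Fully general speculative register promotion may rewrite the loop as
//   int r = x;               // speculative load
//   for (int i = 0; i < n; ++i)
//     if (updates_x) ++r;
//   x = r;                   // store introduced on every execution,
// racing with other threads' updates to x when updates_x is false.
\end{verbatim}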

\subsection{The Meaning of ``volatile''}

We are leaning towards strengthening the meaning of ``volatile'' to
make it usable for (relatively) inexpensive inter-thread
communication.  As discussed in the original proposal N1680=04-0120, we
also plan to define intrinsic library classes that provide more
flexible and extensive control over barriers and atomic instructions.

The main attraction of this approach to {\tt volatile} is that it
makes it more likely for programmers to write correct code, by
following the rule that any variable that is accessed by multiple
threads without a lock should be declared {\tt volatile}. (There are
cases where this is unnecessary, but they require thought and care.)
Double-checked locking \cite{dclpIsBroken}, Dekker's algorithm, and
other common constructions ``just work'' if the relevant variables are
declared {\tt volatile}.
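
For instance, double-checked initialization could then be written roughly
as follows (using pthreads for the lock; the visibility guarantee relies on
the proposed strengthened {\tt volatile}, not on current implementations):

\begin{verbatim}
#include <pthread.h>

class Widget { /* ... */ };

Widget* volatile instance = 0;         // volatile, per the rule above
pthread_mutex_t instance_lock = PTHREAD_MUTEX_INITIALIZER;

Widget* get_instance() {
  if (instance == 0) {                 // unsynchronized fast-path read
    pthread_mutex_lock(&instance_lock);
    if (instance == 0)
      instance = new Widget();         // the volatile write publishes the
                                       // fully constructed object
    pthread_mutex_unlock(&instance_lock);
  }
  return instance;
}
\end{verbatim}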

However, there are a few concerns that make this choice controversial:

First, this sense of volatile requires read/write atomicity.  This can
be a problem for types wider than those natively supported on a processor.
On uniprocessors, this may entail some minor overhead to support
``restartable atomic sequences'' to keep together multiple read or
write instructions in the face of interrupts or context
switches. However, on some multiprocessors, there might not be any
applicable techniques short of heavy solutions such as the insertion
of otherwise inaccessible locks.  Thus, there may be types for which
the {\tt volatile} qualifier either cannot be supported on a
particular target platform, or would entail surprising time and space
overhead.  It is not clear to us how such limitations can be expressed
in the specification, but we suspect that there is some way to do
so.  Guidance on this issue would be appreciated.
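
A sketch of the atomicity concern, using the common {\tt long long}
extension as an example of a wide type (the exact set of affected types is
target-dependent):

\begin{verbatim}
volatile long long ticks;   // 64 bits: wider than the native word on many
                            // 32-bit targets

void tick() { ticks = ticks + 1; }
// Here the load and the store may each compile to two 32-bit memory
// operations.  Without a restartable atomic sequence (uniprocessor) or an
// implementation-supplied lock (multiprocessor), another thread reading
// ticks can observe a half-written value.
\end{verbatim}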

Second, these semantics for {\tt volatile} impact performance.  To
maintain reasonable efficiency, this approach relies on compilers to
perform some new optimizations. On multiprocessors, naive translation
requires a ``write barrier'' of some sort to be issued on each
volatile assignment. These barrier instructions are almost always
expensive. However, many of them are not actually required. For
example, two successive writes to two volatiles need only one trailing
barrier. The optimizations to handle this and most cases that arise in
practice do not appear to be difficult. However, even with them,
programs with excessive reliance on {\tt volatile} will be noticeably
slower. On the other hand, if we choose this route, the detailed
semantics of {\tt volatile} are likely to be slightly weaker than in
Java, and so may cost somewhat less even in the absence of aggressive
optimization.
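
One small sketch of the kind of optimization we have in mind (the comments
describe a plausible translation, not a mandated one):

\begin{verbatim}
volatile int ready1, ready2;

void publish() {
  ready1 = 1;   // naive translation: store ready1; write barrier
  ready2 = 1;   // naive translation: store ready2; write barrier
}
// Since no ordinary shared access separates the two volatile stores, a
// compiler aware of these semantics could instead emit both stores
// followed by a single trailing write barrier.
\end{verbatim}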

Third, it is possible that these semantics might conflict with some of
the current implementation-defined effects of {\tt volatile} in
certain compilers.  This does not appear likely, though.

We believe that the advantages of this approach outweigh the
disadvantages, but would like to hear especially from compiler
implementors about any additional concerns or constraints.



\subsection{Function-scope statics}

Construction of function-scope statics may require the compiler to
introduce an ``is this initialized'' flag variable.  In a multithreaded
program, access to this flag variable may introduce a data race
not visible to the programmer; a sketch follows the list of options
below.  There appear to be three possible solutions:

\begin{enumerate}

\item Document the existence of the flag, and let the programmer
add suitable synchronization code.

\item Let the compiler add the synchronization code.

\item Deprecate the construct, except for initializations to compile-time
constants.

\end{enumerate}
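
To make the hidden flag concrete, here is a sketch of what the compiler
conceptually generates; the names ({\tt Widget}, {\tt compute\_args},
{\tt w\_initialized}) are purely illustrative:

\begin{verbatim}
struct Widget { Widget(int) { /* ... */ } };
int compute_args();                    // not a compile-time constant

Widget& helper() {
  static Widget w(compute_args());     // dynamically initialized static
  return w;
}
// Conceptually expands to something like:
//   static bool w_initialized = false;        // hidden flag
//   static <storage for a Widget> w_storage;
//   if (!w_initialized) {                     // unsynchronized read
//     <construct Widget in w_storage>;
//     w_initialized = true;                   // unsynchronized write
//   }
//   return <reference to w_storage>;
// Two threads calling helper() concurrently race on w_initialized (and may
// run the constructor twice) unless synchronization is added, either by
// the programmer (option 1) or by the compiler (option 2).
\end{verbatim}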

The first option may be very surprising to the programmer.  The second option
adds (potentially expensive, if done correctly) memory barriers, which we
expect are usually redundant with programmer-provided synchronization.  Thus
it may surprise the programmer with unexpected performance problems.  It
also may cause the compiler to generate thread-library-dependent code,
or require a standard API across thread libraries.

Based on a very limited sample, current practice seems to favor the
second option.  Our group currently seems to lean towards the third, though it
does appear drastic.

We expect this is an issue that was previously debated by many compiler
implementors.  We would like to better understand the various outcomes.
We would also like to know about the prevalence of such constructions
in practice. If they are rare, a heavyweight solution may work out
fine.

\subsection{The Threading API}

Our group has had interesting internal discussions about the
desirability of defining a C++-standard threading API.  Current
practice appears to be split between many different APIs or language
extensions, often with significantly different programming models.
(Pthreads, Win32 threads, OpenMP, and Boost threads are probably
among the most widely used, with many other very significant contenders.)

Real standardization would almost certainly be a benefit here.
But since many of these are already widely established, it is unclear whether
that is achievable.  And there seems to be relatively little consensus
as to what it should look like.

We plan to separate the memory model discussions from the threading
API discussions as much as possible.  It appears to be possible
to define the memory model without reference to the details of the
threading API, and in a way that applies across all of the above
models.

If there is any consensus on the C++ committee as to whether we
should pursue a threading API, or what it should include, we would like
to learn about it.

\bibliography{multithreading}

\end{document}
-------------- next part --------------
A non-text attachment was scrubbed...
Name: issues.pdf
Type: application/pdf
Size: 80040 bytes
Desc: not available
Url : http://shadbolt.decadentplace.org.uk/pipermail/cpp-threads/attachments/20050305/45f52aaa/issues.pdf
-------------- next part --------------
154c154
< above example.
---
> above example. 
171,176c171,179
< The need to deal with data races in the Java specification also
< introduces some of the more challenging technical issues in \cite{popl05},
< which we hope to avoid if we leave their semantics undefined.  It
< further essentially requires that load and store operations on
< ordinary pointers be atomic, which may require effort for some
< embedded processors.
---
> Some C/C++ compilers appear to spill to new locations anyway, and
> others might decide to now do so to avoid programmer surprises, so
> this particular issue might not be decisive.  But there are
> potentially many additional cases along these lines (for example,
> caching statics).  The need to provide a full semantics for data races
> in all such cases in the Java specification introduces some of the
> more challenging technical issues in \cite{popl05}, some of which we
> are not confident can be applied \Cpp, but which can be avoided if we
> leave their semantics undefined.
189c192
< revisit this issue, and follow a more Java-like model.
---
> revisit this issue, and follow a more Java-like model. 
190a194
> 
234,241c238,242
< Although this remains a controversial issue, and requires further investigation,we are leaning towards
< strengthening the meaning of ``volatile'' to make similar to its meaning in
< Java, and usable for inexpensive inter-thread communication, while
< providing the weaker (and cheaper) semantics currently provided by some
< implementations through a library.  This, for example, makes it
< possible to correctly implement
< ``double-checked locking''\cite{dclpIsBroken}, a common,
< though currently incorrect, programming idiom for multi-threaded C++.
---
> We are leaning towards strengthening the meaning of ``volatile'' to
> make it usable for (relatively) inexpensive inter-thread
> communication.  As discussed in the orginal proposal N1680=04-0120, we
> also plan to define intrinsic library classes that provide more
> flexible and extensive control over barriers and atomic instructions.
243,244c244,251
< Doing so appears to require making volatile accesses atomic, at least
< in some common cases.
---
> The main attraction of this approach to {\tt volatile} is that it
> makes it more likely for programmers to write correct code, by
> following the rule that any variable that is accessed by multiple
> threads without a lock should be declared {\tt volatile}. (There are
> cases where this is unnecessary, but they require thought and care.)
> Double-checked-locking \cite{dclpIsBroken}, Dekker's algorithm, and
> other common constructions ``just work'' if the relevant variables are
> declared {\tt volatile}.
245a253,293
> However, there are a few concerns that make this choice controversial:
> 
> First, this sense of volatile requires read/write atomicity.  This can
> be a problem for types wider than natively supported on a processor.
> On uniprocessors, this may entail some minor overhead to support
> ``restartable atomic sequences'' to keep together multiple read or
> write instructions in the face of interrupts or context
> switches. However, on some multiprocessors, there might not be any
> applicable techniques short of heavy solutions such as the insertion
> of otherwise inaccessible locks.  Thus, there may be types for which
> the {\tt volatile} qualifier either cannot be supported on a
> particular target platform, or would entail surprising time and space
> overhead.  It's not clear to us how such limitations can be expressed
> in the specification, but we suspect that there is some way to do
> so. Guidance on this issue would be appreciated.
> 
> Second, these semantics for {\tt volatile} impact performance.  To
> maintain reasonable efficiency, this approach relies on compilers to
> perform some new optimizations. On multiprocessors, naive translation
> requires a ``write barrier'' of some sort to be issued on each
> volatile assignment. These barrier instructions are almost always
> expensive. However, many of them are not actually required. For
> example two successive writes to two volatiles need only one trailing
> barrier. The optimizations to handle this and most cases that arise in
> practice do not appear to be difficult. However, even with them,
> programs with excessive reliance on {\tt volatile} will be noticeably
> slower. On the other hand, if we choose this route, the detailed
> semantics of {\tt volatile} are likely to be slightly weaker than in
> Java, and so may cost somewhat less even in the absence of aggressive
> optimization.
> 
> Third, it is possible that these semantics might conflict with some of
> the current implementation-defined effects of {\tt volatile} in
> certain compilers.  This does not appear likely thoough.
> 
> We believe that the advantages of this approach outweigh the
> disadvantages, but would like to hear especially from compiler
> implementors about any additional concerns or constraints.
> 
> 
> 
278a327,329
> We would also like to know about the prevalence of such constructions
> in practice. If they are rare, a heavyweight solution may be work out
> fine.
282c333
< Our group has had an interesting internal discussions about the
---
> Our group has had interesting internal discussions about the

