[cpp-threads] RE: A hopefully clearer document on POSIX threads

Nick Maclaren nmm1 at cus.cam.ac.uk
Sun Feb 19 18:33:01 GMT 2006


These are my collected responses.


Valentin Samko <cxxpanel.valentin at samko.info> wrote:
> 
> BH> 2) In the flat model, there is no way to start a thread that outlives a
> BH> function invocation (?)  I think this is a showstopper for certain kinds
> BH> of libraries, though it's no doubt OK, and even convenient, for most
> BH> kinds of HPC programming.
> 
> I believe that if we only introduce a hierarchical threading model in
> C++, most developers will not appreciate this at all. I think we need
> to agree that this model is the most popular model currently used in
> C++ (not considering HPC), and that it is not going anywhere, no matter
> whether the C++ standard officially supports it or not.

While that is true (let's skip the reasons for now), there are two
questions here: WHICH flat model, and whether it can be supported without
destroying many of C++'s principles.

> BH> 3) Depending on the details, it may be possible to directly access local
> BH> variables from enclosing scopes in the hierarchical model.  This is
> BH> clearly more convenient when it makes sense.
> 
> One can still do that with a flat model (although not as directly)
> by creating a local function object which references local variables
> and passing it to another thread for execution. The user (possibly with
> the help of the language) has to guarantee that that function object
> will complete its execution before these local variables are destroyed.

Right.  And specifying THAT is one of the nasty problems.  Remember
that we now have a two-dimensional problem, as distinct from the
one-dimensional one in current C++.

> What happens in terms of performance on a massively multi cpu/core box when
> a single thread calls a memory barrier? Does this impact all the other
> CPU/cores? Will we need a new type of mutexes, memory barriers, ... which
> can be used to only invoke a barrier on a few selected CPU/cores?

What happens is that (a) it doesn't work in the way that the user
expected and/or (b) it serialises the whole machine.  Yes, partial
barriers are needed, but they are only part of the solution.

For complex reasons, only some of which I can explain, barriers are a
complete mess on both distributed and shared memory systems.  Some of
this is because the hardware and software do not provide any suitable
primitives - Blue Gene has gone as far as providing a special network
for barriers - and that is one of the many reasons that I say that
POSIX's treatment of threads as processes is inappropriate.


Hans Boehm <Hans.Boehm at hp.com> wrote:
>
> > Not at all.  The other way is MUCH easier!  All you have to do is to
> > spawn N threads on program entry, and then use the flat model :-)
> > Note that is what most tuned implementations actually do - when
> > a function calls pthread_create, all it does is to activate one
> > of the "thread slots" that were set up initially.
> 
> Yes, but that doesn't really work in the presence of libraries that
> spawn daemon threads on first invocation, or when the library is
> dynamically loaded.  I believe that's quite common, though not in HPC
> code.  I think it has to be supported at some level.

Yes, it does.  I didn't say that the number of thread slots had to be
small - after all, an inactive thread slot needs a single 'void *' if an
array is used and no space at all if a list is.  If you look at pretty
well ANY current system, you will find that there are system parameters
that limit the number of threads per process - the kernel almost always
uses this model at some level.

All that the call really has to do is to provide a handle so that the
controlled process can ask questions about and otherwise operate on
the set of all threads.

> And I think a lot of people would like to be able to create an object in
> the heap, and then have a thread associated with that object which
> lives as long as the object.

Yes.  Don't ever think that I object to clean models like that one!
C++ should clearly at least permit extensions to provide that.

> Right.  But Java for example provides thread pools in a library,
> which can be used to deal with finer grained tasks, basically as
> you suggest.

Ah.  Does it?  It doesn't surprise me - it is an ancient technology.

> And I think kernel-based threads are no longer as expensive as
> people tend to believe.  It's easy to find discussions on the web
> about starting more than 10^5 NPTL (kernel) threads under Linux.

Oh, the Linux community is good at that - talk, I mean :-(

As with most of these complex issues where the complexity is in the
interaction, it is easy to provide scalable demonstrations.  But the
number of those that survive when exposed to Real Requirements and Real
Problems is small, minuscule even.  SGI are the only people I know of
with a Linux that scales to more than a dozen CPUs or so, and they went
through hell to get into that position (mostly during the IRIX era).
Sun had similar hell with Solaris and still trails SGI - I believe that
HP is close to Sun in this respect.  Please don't rattle my cage about
IBM and AIX!

Handling 10^5 threads has been easy for decades using the model of
scheduling them completely separately, which works if they are
semi-independent.  But it is an absolute disaster when they aren't,
and the scheduler makes the wrong assumptions.  I really am NOT joking
about slowdowns by factors of 100+ and even 1000+ - been there, seen
that :-(

> I think most kernel schedulers are now designed to handle context switches
> in constant time, independent of the number of threads.  (Since you
> do need a couple of pages of stacks for each, at a minimum, this sort
> of stuff tends to be easier on 64-bit machines.)

Those claims are both simplistic (they ignore cache and TLB effects)
and irrelevant.  It isn't the time to switch context that causes the
worst of the trouble, and it never was - that was discovered VERY early
(the 1960s?).


Valentin Samko <cxxpanel.valentin at samko.info> wrote:
> 
> I meant the memory fence which guarantees CPU cache coherency and that
> read/write reordering does not cause problems. This is not an issue
> right now on x86 architectures as the CPU cache coherency is guaranteed by
> the hardware (but reordering still occurs), but on Itanium only CPU
> data cache coherency is guaranteed, and I think that such guarantees
> do not make much sense on massively multi cpu/core machines.
> 
> For example, I understand that "interlocked" operations use such memory fences.
> How can these interlocked operations be performed with local memory barriers
> (i.e. local to a few CPUs), considering that other processes may be using that
> memory (if it is shared) as well?
> 
> Don't we need to guarantee the CPU cache coherency across all the CPU's?
> 
> I understand that NUMA is designed to fix this type of problems, but one
> needs to adapt the source code or the language to make a use of it.

Grrk.  I am afraid that the above is severely confused, but the
situation is so complex that I can't easily explain.  To start at the
back:

NUMA (non-uniform memory access) is not about the coherence
policy, but about the performance of local versus remote accesses.
If you mean NUMA that is not ccNUMA (i.e. not cache-coherent NUMA),
that was widespread in the 1980s, performed very well, and attempts
to implement threading languages failed dismally.  99% of even expert
programmers could not handle the model :-(

I favour incoherent shared memory systems, as they are very scalable and
can support 90% of what people want to do, but I would NOT support
shared-memory threading using them.  They support a shared file cache,
SHMEM, process migration, FIFOs between processes etc. very well.

All modern SMP systems provide globally cache-coherent shared memory,
but it doesn't do what most people think that it does.  In particular,
it doesn't do anything about register and non-memory consistency (e.g.
floating-point flags, exceptions, signals), which still has to be done
by the code.  But those aspects are not required by C, C++ or POSIX, and
none of them provide any relevant synchronisation primitives!  This
causes people like me major headaches when a user gets caught out.

So what a partial barrier would do is to synchronise those aspects on
a subset of CPUs, and rely on the ccNUMA to sort out the rest. 
Whereupon you are at the mercy of the architectural models and the
implementation (hardware AND operating system) - strong versus weak
consistency (and those terms keep changing meaning) and so on.


Lawrence.Crowl at Sun.com wrote:
>
>  >Not at all.  The other way is MUCH easier!  All you have to do is to
>  >spawn N threads on program entry, and then use the flat model :-)
>  >Note that is what most tuned implementations actually do - when
>  >a function calls pthread_create, all it does is to activate one
>  >of the "thread slots" that were set up initially.
> 
> That works until you need N+1 threads.  While recycling threads is a
> good idea, I don't think they've abandoned the ability to create
> threads on demand.

See my comment above.

>  >    The scoping problem.  Threads are started within a scope, and then
>  >are permitted to escape that scope.  This could be closed by forbidding
>  >threads to use anything with less than global scope.
> 
> How is this problem any different from stack allocation versus heap
> allocation?  When the problem is chaotic, the flat threading model
> seems most likely to minimize resource use.

While that is true, the C++ standard has to jump through hoops to
specify what is defined and what is not.  I am simply stating that it
will have to jump through as many hoops - AND then integrate them with
the existing hoops - to use any flat model.  See my remark to Valentin
about two- versus one-dimensional problems :-(

>  >    The thread/process isolation problem.  Are threads allowed to do
>  >things that override what other threads are trying to do, and how
>  >are conflicts resolved?  Think about one thread changing the global
>  >state or suspending a thread in the middle of stack unwinding.
> 
> The memory model is the answer to changing shared state.  This issue
> applies with both flat and hierarchical models.

No, that is not true.  It is REQUIRED for both models, but it is NOT
the ANSWER - i.e. it is not sufficient on its own.

> The problem with suspending a thread in the middle of stack unwinding
> is inherent, and occurs in both models.  If a thread has gone into an
> infinite loop, the only course of actions to save the system is to kill
> the thread, and suffer the consequences.

I am afraid not.  Firstly, I am not talking about just infinite loops,
but the decision about what to do when that happens (which has to be
taken without knowing whether the other thread is looping indefinitely
or just doing something very complex).  Secondly, the most common
hierarchical models don't allow threads to cancel/suspend their
siblings.

>  >  OpenMP's approach to C++ exceptions is badly flawed, and a message to
>  >that effect would be worthwhile.
> 
> I was in on the early discussions, and there was no consensus for
> requiring exception propagation.  Part of the issue is that it's hard
> to say what happens to the other work when an exception occurs.

OK.  So it is a good idea to simply ignore unhandled exceptions?  That
is the model used by most GUIs and several other standards, but C++
comes out strongly against it.

If OpenMP had said that it was implementation-defined whether an
unhandled exception would be propagated or would call terminate(),
I would have no problem with it.  But it doesn't.

> Given that Windows thread semantics are also a constraint on the C++
> threading model, I suspect that the C++ thread model may not be
> able to directly implement the OpenMP model.  That is okay, though,
> because implementers can use underlying implementations.

Aargh!  If the models are incompatible, it means that EVERY program is
undefined, and so it becomes impossible to report even the most
outrageous bug, because 2nd- or 3rd-level support throws out the
report as programmer error.  So the bugs never get fixed.  Please don't
tempt me to name companies ....

> In practical terms, modern Unix systems provide Posix threads, and
> compiler implementors have very little opportunity to define or use
> anything else.

They have plenty of opportunity to define other things - after all they
don't provide C++ classes - but I agree that they may have to use them.
However, I really DO mean that I have (mostly second-hand but some
first-hand) experience of implementing one threading model on top of a
different one.


Hans Boehm <hans.boehm at hp.com> wrote:
>
> Assuming you mean what's also called a "memory fence", i.e. an
> instruction that enforces certain kinds of inter-thread memory ordering,
> then that's generally implemented locally, I believe.  I don't think
> it gets dramatically more expensive in large machines.

I am afraid that it does.  Badly.  However, as always, there is a
tradeoff between the expense of the hardware and the expense of slowing
down.  But we are talking about quite large factors - for example, one
needs a directory and not a snoop cache above about 8 CPUs.

> If you are talking about barriers that require all threads to reach the
> barrier before any can proceed, then I think there are standard techniques
> that require time logarithmic in the number of processors.  Those may
> get a bit slower, though other techniques may apply with
> multiple processors on a chip.

In theory.  In practice, the problem is that the scheduler often gets
its knickers in a twist.  The root cause is the lack of suitable
hardware and operating-system primitives for threading (as distinct
from processes).

> The machines we're directly addressing are all cache-coherent, i.e. the
> hardware assures that all the caches are consistent.  Typically this
> means that a machine can only update a cache line after acquiring
> exclusive ownership of the line, thus ensuring it is cached nowhere
> else.

Yes.  Starting from incoherent shared memory, the only sane place to
go is message passing and/or explicit data sharing (e.g. many SHMEMs).

> This does not ensure sequential consistency since, for example, all
> modern processors contain a store buffer, and thus stores appear to
> complete, and the result becomes locally visible, before the value makes
> it into the cache, or becomes visible to other processors.  

And, equally seriously, between registers and the store buffer.  It gets
worse on systems like the System/370, Alpha and Itanium, all of which
may get delayed requests for software fixup that are visible to the
program.  This probably occurs on the x86, too, in its older modes.

> "Interlocked" instructions are generally also implemented by first
> getting exclusive access to a cache line, and then performing the
> operation locally.  Similar observations apply.

I wish :-(

The killer is that so much of that requires privilege, so unprivileged
applications have to abuse other mechanisms to get the same effect.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


