[Stackless] Is Stackless single core by nature?

Henning Diedrich hd at authentic-internet.de
Thu Jul 9 17:20:23 CEST 2009

Hi Richard,

coming back to the thread, I meanwhile found the treasure trove to dive 
into for my questions: http://wiki.python.org/moin/ParallelProcessing

That's a list of efforts to nudge Python towards 
parallel/multicore/etc. processing. Did you work with any of them? Is 
there any specific or general rule as to which will work with Stackless?

Does MPI4Py?

I am not yet done looking at all of them.
>>> You can take the basic
>>> functionality and build up your own framework around this.  Tired of
>>> callbacks?  Make a function that wraps an asynchronous operation in a
>>> channel and whatever calls it will just read as a synchronous call.
>>> Of course, a programmer needs to be aware of the effect of blocking
>>> and when blocking might happen on the code they write, but in practice
>>> this is rarely much of a concern.
>> Could I do this if I left single core behind ... ? To my eye that is part of the advantages you achieved with the very clear architecture decisions you opted for with EVE. The more flexible and complex ways you had referred to, might have turned out way more complex in this regard.
> I don't understand what you are asking here.  The ability to provide a
> function that blocks in a synchronous way wrapping asynchronous IO is
> a benefit that comes with any real coroutine-like solution.  And it
> can be applied as a building block in any framework you build, whether
> one core/process or multiple cores/process per core.
But in an SMP environment you run into concurrent resource access, as 
one effect of blocking; issues that you are completely isolated from 
when staying on one core, protected by the guaranteed sequentiality 
that this provides.

In that sense I had referred to "a programmer needs to be aware ... but 
in practice this is rarely much of a concern": I wondered if this may get 
much more complicated as soon as you have multi-core concurrency and 
the need to protect resources from contention.

The hiding of asynchronous operations - in its simplest incarnation in a 
physically sequential environment - is not called into question. But the 
implications of such a layout in a distributed or multi-core 
environment, across multiple blades even, at best 'transparently' so, 
are a different matter.

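The channel-wrapping Richard describes above can be sketched even 
without Stackless installed. In this toy version a queue.Queue stands in 
for stackless.channel, and fake_async_add is a made-up callback-style 
API, purely for illustration:

```python
import threading
import queue

def call_synchronously(async_op):
    """Wrap a callback-based asynchronous operation so the caller just
    blocks and reads the result, as in a plain synchronous call.
    A queue.Queue plays the role of a Stackless channel here; with
    Stackless you would use stackless.channel() and ch.receive()."""
    ch = queue.Queue(maxsize=1)
    async_op(ch.put)        # the completion callback feeds the channel
    return ch.get()         # block until the result arrives

def fake_async_add(callback):
    # made-up async API: reports 1 + 2 via callback a moment later
    threading.Timer(0.01, callback, args=(1 + 2,)).start()

result = call_synchronously(fake_async_add)   # reads like a sync call -> 3
```

The difference under Stackless would be that only the calling tasklet 
blocks on the channel, not the OS thread; the Queue version blocks the 
whole thread, which is exactly the gap StacklessIO-style wrappers close.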
That's very much what I am looking for: the *implications* of hiding 
complexity (in the extreme, even 'buried' inside the language itself) 
and of different syntactical approaches in different languages. And by 
default I would *not* like to see vital functions 'out of my reach', 
whether hidden inside a function or a language; see below.

But eventually it should be 'easy' to use, as much as possible, for the 
very reasons you cited that speak for Python as a language. MPI4Py, 
then, at least unwrapped, seems not really what I am looking for by 
itself, but probably a stepping stone. The blocking tutorial samples 
are nice, but the *asynchronous* stuff comes with strong warnings. It 
would have to be wrapped again to make it robust, and maybe that is the 
way to go. I can't tell if that can still end up as 'elegant' as 
I'd hope.

The promise with Erlang is that the restriction of "no shared state" is 
going to provide for the transparency of messaging, which is then, by 
design, deadlock-free. Depending on how much you have heard about it, 
that may even sound hard to believe, but it's a deceptively simple 
principle. Without it, i.e. with shared state, you'll have the usual 
opportunities for races and deadlocks. And yes, that can be done right; 
and still, in practice, from a certain system size on, it can 
also turn into a nightmare.

And it's a factor here that not everyone in a team of programmers may 
turn out to be a genius (I won't protest if you tell me that at CCP you 
all are ;-), so restriction by design, if well chosen, may save a hell 
of a lot of time.

But all this, regarding "architecture decisions", is what you avoid when 
you stay on one core, basically staying protected by the underlying, 
ensured sequentiality of (micro-)threads while writing parallelly 
formulated code.

(1) -  Joe Armstrong, Erlang's Rossum, on concurrency strategies (among 
other things): http://www.pragprog.com/articles/erlang

(2)  - Guido v. Rossum pro GIL (and why removing the GIL is *not* what 
people *really* want): 

>>> Stackless has a scheduler which runs on a real thread, and
>>> all microthreads created on that thread are run within that scheduler.
>>> You can have multiple threads each running their own scheduler, with
>>> their own tasklets running within them.
>> Can channels reach out of their interpreter/scheduler? Or can a Stackless
>> interpreter run across multiple cores, or even blades? Are there modules or
>> extensions that provide for this, or for transparency in this regard?
> This is mostly on the user.  Stackless is a basic set of functionality
> (scheduling of microthreads, microthread serialisation).  There are no
> modules or extensions to take it further in other directions.
> However, if you can ensure your newly launched thread goes on the core
> you want, then the interpreter can be considered to run across
> multiple cores.  This is a Python problem, not a Stackless one.
As I said, I am still going through the approaches presented at the 
wiki page mentioned above.

Is there a rule of thumb, or a list, of what modules and libraries run 
with Stackless?

I am still hoping to find the Stackless-compatible concurrency support 
I am looking for. But otherwise, would Stackless then stay close to, 
and 'on top' of, the main Python branch, which in turn will likely not 
implement true multi-threading, as that is obstructed by the GIL 
philosophy? (See (2) above.)

Even if Stackless is not originally about multi-core or distributed 
processing: just as it is not a *language* issue that CPython has the 
GIL, but an implementation issue of CPython (as discussed at (2)), 
would not the Stackless syntax be just what one would want to use 
multiple cores and distribute calculations to multiple computers?
Potentially extended (or reduced!) to deal with shared resources? It 
just seems to lend itself to that exceptionally well, and would not 
fail at the last hurdle as Java does (see (6) at the bottom); the last 
hurdle being microthreads, which make Erlang and Stackless seem very 
much alike.

Would not even EVE have to expect that, in the future, blades will 
become faster at a much slower pace, measured per core, but offer 
more cores instead as today's proposition of speed improvement? Growth 
by hardware should get harder to realize staying with one core. But 
maybe you fork out different stuff to keep the cores busy in a 
different way.

Erlang got multi-threaded only quite recently, in 2007:
(3) - 

As would be expected, with no language changes; only the VM was adapted, 
which the people at Ericsson were rightfully proud of. I imagine the 
Erlang hype of 2007/8 was fired up by this fact. I had initially thought 
Stackless was just as destined for that feat.

This may neatly clarify similarities and differences between Stackless 
and Erlang (Joe Armstrong, quoted from (3)):

"The Erlang VM is written in C and run as one process on the host 
operating system (OS). Within the Erlang VM an internal scheduler is 
responsible for running the Erlang processes (which can be many 
thousands). In the SMP version of the Erlang VM, there can be many such 
schedulers running in separate OS threads. As default there will be as 
many schedulers as there are processors or processor cores on the system.

"The SMP support is totally transparent for the Erlang programs. That 
is, there is no need to change or recompile existing programs. Programs 
with built-in assumptions about sequential execution must be rewritten 
in order to take advantage of the SMP support, however."

That this worked was because of the way that Erlang had focused on 
making distributed computations possible: again, the paradigm of no 
shared state. As this is inherent in Erlang, Erlang could transparently 
be made to use multi-cores.

Even if Stackless cannot follow that leap, my impression was that it 
may be the natural starting point for Python to get there, if probably 
with syntactic modifications needed. It comes from a different approach 
of (not) dealing with state in concurrency, but seems as 
microprocess-centered by design as Erlang.

I maybe just haven't found the project that is doing this just yet. Or 
there is a fundamental problem that yet eludes me (I know that Erlang 
followers would immediately second that. But 'shared state' is not 
universally thought of as a bad thing).
> Running a Stackless interpreter across multiple blades makes no sense
> - as such Erlang wouldn't be able to do it either.  A program runs on
> one machine, not several, is what I am saying.  And the Python
> interpreter is a program.
Just wanted to make sure that I am not missing a point that was too far 
out for me to imagine.

Erlang is somewhat transparent across multiple machines, though: as if 
channels worked exactly the same way across machines as locally.

But sure, there'd be one Erlang VM running per machine.

Maybe that's the Gordian knot: thinking of a single multi-computer VM.
> I should note that while at CCP, I wrote part of a framework that ran
> an agent on each machine involved.  There was a master program and it
> would communicate with each running agent telling it to start
> sub-applications to farm off work to.  All programs, whether agents,
> master and sub-applications were specialisations of the CCP Stackless
> Python based application.  There was no pickling involved, however.
> Unless I am mistaken, this sort of arbitrary ability to start up
> instances of the interpreter on involved machines is as close as you
> would be able to get to "or even blades", no matter the language (and
> framework) used.

Plus, what you did with no pickling is probably close to the Erlang 
philosophy (if not literally, because you can send all sorts of things 
with an Erlang message): if you didn't pickle, you probably also did 
not expect state sent back as immediate answers, except for basic 
'ok's. Which is close to Erlang's 'return-less' (actor model) messages.
>> That pickling works even across diverse OSses is an exciting feature (2).
>> And I am still working to get my head around what happens to state when
>> sending tasklets over to another box (3). It doesn't look quite trivial.
> I don't have a clear picture either.  But it should be something that
> someone who intends to use this functionality should be able to easily
> get a handle on with a little experimentation.  In my book, it is
> better to have a choice in how this works (as you would with
> Stackless), than to have an inflexible predetermined solution forced
> upon you (as I think you get with Erlang, but may be wrong).
Same here. That it is *not* an integral part, and thus less removed, 
can be considered a *plus*, as it stays accessible, changeable, 
fixable and tuneable.

On the other hand, Erlang's track record is impressive enough to 
interpolate that so much work went into solving exactly these problems 
that it should not be passed up, and that that wheel need not be 
re-invented. Where it is *not* a language issue but one of 
implementation, I trust that they found out over the years which caches 
make sense where and which unintuitive modifications bring extra speed 
and/or stability.

However, you'd find out the contrary only with much pain. At worst after 
the system is ready for prime time and suddenly starts to sputter. I 
spoke with Thorsten Schütt of Scalaris and he confirmed in a way that 
'soft real time', as the Erlang claim goes, does not mean 'real time'.

But the real difference is in the concurrency philosophy of Erlang that 
simply prevents the worst sort of problems from happening in the first 
place. That productivity argument should be easily acceptable in the 
Python world.

Specifically speaking of pickled tasklets though, Erlang does not send 
processes around.
> Regarding Pyro.  Its webpage says "you just call a method on a remote
> object as if it were a local object".  For this to be a true
> statement, it must block the current thread while the call takes
> place.  This would be incompatible with Stackless, as the tasklet
> making the call would block the scheduler preventing any other tasklet
> from running.
Thanks for looking into that. I found it hard to find any mention of Pyro and Stackless together, which only supports your conclusion.

This reminds me of the raison d'être for StacklessIO. Could it yield equal rewards? You had mentioned a drop-in you wrote with the same functionality as StacklessIO. How difficult would it be to shift the blocking to the level of the tasklet, away from the system thread, with Pyro?

Shouldn't that be rather painless, given that Pyro is native Python (http://pyro.sourceforge.net/manual/1-intro.html)?

Or am I missing something there?

> Writing an RPC mechanism using Stackless is straightforward, if you
> are familiar with networking and Stackless.  Here is one I have
> written:
> http://code.google.com/p/stacklessexamples/source/browse/#svn/trunk/examples/networking/stacklessrpc
> It is possible to write a simpler version of course, my one being a
> little abstract.
>> But is pickling fast enough to do more interactive stuff than load balancing
>> (e.g. loading complete solar systems off to a different blade that has
>> better hardware or because the current blade had more than one solar system
>> mounted). Is it fast enough to completely distribute entities?
> I have no experience with this, so cannot say one way or the other.
I have meanwhile read some about MPI4Py 
(http://pypi.python.org/pypi/mpi4py), and what they do is access the 
memory blocks where the Python objects lie directly, as binary, to 
avoid the overhead of pickling and marshalling. Unless endian 
conversion is needed, and unless the underlying MPI lib adds some 
overhead I am not aware of, that should be as fast as it gets.

And again, doing this with asynchronous calls gets a big warning in the 
manual, because the programmer must see to it that the state of the 
memory block being sent never changes while the send is pending, as on 
a deeper level there may obviously be partial sending, or re-sending, 
going on. (See the manual in the installation package, docs/mpi4py.pdf, 
p. 4. I did not find it online.)

MPI4Py does that to allow for better performance, and even for sending 
"objects bigger than half the available memory" (:-).
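The hazard the manual warns about can be modeled in a few lines of 
plain Python (no mpi4py involved): a generator 'transmits' a buffer 
chunk by chunk, like a pending non-blocking send, and the buffer is 
mutated mid-flight:

```python
def chunked_send(buf, chunk=4):
    """Toy model of a pending non-blocking send: 'transmit' buf one
    chunk at a time, yielding between chunks while the transfer is
    still in flight."""
    sent = bytearray()
    for i in range(0, len(buf), chunk):
        sent += buf[i:i + chunk]   # each chunk is read at transmit time
        yield bytes(sent)

buf = bytearray(b"hello world!")
tx = chunked_send(buf)
first = next(tx)              # first chunk b"hell" is on the wire
buf[:] = b"XXXX YYYY ZZ"      # buffer mutated while the send is pending
received = list(tx)[-1]       # drain the rest of the transfer
# received == b"hell YYYY ZZ": a mix of old and new bytes,
# neither the original message nor the new one
```

This is only a model of the failure mode, not the mpi4py API; the real 
remedy is the same, though: leave the buffer alone until the request 
has completed.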

Erlang is "different" here again. As I understand, exactly for this very 
occasion. Variables in Erlang cannot change state. Once a variable is 
bound to a value it cannot be assigned a new value, ever. That should in 
fact allow for peace of mind when using the internal, binary buffer of a 
variable itself to send a 'copy' of state over the network or to another 
process. And again, it's not only about what is possible but what 
improves productivity. These restrictions are for the robustness of 
concurrency what Java's elimination of pointers was for the robustness 
of memory management (I know, there were others before Java and before 
Erlang, respectively, and it was all already there in LISP a 
century ago).
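Erlang's single-assignment can at least be approximated in Python. In 
this sketch the Account type is hypothetical, just for illustration: a 
frozen dataclass refuses in-place rebinding, so 'updates' yield fresh 
values and a pending send of the old value stays safe:

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class Account:
    """Illustrative immutable value; each field is bound once, for good."""
    owner: str
    balance: int

a = Account("alice", 100)
try:
    a.balance = 50             # in-place rebinding is refused outright
except FrozenInstanceError:
    pass
b = replace(a, balance=50)     # an 'update' is a brand-new value
# a is untouched (balance still 100), so its bytes could safely sit in
# a pending send while the program moves on with b
```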

This is the 'discipline' question again. Certainly, "no shared state" 
and "no re-binds" can be *emulated* in any language that can share state 
and change assigned state. But then, the problem is not the theory; 
rather, actively preventing this from happening, as Erlang does, may 
save a crucial lot of nerves and time, and make more complex things 
possible, because it keeps the 0.001% of inadvertent stupid things from 
ruining five nines of good.

In the concrete case of MPI4Py and async sends, I am not sure why the 
immutability concept in Python does not pre-empt this problem. Is that 
because MPI4Py works on the memory level even with complex data 
constructs, which *are* mutable?

However, immutability in Python seems to come from the same corner as 
that in Erlang: improved clarity, fewer chances for errors. I can't 
tell for sure.
>> As this is what I can't yet fathom about Erlang, how it's paradigma of not
>> sharing state may work well for telecom but not for games. Since that virtue
>> is achieved by taking the liberty from the programmer, it could be
>> replicated by discipline in other languages. But the language inherent
>> features of Erlang would have to be coded in Python, most everytime that
>> they would come into play, making the source more complicated, losing
>> readability.
> I don't believe I agree with this.  Maybe you are thinking of some
> Erlang features I am not familiar with.
It's along the lines that Stackless introduced a whole new concept, 
making possible a whole different way to formulate solutions. But very 
much by changes 'under the hood'. Will it be possible to bring Erlang's 
model to Python without loss of style?

I shall bother you with details, if you care, in a nutshell.

In Erlang, a message is sent like this: Receiver ! Message.

And the receiver can be anywhere.

To some extent this sending syntax is the equivalent of a method call in 
OO. It is the way Erlang "processes" communicate. And processes are as 
ubiquitous in Erlang as objects are in Python. In fact, they can be 
regarded as objects in some sense (see (8) below; really 'actors'), and 
as such <Receiver ! Message> is as common as, and some say really *is*, 
a method call in Erlang. Again, some will agree, some won't.

Python is about making things elegant and simple, to achieve readability 
and, in the end, productivity. It goes further; as Torvalds lectured the 
Google crowd recently, "if you can do something really fast and really 
well, people start using it differently", which holds for performance 
and for elegance and simplicity, too, I'd say. This is why the 
abstraction and transparency of, in essence, any "method call" (message 
sending) in Erlang may make a difference: *it's in the same form*, no 
matter how far it travels, to a local or non-local receiver. Like, say, 
with UDP, there is no (explicit) handshake as I understand it, with the 
same gains and drawbacks. As Thorsten of Scalaris described, he started 
out on his laptop and then found it "made no difference" whether he 
simulated 100 nodes locally or distributed them over several machines.

<Receiver> can contain the equivalent of a function name that is 
registered with a name service, and some of the remaining address part 
is a proper domain name. <Message> can be pretty much anything allowed 
as an Erlang expression. But it is never a reference, not even if the 
message goes to an Erlang process in the same system process; it is 
always a copy. This is essential to the way Erlang works, and this is 
what I have doubts about: it may work better for telecom than for many 
other things and may confine Erlang to its niche.

However, this construct will call any 'method' anywhere in the visible 
network of nodes: in the same process, on the same core, on the sibling 
core of the same CPU, on a co-CPU, or on a remote machine.

Such a call is always one-way and non-blocking; it does *not* return a 
return message from the Receiver. If the receiver wants to send 
something back, it would do so the same way: sending a message back, 
one-way.

The counterpart to the sending construct is a receive block, which 
always blocks, but can easily be given an individual timeout as a 
native part of the language.
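Both halves, the one-way send and the blocking receive with timeout, 
can be imitated with a plain mailbox. This is a sketch only: queue.Queue 
stands in for an Erlang mailbox, and a ("timeout",) tuple plays the role 
of the after-clause:

```python
import queue

class Actor:
    """Mini actor: a mailbox plus one-way send and blocking receive,
    echoing Erlang's  Receiver ! Message  and  receive ... after ... end.
    Unlike Erlang, messages here are references, not copies."""
    def __init__(self):
        self.mailbox = queue.Queue()

    def send(self, message):
        # fire-and-forget: never blocks, never returns an answer
        self.mailbox.put(message)

    def receive(self, timeout=None):
        # blocks until a message arrives, or until the timeout fires
        try:
            return self.mailbox.get(timeout=timeout)
        except queue.Empty:
            return ("timeout",)

echo = Actor()
echo.send(("ping", 1))             # Receiver ! Message
msg = echo.receive(timeout=0.1)    # -> ("ping", 1)
late = echo.receive(timeout=0.05)  # empty mailbox -> ("timeout",)
```

The reference-vs-copy difference in the docstring is exactly the point 
made above: Erlang always copies, which is what makes the same syntax 
safe across machine boundaries.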

With "losing readability" I referred to my wondering whether the actor 
model can be emulated in Stackless, or programmed on top of it, with 
the result being something as elegant and addressee-distance-unaware as 
in Erlang. Or whether there is a Python package out there for this that 
will work with Stackless. I still have some candidates to check out.

The mentioned "discipline" would be to make only calls that are 
non-blocking and never expect anything back. Have receiver functions, 
to be called from remote processes, that are blocking, as you 
suggested, and never return anything to the remote process. The 
discipline would also demand not sharing state, which would provide 
for freedom from deadlocks. That might do; I can't think it through to 
the end yet. But I can also hardly imagine a bunch of programmers 
sticking to such 'voluntary' restrictions without deviating "for good 
reasons", of course, here and there. So, yes, a framework would have to 
be built and declared mandatory, knowing full well that it is a subset 
of the available possibilities, a restriction for good reason. I have 
no idea how complicated and/or inelegant the use of such a framework 
would turn out to be. MPI itself sure sounds like a different planet 
than Erlang.

If I could kindle your interest, I found the following post by Slava 
Akhmechet a rewarding read, both for humor and enlightenment.

It plays through the thought of how Java could be extended in the 
direction of Erlang, and why, and what for. It also stops exactly at an 
insurmountable hurdle for Java, which happens to be Stackless' 
specialty: microprocesses. 
(6) - http://www.defmacro.org/ramblings/concurrency.html

This is a commendable article by Bruce Tate of IBM, which is looking at 
Erlang from the Java angle, too:
(7) - http://www.ibm.com/developerworks/java/library/j-cb04186.html

Ralph Johnson explains why Erlang processes are objects, even if this 
should send Joe Armstrong, Erlang's creator, kicking and screaming: (8) - 


Where I currently got to is Candygram 
(http://candygram.sourceforge.net/overview.html), explicitly an Erlang 
epigone, which has looked quite quiet since 2004. It is probably also 
suffering from the fact that it can't have many threads, nowhere near 
the numbers of Erlang and Stackless. I can't tell why it wouldn't run 
with Stackless; surely you can?

If you would have an answer to that it would be very much appreciated.

There is a post from 2006 on 
, Bob Ippolito answering Ivan Krstic':

"Candygram is heavyweight by trade-off, not because it has to be. 
Candygram could absolutely be implemented efficiently in current
Python if a Twisted-like style was used. An API that exploits Python 
2.5's with blocks and enhanced iterators would make it less verbose
than a traditional twisted app and potentially easier to learn. 
Stackless or greenlets could be used for an even lighter weight API,
though not as portably."
. . .

"> * Introduce microthreads, declare that Python endorses Erlang's 
no-sharing approach to concurrency, and incorporate something like
"> candygram into the stdlib.

"We have cooperatively scheduled microthreads with ugly syntax (yield), 
or more platform-specific and much less debuggable microthreads with 
stackless or greenlets.

"The missing part is the async message passing API and the libraries to 
go with it."

End of quote.
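The "cooperatively scheduled microthreads with ugly syntax (yield)" 
that Ippolito mentions amount to something like this round-robin 
scheduler over generators; a minimal sketch, with no Stackless or 
greenlets required:

```python
from collections import deque

def scheduler(tasks):
    """Round-robin cooperative scheduler: each task is a generator that
    yields to hand control back, the 'ugly syntax' microthreads the
    quote refers to."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))   # run the task up to its next yield
            ready.append(task)         # still alive: requeue it
        except StopIteration:
            pass                       # task finished, drop it
    return trace

def worker(name, steps):
    # a microthread that yields control after each unit of work
    for i in range(steps):
        yield f"{name}:{i}"

order = scheduler([worker("a", 2), worker("b", 3)])
# round-robin interleaving: ['a:0', 'b:0', 'a:1', 'b:1', 'b:2']
```

The missing piece Ippolito names, an async message-passing API, would 
then sit on top of exactly such a scheduler, with yields hidden behind 
send/receive calls.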

*This* is exactly what MPI4Py now *does* provide, I take it, though 
rather 'dangerously', as outlined. Could it be wrapped and brought into 
the form of Candygram?

I will look around still more, since I am still suspecting that what I 
am looking for is probably out there already.

What puzzles me is how you seem rather unfazed about these multi-core 
issues. Isn't Stackless *the* place from where this should come to 
CPython? Is the potential in this irrelevant for some reason I am 
missing? Or for some reason uninteresting for CCP?

Best regards and thank you for your, or any other readers thoughts on this,
