[Stackless] Porting Stackless

Christian Tismer tismer at tismer.com
Wed Mar 12 03:55:20 CET 2003

Paul Sheer wrote:
> Besides that, I'm going to go for a completely new
> strategy, starting tomorrow. IronPort has made me
> re-think things in a radical way, which solves everything.
> penny for your thoughts?

Two Eurocent, and I'm fine. :-)

The problem was how to integrate Stackless 1.0 with 2.0 .
The reason is, besides other good reasons which come
secondary, that I was unable to implement IronPort's
scheduler without changing it for 2.0 .

This code uses a frame pushing paradigm: Push a frame
with a special f_execute function, and let some action
happen. This concept was abandoned with Stackless 2.0,
among some others.
The virtue of frame pushing is, that it can avoid almost
all C stack creations, by postponing the next action
either directly or indirectly via the f_execute action
of a frame.

There are now two ways to integrate this with Stackless 3.0:
a) the lame way: Allow for frame pushing style, but don't care
    about it. That will be the intermediate solution that will
    be ready next week. The existing IronPort scheduling scheme
    will be able run run underneath Stackless 2.0 concepts, in a
    single tasklet. This works, but it is, as said, a lame hack.
b) Rewrite the full Stackless 2.0 implementation using the
    frame pushing style. That is not as hard as I feared.
    But it allows us to avoid most of the C stack saving
    operations in tasklet switches and chanel operations as
    well, and I can express IronPort's scheduling scheme in
    terms of tasklets and channels. Their processes will then
    not be run inside one tasklet, but they wil *be* tasklets.

I will implement (or have already implemented) a) to get
things started. But as soon as possible, we should start to
work on b), since this is the real way to go.

Let me give you an example:

Today, there is an internal function schedule().
It's task is to switch to the next tasklet to be run.
The implementation is brute-force right now:
Identify the current frame's C stack state, save it in
a CStack structure, examine the target's CStack structure,
modify the C stack and return. This is the current switching

After re-enabling the most basic principles of Stackless 1.0,
there is now a frame_dispatch function, which sits on top
of all interpreter incarnations.
Regular Python function call:
The idea is, instead of having multiple interpreters being
nested on the C stack, we just save the current execution
state in the current frame, push the newly created frame
of the function to be called, and return to the dispatcher.
This is done by returning an Py_UnwindToken object instead
of the (not yet computed) function result.
The dispatcher will run this new frame on the top of the
frame stack, without nesting another interpreter level.
This works today, already.

In the same context, alternative a) almost works today,
already, despite a little change to the function interface
since Stackless 1.0.

Now, Stackless 2.0 has already introduced so-called CFrames,
which are frames, which don't have to run an interpreter.
They are simply calling a CFunction, and they can be the head
of a tasklet. They can spawn a Python function, but they needn't.
But all the switching code was in the brute-force
style of C stack modification.

Now, here comes the big harmonization idea:
However a frame is a Python frame or a C frame, they are
all compatible (will be compatible).
If a frame shows up in the dispatcher, it will be executed.

What about our schedule() example?
There will exist an schedule_nr() function in addition.
This is a non-recursive schedule function.
Instead of the outlined action of changing the C stack,
schedule_nr() can instead postpone its action by creating
a CFrame, which has an f_execute bound to a special
schedule_callback function.
Instead of actually switching stacks, schedule_nr builds
such a CFrame object, pushes it on the frame stack, and
returns the Py_UnwindToken.
That means, the execution of this action passes up above
the current interpreter, and back into the dispatcher.

The dispatcher will simply grab this CFrame's f_execute and
call it. This is the schedule_callback function, which
now determines, that it is not run from inside of an
interpreter, but by the toplevel dispatcher. It will then
examine the target tasklet's frame's CStack structure,
and if that one doesn't exist, then these two tasklets,
the previous and the future one, are compatible, since
they both can be run from within this dispatcher function.
The action to be taken will then only be to swap the
current tasklet, modify the tstate->frame, and to execute
the next frame, as always. So we have avoided to to any
C stack modifications at all.

In the case that the target frame already *had* a CStack
attached, the switching will now occour the same way as
it is done in Stackless 2.0 . But this case is rather
seldom, I think it occours in less than 20 percent.

In effect, I have to re-write all stack switching functions,
to provide an XXX_nr version, which tries to avoid to do
the hardware switching by unwinding the C stack.

This is the concept, which I've been missing for months,
and it is definately the way to go. This way of tasklet
switching in much more effective on certain platforms,
and it can be done without a single line of assembly code.
It also opens up a "hardware-less" alternative of Stackless,
with the same or some more restrictions of Stackless 1.0,
but things will work to some extent, everywhere.

But there *are* situations where this concept cannot be
implemented easily, since we are either sitting in some C
extension, or we are in a complicated abstract.c function,
which is implemented using lots of possible recursive calls
into Python code, which we don't want to have to patch.

Stackless 1.0 could not switch tasks in these contexts.
Stackless 2.0 always switched tasks, the hard way.
Stackless 3.0 does the appropriate thing, one of both.

However, frame-pushing is more efficient than C stack
modifications, since the generated code is completely
predictable by the C compiler, and there is no additional
memory allocation for the copy of the C stack, and no time
involved in moving it around.

For Stackless, this merge of technologies has a couple of
benefits: Stackless can be run without hardware support,
switching is more effective in most cases, and this also
grows a good criteria for automated scheduling, like in
ancient microthreads: Automated scheduling is safe when
it is done in the toplevel interpreter, with no other C
levels involved.

For IronPort, the benefit is two-fold: They can use
the a) alternative, to get their existing code to
run quite unmodified, soon, using alternative a).
After that step, they can re-formulate their task
switching in terms of tasklets, which now support
the frame pushing protocol as well, as depicted in the
lengthy example. The code *has* been written in the quite
cumbersome continuation passing style of frame pushing,
but there is no reason to change it. It just would be
adjusted to use a derived tasklet type to express threads.

For new stuff to be implemented, they now have both ways:
Use the hard way to provide callbacks, frame pushing,
and call XXX_nr function which return an Py_UnwindToken,
or simply be straight, don't care and just switch tasklets
by calling schedule(), using the hardware protocol.

Yes I know, this is no simple stuff, and I really should
start to write those promised Wiki pages.

cheers - chris

Christian Tismer             :^)   <mailto:tismer at tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/

Stackless mailing list
Stackless at www.tismer.com

More information about the Stackless mailing list