Stackless Python

Draft V.1.5, Christian Tismer 990612

Updated for release 0.3 on 990712, but still needs more work to reflect all the changes. Please see also the file changelog.txt for the last update.

Status of this paper:
This is a draft in flux and still needs some work. The implementation seems to fulfil all of my needs so far.

Acknowledgements

I want to thank all people who helped me to turn this idea into reality.

Special thank applies to

A little warning before

This paper might cause some damage to your brain. At least mine had to undergo a couple of changes before I was able to think stackless enough to write a stackless version of the interpreter. Not the code which I will describe is difficult, but the problem is by itself a brainteaser.

Fasten seatbelts and let me lead you through the whole story, on your own risk.
The current source snapshot, together with a Python 1.5.2 compatible build of python15.dll is available at

ftp://ftp.pns.cc/pub/stackless_990713.zip
ftp://ftp.pns.cc/pub/stackless_990713_win32.zip

What does it mean to be stackless?

The standard Python interpreter is written in C. Every Python action is somehow performed by some C code. Whenever a piece of Python code is executed, a new incarnation of the interpreter loop is created and executed.

First of all, let's see what it means to have a C stack.

Consider a Python function a() which calls a Python function b() which calls a Python function c(). In the context of function c(), the C interpreters of all three functions are still alive. They are keeping their state on the C stack. When the Python functions come to an end or an exception is raised, the C functions are popped off the C stack. This is called "unwinding the stack".

In that sense, Python is not so different from other C programs, which are usually all stack based. But this is not the full story, since Python does quite a lot more than using the C stack.

Every running piece of Python code also has an associated Frame object. A Frame object is something like a running instance of a code object. Frames are used to hold local and global variables, to maintain the value stack of the byte code interpreter, and some other housekeeping information.

These Frames are chained together in a last-in/first-out manner. They make up a stack by themselves. And they do this in a way quite similar to the C stack.

You might say "well, fine, but why should I care?"

I will try to explain why I believe that being stackless is relatively easy, and that it can make very much sense.

Why I believe that the C stack should vanish

Just a few:


To conclude: I want to provide the same level of flexibility for the C API as we have already in Python, open Python for new, efficient algorithmic approaches, at low implementation and runtime cost.

Goals of my implementation

Task

Details

support for Sam Rushing

This was the initializer. Sam wants to try co-routines. I want to help doing this in an efficient, platform independent way.

get rid of the C stack

see above. The stack is in my way.

allow for pluggable interpreters

This happened by chance. I realized that in order to make stackless map, filter and reduce, it is necessary to build tiny custom interpreters.

do the minimum changes to the existing code

Python will change anyway. The less I write, the lesser I have to change again, and it will be easier to compare.

stay compatible to existing compiled extension modules

stackless Python should be a drop-in replacement.

find out how to build stackless extension modules

stackless extension modules which call back into the interpreter can be built rather easily. The proof is given by stackless map.

As a proof of concept, and to learn how this must work, I had to write a Python version which reaches all of the above goals. Let's see what came out of this.

The Concept

The basic rules are simple:

  1. Avoid recursive calls into the interpreter.

  2. Store all runtime information in frames.

  3. Allow frames to be restarted. Just add to the frame structure, don't break anything.

  4. Store the interpreter function in the frames.

  5. Have one central frame handler, make eval_code2 into just one interpreter

  6. Provide new functions but keep backward compatible stub functions.

Rule 1 is obvious but looks hard to fulfil.

Rule 2 is obvious, too. The frame chain becomes our true stack. No information should be hidden in the C stack while a Python function is being called.

Rule 3 is one of the key ideas. We come to it soon.

Rule 4 is just consequent. If a frame is to know everything about the running code, then it also should know which the interpreter is. As a side effect, frames can call each other at wilt, without having to know more than that this is just a runnable frame.

Rule 5 is again consequent. The central frame handler is the very minimum which we need. It does nothing more than juggling the frame calls and passing results. With a single concept of frames, we can run any kind of interpreters together, as long as they obey the Python Object Protocol.

Well, Rule 6 is clear. We want to be still standard python with the standard C API just extended.

Problem analysis

Let's have a look into the code. The first file to look into is ceval.c. We compare the old and the new version. If you try to figure out what happens on a Python function call, you will first think it is impossible to do that without the C stack.

Have a look at function eval_code2, at the case CALL_FUNCTION. Many different cases are handled here, other functions are called, like PyEval_CallObjectWithKeywords. There, a number of decisions are made, which in turn cause other function calls, and finally, we end up somewhere deeply nested, with either a call to call_builtin or call_function. For getting a result, either a C function is called, or a new interpreter incarnation handles the call, finally via eval_code2 again.

In order to do this call, the calling function somehow prepares parameters, keeps a reference to the parameters, lets the evaluation happen, releases the parameters and returns the result.

My biggest fear was that I would have to rewrite all of this. But this is not true.

Problem solution

If we just avoid to do the final call to eval_code2, we are almost done. Look into the old version of call_function, and compare it to ....

    result = eval_code2(
        (PyCodeObject *)PyFunction_GetCode(func),
        PyFunction_GetGlobals(func), (PyObject *)NULL,
        &PyTuple_GET_ITEM(arg, 0), PyTuple_Size(arg),
        k, nk,
        d, nd,
        class);

    Py_DECREF(arg);
    PyMem_XDEL(k);

    return result;
}

....the new version. Instead of calling the interpreter, we end up with this piece of code:

    f = eval_code2_setup(
        (PyCodeObject *)PyFunction_GetCode(func),
        PyFunction_GetGlobals(func), (PyObject *)NULL,
        &PyTuple_GET_ITEM(arg, 0), PyTuple_Size(arg),
        k, nk,
        d, nd,
        class);
    if (f != NULL) {
        f->f_hold_ref = arg;    /* the decref will happen on frame disposal */
        result = Py_UnwindToken;
    }
    else result = NULL;

    PyMem_XDEL(k);
    return result;
}

eval_code2_setup is a function which just prepares a new frame, in order to be run later. The frame holds all references to parameters, and the new field f_hold_ref takes the role to keep a reference to arg. Instead of actually performing the call now, call_function returns a special object, Py_UnwindToken as its value. This value is passed as a result through the actual pile of invoked C functions until it is caught in the calling interpreter (yes, we are back at the CALL_FUNCTION opcode). The interpreter checks this value and knows that a call is pending, so the current frame must be left, and a new frame has been put onto the stack, which should be called now.

Why is this possible, and why does it appear to be so easy?

The reason is:
Python is already nearly stackless. All the functions try to do nothing more than just to prepare for the next interpreter call, which will do the actual evaluation. After as code object and its parameters have been checked for errors, the runtime error checking is already deferred to the responsible interpreter. But this does not mean that we need to call the interpreter right now where we are! Got it?

In other words: Allmost all calls into the interpreter turn out to be tail recursive. After calling parameters have been checked, there is no need to actually call the next function. It is ok to generate a frame on the frame stack, which is ready to be run by the next interpreter involved, but there is no need to do this in the current pile of active C functions.

Now, this is a very central clue in this "game": Whenever a function believes that it is done by running eval_code2 and returning a value, it deferres all error handling to "the" interpreter. What I do is nothing more than to defer the whole call to just "one" interpreter, which has the same semantics. It is just a special form of tail recursion what happens here. It does not matter if a NULL value is treated as an exception right here, in the 25th recusion of some function call, or somewhere else, in a top-level interpreter incarnation.

By using this principle, it was possible to make stackless versions of all of the interpreter functions. The conversion works like this:

If an interpreter function (aka. a function which calls back into the interpreter) just tries to prepare a call to the interpreter, I changed this into the creation of a new runnable frame on the frame stack, together with a return value of Py_UnwindToken. This is a special object with the only purpose to tell an interpreter, that it should leave the interpreter loop and return immediately, since something special happened.

Another approach to explain this:
The interpreter functions are (almost) all working with PyObjects. Return values can be NULL, or a proper PyObject. NULL has the semantics of an error which has to be handled, and it will cause the interpreters to unwind the stack by an exception. This gives us just a two-valued logic, encoded by return values being either PyObjects or NULL. By introducing th Py_UnwindToken, I extended this to three-valued logic.

NULL

An error occoured

Py_UnwindToken

please dispatch a frame

any other PyObject

just a return value

Since all these values can be passed through the existing interpreter functions, I saved a major rewrite and had just to take care to catch the right places to change.

If you are still with me, now the time has come to change your understanding of functions, stacks and return values.

The Frame Dispatcher

Let's have a look into the new, central "interpreter" function in ceval.c Actually it is no interpreter, but a function which repeatedly evaluates the top of the frame stack, using that frame's interpreter.

static PyObject *
PyEval_Frame_Dispatch()
{
    PyObject * result;
    PyFrameObject *f;
    PyThreadState *tstate = PyThreadState_GET();
    void *self;

    f = tstate->frame;
    if (f==NULL) return NULL;
    /* the initial frame should really belong to us */
    f->f_signature = &self;
    result = NULL;

    while (1) {
        result = f->f_execute(f, result) ;
        f = tstate->frame;
        if (result == Py_UnwindToken) {
            /* this is actually the topmost frame */
            /* peel off an optional return value */
            result = (PyObject *) f->f_signature;
            /* and mark the frame as our own */
            f->f_signature = &self;
        }
        else if (f==NULL || f->f_signature != &self) break;
    }
    return result;
}

Just try to understand the central loop. The dispatcher picks the topmost frame, the current result, and calls the frame's f_execute function. This is the place where we always return to, regardless for what reason we ran a frame or why we left it.

This loop will continue to evaluate frames, until it reaches some endcondition. It doesn't care about the frame stack which might grow as Python functions are called, or shrink when Python functions return or raise exceptions. It will simply run the topmost frame, whatever is there, whatever interpreter it uses.

You might wonder about the result variable which looks a little funny. This is indeed the result which the execution of a frame came up with. We wouldn't even need to examine this variable, since the executing interpreters know wether they expect a result and how to handle it. The reason why we do check for a Py_UnwindToken is just to assign our ownership to a probably newly created frame, and to handle the special case of passing a return value. More on that later.

Excursion: How to Implement Coroutines?

Let me put some refreshment into this dry code discussion. We have already come quite far by understanding the little frame dispatcher and the special unwinding token. The frame dispatcher runs whatever frame is there, the unwind token just returns from an interpreter, in order to let a different frame be executed.

If you read the last sentence twice, you will understand why I'm so keen on having this dispatcher.

The curent use is to return an unwind token in order to run a newly created frame. But there is much more, since it doesn't need to be a new frame, it might just be any frame!

As a quick example of this idea, let me sketch the pseudo-code of a coroutine-transfer.

Given two Python functions which want to act as coroutines, which means they can transfer control flow to each other.

XXX I want to insert one of Tim Peter's examples here. I looked into his 5 years old code which is pretty, but since he said that nowadays he would do this in a different way, I'll better wait for something new to appear?

As of 990712, I have a continuation module ready which will be published soon. This will give a good source of examples.
XXX

The coroutine implementation needs to switch between two different execution states. In terms of our frame stack, this would just mean to swap the current frame chain with a different one.

Here the pseudocode:

PyObject *coroutine_transfer(controller, arg)
    PyCoObject *controller;
    PyObject *arg;
{
    PyFrameObject ffrom, fto;
    PyThreadState *tstate = PyThreadState_Get();
    ffrom = state->frame;
    fto = controller->frame;
    if (ffrom->f_signature != fto->f_signature) {
        /* set some exception here */
        return NULL;
    }
    controller->frame = ffrom;
    tstate->frame = fto;
    fto->f_signature = arg;
    return Py_UnwindToken;
}

The idea of swapping frame stacks is only limited by the rule, that the signatures of the swapped frames have to be the same. This is the whole thing to do, leading to an extremely efficient coroutine implementation.

The New C API

In order to stay backward compatible, all the known recursive interpreter functions needed to be retained. To avoid duplicates of all the code, the following technique was applied:

The old function header was copied, the new function got an "_nr" appended to its name. The new function was changed to be non-recursive. If the old version was short enough or had to be changed anyway, code was indeed copied. An example for this is the well-known eval_code:

/* Backward compatible interface */

PyObject * 
   PyEval_EvalCode(co, globals, locals) 
       PyCodeObject *co; 
       PyObject *globals; 
       PyObject *locals; 
   { 
       PyFrameObject *frame; 
       frame=eval_code2_setup(co, 
                        globals, locals, 
                        (PyObject **)NULL, 0, 
                        (PyObject **)NULL, 0, 
                        (PyObject **)NULL, 0, 
                        (PyObject *)NULL); 
       if (frame != NULL) 
           return PyEval_Frame_Dispatch();    
       else 
           return NULL; 
   } 
     
PyObject * 
   PyEval_EvalCode_nr(co, globals, locals) 
       PyCodeObject *co; 
       PyObject *globals; 
       PyObject *locals; 
   { 
       PyFrameObject *frame; 
       frame = eval_code2_setup(co, 
                        globals, locals, 
                        (PyObject **)NULL, 0, 
                        (PyObject **)NULL, 0, 
                        (PyObject **)NULL, 0, 
                        (PyObject *)NULL); 
       if (frame != NULL) 
           return Py_UnwindToken; 
       else 
           return NULL; 
   } 

The difference is obvious: While the backward compatible version creates a ready-to-run frame (which is put onto the frame stack by eval_code2_setup) and runs it to the end by PyEval_Frame_Dispatch, the other just levaes that frame and returns Py_UnwindToken.

An example where the original version was expressed in terms of the new version, we have a look at PyEval_CallObjectWithKeywords:

PyObject *
   PyEval_CallObjectWithKeywords(func, arg, kw)
       PyObject *func;
       PyObject *arg;
       PyObject *kw;
   {
       PyObject *retval = PyEval_CallObjectWithKeywords_nr(func,    arg, kw);
       if (retval == Py_UnwindToken)
           retval = PyEval_Frame_Dispatch();
       return retval;
   }
   
   PyObject *
   PyEval_CallObjectWithKeywords_nr(func, arg, kw)
       PyObject *func;
       PyObject *arg;
       PyObject *kw;
   {
       ternaryfunc call;
       PyObject *result;
   
       if (arg == NULL)
           arg = PyTuple_New(0);
       else if (!PyTuple_Check(arg)) {
           PyErr_SetString(PyExc_TypeError,
                              "argument list must be a tuple");
           return NULL;
       }
       else
       ...
       ...

The old one was turned into a call to the new, followed by a dispatcher call.
Note that in the case of PyEval_CallObjectWithKeywords, the "new" function is exactly the same code as the original one. The difference is just that the used internal functions of ceval.c, call_builtin and call_function have changed their semantics to be non-recursive versions. PyEval_CallObjectWithKeywords_nr does not know that it now can return something special, since an Py_UnwindToken is just a PyObject. So the "new" function in this case is the new version of PyEval_CallObjectWithKeywords which just takes care that no Py_UnwindToken can leak out to a C function which is not stackless-aware.

The backward compatible version has exactly the same semantics as in the original code. It runs some code and returns either a PyObject or a NULL, indicating an exception.
Extension modules which wish to implement stackless C functions in a similar fashion as shown here, will use the new function instead.

New stackless functions in ceval.c:

New stackless functions in pythonrun.c:

New stackless functions in bltinmodule.c:

The builtin_ functions will finally replace their originals, since they are not used elsewhere. builtin_eval and builtin_apply are straight forward to implement and just not done yet.

The major problem are functions which cannot easily be converted since they are not tail recursive preparations of a singe eval_code call, but repetitive calls are performed. To obtain a stackless version of these, a number of actions was necessary, and I implemented this as a proof of concept just for builtin_map:

XXX Missing:
Explaining interpreter-like functions like map_nr
New fields in the frame structure
Where to move from here

Christian Tismer who really needs feedback :-)

© 1999 by PNS - Professional Net Service GmbH
.