Draft V.1.5, Christian Tismer 990612
Updated for release 0.3 on 990712, but still needs more work to reflect all the changes. Please see also the file changelog.txt for the last update.
Status of this paper:
This is a draft in flux and still needs
some work. The implementation seems to fulfil all of my needs so far.
I want to thank all people who helped me to turn this idea into reality.
Special thanks go to
Tim Peters for all his input and involvement
Sam Rushing for the inspiration and the reason
Axel Roepke for all the long and constructive discussions
Jean-Claude Wippler for helping to get the final bits done
Guido for the promised hug :-)
This paper might cause some damage to your brain. At least mine had to undergo a couple of changes before I was able to think stackless enough to write a stackless version of the interpreter. It is not the code I will describe that is difficult; the problem itself is a brainteaser.
Fasten your seatbelt and let me lead you through the whole story, at your own risk.
The current source snapshot, together with a
Python 1.5.2 compatible build of python15.dll is available at
ftp://ftp.pns.cc/pub/stackless_990713.zip
ftp://ftp.pns.cc/pub/stackless_990713_win32.zip
The standard Python interpreter is written in C. Every Python action is somehow performed by some C code. Whenever a piece of Python code is executed, a new incarnation of the interpreter loop is created and executed.
First of all, let's see what it means to have a C stack.
Consider a Python function a() which calls a Python function b() which calls a Python function c(). In the context of function c(), the C interpreters of all three functions are still alive. They are keeping their state on the C stack. When the Python functions come to an end or an exception is raised, the C functions are popped off the C stack. This is called "unwinding the stack".
In that sense, Python is not so different from other C programs, which are usually all stack based. But this is not the full story, since Python does quite a lot more than using the C stack.
Every running piece of Python code also has an associated Frame object. A Frame object is something like a running instance of a code object. Frames are used to hold local and global variables, to maintain the value stack of the byte code interpreter, and some other housekeeping information.
These Frames are chained together in a last-in/first-out manner. They make up a stack by themselves. And they do this in a way quite similar to the C stack.
You might say "well, fine, but why should I care?"
I will try to explain why I believe that being stackless is relatively easy to achieve, and that it makes very much sense.
Here are just a few reasons:
Recursion depth is limited by the C stack.
The C stack
has a fixed size, and therefore the stack cannot grow infinitely.
The frame stack is not limited. I think it is good to keep recursion
depth limited, but not as a matter of an implementation detail. This
should be a user option. The C stack should be reserved for C
modules which really need it.
The C stack holds references to
objects.
Well, eval_code is rather clean in this respect, but
other C functions may pile up on the stack as well. References to
function parameters are kept on the C stack until a function
returns.
The C stack holds execution state.
Information about the frame's current program counter and stack pointer is hidden in variables on the C stack and cannot be modified.
Ergo: The C stack limits the order
of execution.
As a standard user, you might not care about the order of execution. But whenever you try to play with special concepts like coroutines, generators, or even cooperative garbage collection, you can always find some rather easy implementation in Python, but you cannot try it with C code.
Removing the C stack is cheap.
As I try to show in this paper, the implementation effort is much smaller than one would think. Execution speed is nearly the same. But the new possibilities of having an explicit stack are many, and they cannot even be tried at the moment.
Coroutines can be incredibly fast
As we will see in the following, by decoupling the frames from
the C stack, coroutines can be implemented so fast, that it might
become feasible to use them for quite a number of problems.
Switching between two coroutine frames needs just a built-in function call, which is much cheaper than calling a Python function.
Last but not least: Perl already no longer has a C stack. They will have had some good reasons for this. Python has no reason to be limited.
To conclude: I want to provide the same level of flexibility for the C API as we already have in Python, opening Python for new, efficient algorithmic approaches, at low implementation and runtime cost.
Task: support for Sam Rushing
Details: This was the initializer. Sam wants to try coroutines. I want to help do this in an efficient, platform-independent way.

Task: get rid of the C stack
Details: See above. The stack is in my way.

Task: allow for pluggable interpreters
Details: This happened by chance. I realized that in order to make stackless map, filter and reduce, it is necessary to build tiny custom interpreters.

Task: do the minimum changes to the existing code
Details: Python will change anyway. The less I write, the less I have to change again, and it will be easier to compare.

Task: stay compatible with existing compiled extension modules
Details: Stackless Python should be a drop-in replacement.

Task: find out how to build stackless extension modules
Details: Stackless extension modules which call back into the interpreter can be built rather easily. The proof is given by stackless map.
As a proof of concept, and to learn how this must work, I had to write a Python version which reaches all of the above goals. Let's see what came out of this.
The basic rules are simple:
Avoid recursive calls into the interpreter.
Store all runtime information in frames.
Allow frames to be restarted. Just add to the frame structure, don't break anything.
Store the interpreter function in the frames.
Have one central frame handler; make eval_code2 into just one interpreter among others.
Provide new functions but keep backward compatible stub functions.
Rule 1 is obvious but looks hard to fulfil.
Rule 2 is obvious, too. The frame chain becomes our true stack. No information should be hidden in the C stack while a Python function is being called.
Rule 3 is one of the key ideas. We come to it soon.
Rule 4 is a natural consequence. If a frame is to know everything about the running code, then it should also know which interpreter it belongs to. As a side effect, frames can call each other at will, without having to know more than that the other is just a runnable frame.
Rule 5 follows as well. The central frame handler is the very minimum that we need. It does nothing more than juggle the frame calls and pass results. With a single concept of frames, we can run any kinds of interpreters together, as long as they obey the Python Object Protocol.
Well, Rule 6 is clear. We still want to be standard Python, with the standard C API just extended.
Let's have a look into the code. The first file to look into is ceval.c. We compare the old and the new version. If you try to figure out what happens on a Python function call, you will first think it is impossible to do that without the C stack.
Have a look at function eval_code2, at the case CALL_FUNCTION. Many different cases are handled here, other functions are called, like PyEval_CallObjectWithKeywords. There, a number of decisions are made, which in turn cause other function calls, and finally, we end up somewhere deeply nested, with either a call to call_builtin or call_function. For getting a result, either a C function is called, or a new interpreter incarnation handles the call, finally via eval_code2 again.
In order to do this call, the calling function somehow prepares parameters, keeps a reference to the parameters, lets the evaluation happen, releases the parameters and returns the result.
My biggest fear was that I would have to rewrite all of this. But this is not true.
If we just avoid doing the final call to eval_code2, we are almost done. Look into the old version of call_function, and compare it to ....
	result = eval_code2(
		(PyCodeObject *)PyFunction_GetCode(func),
		PyFunction_GetGlobals(func), (PyObject *)NULL,
		&PyTuple_GET_ITEM(arg, 0), PyTuple_Size(arg),
		k, nk, d, nd, class);
	Py_DECREF(arg);
	PyMem_XDEL(k);
	return result;
}
....the new version. Instead of calling the interpreter, we end up with this piece of code:
	f = eval_code2_setup(
		(PyCodeObject *)PyFunction_GetCode(func),
		PyFunction_GetGlobals(func), (PyObject *)NULL,
		&PyTuple_GET_ITEM(arg, 0), PyTuple_Size(arg),
		k, nk, d, nd, class);
	if (f != NULL) {
		f->f_hold_ref = arg; /* the decref will happen on frame disposal */
		result = Py_UnwindToken;
	}
	else
		result = NULL;
	PyMem_XDEL(k);
	return result;
}
eval_code2_setup is a function which just prepares a new frame, in order to be run later. The frame holds all references to parameters, and the new field f_hold_ref takes over the job of keeping a reference to arg. Instead of actually performing the call now, call_function returns a special object, Py_UnwindToken, as its value. This value is passed as a result through the current pile of invoked C functions until it is caught in the calling interpreter (yes, we are back at the CALL_FUNCTION opcode). The interpreter checks this value and knows that a call is pending, so the current frame must be left, and a new frame has been put onto the stack, which should be called now.
Why is this possible, and why does it appear to be so easy?
The reason is:
Python is already nearly stackless. All the functions try to do nothing more than prepare for the next interpreter call, which will do the actual evaluation. After a code object and its parameters have been checked for errors, the runtime error checking is already deferred to the responsible interpreter. But this does not mean that we need to call the interpreter right now, right where we are! Got it?
In other words: Almost all calls into the interpreter turn out to be tail recursive. After the calling parameters have been checked, there is no need to actually call the next function. It is fine to generate a frame on the frame stack which is ready to be run by the next interpreter involved, but there is no need to do this in the current pile of active C functions.
Now, this is the very central clue in this "game": Whenever a function believes that it is done by running eval_code2 and returning a value, it defers all error handling to "the" interpreter. What I do is nothing more than defer the whole call to just "one" interpreter, which has the same semantics. What happens here is just a special form of tail recursion. It does not matter whether a NULL value is treated as an exception right here, in the 25th recursion of some function call, or somewhere else, in a top-level interpreter incarnation.
By using this principle, it was possible to make stackless versions of all of the interpreter functions. The conversion works like this:
If an interpreter function (i.e., a function which calls back into the interpreter) just tries to prepare a call to the interpreter, I changed this into the creation of a new runnable frame on the frame stack, together with a return value of Py_UnwindToken. This is a special object whose only purpose is to tell an interpreter that it should leave the interpreter loop and return immediately, since something special happened.
Another approach to explain this:
The interpreter functions are (almost) all working with PyObjects. Return values can be NULL, or a proper PyObject. NULL has the semantics of an error which has to be handled, and it will cause the interpreters to unwind the stack by an exception. This gives us just a two-valued logic, encoded by return values being either PyObjects or NULL. By introducing the Py_UnwindToken, I extended this to a three-valued logic:
NULL               | an error occurred
Py_UnwindToken     | please dispatch a frame
any other PyObject | just a return value
Since all these values can be passed through the existing interpreter functions, I saved a major rewrite and had just to take care to catch the right places to change.
If you are still with me, now the time has come to change your understanding of functions, stacks and return values.
Let's have a look into the new, central "interpreter" function in ceval.c. Actually it is not an interpreter, but a function which repeatedly evaluates the top of the frame stack, using that frame's interpreter.
static PyObject *
PyEval_Frame_Dispatch()
{
	PyObject *result;
	PyFrameObject *f;
	PyThreadState *tstate = PyThreadState_GET();
	void *self;

	f = tstate->frame;
	if (f == NULL)
		return NULL;
	/* the initial frame should really belong to us */
	f->f_signature = &self;
	result = NULL;
	while (1) {
		result = f->f_execute(f, result);
		f = tstate->frame;
		if (result == Py_UnwindToken) {
			/* this is actually the topmost frame */
			/* peel off an optional return value */
			result = (PyObject *) f->f_signature;
			/* and mark the frame as our own */
			f->f_signature = &self;
		}
		else if (f == NULL || f->f_signature != &self)
			break;
	}
	return result;
}
Just try to understand the central loop. The dispatcher picks the topmost frame and the current result, and calls the frame's f_execute function. This is the place we always return to, regardless of why we ran a frame or why we left it.
This loop will continue to evaluate frames until it reaches some end condition. It doesn't care about the frame stack, which might grow as Python functions are called, or shrink when Python functions return or raise exceptions. It will simply run the topmost frame, whatever is there, whatever interpreter it uses.
You might wonder about the result variable, which looks a little funny. This is indeed the result which the execution of a frame came up with. We wouldn't even need to examine this variable, since the executing interpreters know whether they expect a result and how to handle it. The reason why we do check for a Py_UnwindToken is just to assign our ownership to a probably newly created frame, and to handle the special case of passing a return value. More on that later.
Let me put some refreshment into this dry code discussion. We have already come quite far by understanding the little frame dispatcher and the special unwinding token. The frame dispatcher runs whatever frame is there, the unwind token just returns from an interpreter, in order to let a different frame be executed.
If you read the last sentence twice, you will understand why I'm so keen on having this dispatcher.
The current use is to return an unwind token in order to run a newly created frame. But there is much more to this, since it doesn't need to be a new frame; it might be any frame!
As a quick example of this idea, let me sketch the pseudo-code of a coroutine-transfer.
Consider two Python functions which want to act as coroutines, which means they can transfer control flow to each other.
XXX I want to insert one of Tim Peter's examples here. I looked into his 5 years old code which is pretty, but since he said that nowadays he would do this in a different way, I'll better wait for something new to appear?
As of 990712, I have a continuation module ready which will be
published soon. This will give a good source of examples.
XXX
The coroutine implementation needs to switch between two different execution states. In terms of our frame stack, this would just mean to swap the current frame chain with a different one.
Here is the pseudo-code:
PyObject *
coroutine_transfer(controller, arg)
	PyCoObject *controller;
	PyObject *arg;
{
	PyFrameObject *ffrom, *fto;
	PyThreadState *tstate = PyThreadState_Get();

	ffrom = tstate->frame;
	fto = controller->frame;
	if (ffrom->f_signature != fto->f_signature) {
		/* set some exception here */
		return NULL;
	}
	controller->frame = ffrom;
	tstate->frame = fto;
	fto->f_signature = arg;
	return Py_UnwindToken;
}
The idea of swapping frame stacks is only limited by the rule that the signatures of the swapped frames have to be the same. This is all there is to do, leading to an extremely efficient coroutine implementation.
In order to stay backward compatible, all the known recursive interpreter functions needed to be retained. To avoid duplicates of all the code, the following technique was applied:
The old function header was copied, and the new function got an "_nr" appended to its name. The new function was changed to be non-recursive. If the old version was short enough or had to be changed anyway, the code was indeed copied. An example of this is the well-known eval_code:
/* Backward compatible interface */
PyObject *
PyEval_EvalCode(co, globals, locals)
	PyCodeObject *co;
	PyObject *globals;
	PyObject *locals;
{
	PyFrameObject *frame;
	frame = eval_code2_setup(co, globals, locals,
			  (PyObject **)NULL, 0,
			  (PyObject **)NULL, 0,
			  (PyObject **)NULL, 0,
			  (PyObject *)NULL);
	if (frame != NULL)
		return PyEval_Frame_Dispatch();
	else
		return NULL;
}

PyObject *
PyEval_EvalCode_nr(co, globals, locals)
	PyCodeObject *co;
	PyObject *globals;
	PyObject *locals;
{
	PyFrameObject *frame;
	frame = eval_code2_setup(co, globals, locals,
			  (PyObject **)NULL, 0,
			  (PyObject **)NULL, 0,
			  (PyObject **)NULL, 0,
			  (PyObject *)NULL);
	if (frame != NULL)
		return Py_UnwindToken;
	else
		return NULL;
}
The difference is obvious: while the backward compatible version creates a ready-to-run frame (which is put onto the frame stack by eval_code2_setup) and runs it to the end by PyEval_Frame_Dispatch, the other just leaves that frame and returns Py_UnwindToken.
As an example where the original version was expressed in terms of the new one, let's have a look at PyEval_CallObjectWithKeywords:
PyObject *
PyEval_CallObjectWithKeywords(func, arg, kw)
	PyObject *func;
	PyObject *arg;
	PyObject *kw;
{
	PyObject *retval = PyEval_CallObjectWithKeywords_nr(func, arg, kw);
	if (retval == Py_UnwindToken)
		retval = PyEval_Frame_Dispatch();
	return retval;
}

PyObject *
PyEval_CallObjectWithKeywords_nr(func, arg, kw)
	PyObject *func;
	PyObject *arg;
	PyObject *kw;
{
	ternaryfunc call;
	PyObject *result;

	if (arg == NULL)
		arg = PyTuple_New(0);
	else if (!PyTuple_Check(arg)) {
		PyErr_SetString(PyExc_TypeError,
				"argument list must be a tuple");
		return NULL;
	}
	else
		...
	...
The old one was turned into a call to the new one, followed by a dispatcher call.
Note that in the case of PyEval_CallObjectWithKeywords, the "new" function is exactly the same code as the original one. The difference is just that the internal ceval.c functions it uses, call_builtin and call_function, have changed their semantics to be non-recursive versions. PyEval_CallObjectWithKeywords_nr does not know that it can now return something special, since a Py_UnwindToken is just a PyObject. So the "new" function in this case is the new version of PyEval_CallObjectWithKeywords, which just takes care that no Py_UnwindToken can leak out to a C function which is not stackless-aware.
The backward compatible version has exactly the same semantics as
in the original code. It runs some code and returns either a PyObject
or a NULL, indicating an exception.
Extension modules which wish
to implement stackless C functions in a similar fashion as shown
here, will use the new function instead.
New stackless functions in ceval.c:
PyEval_CallObjectWithKeywords_nr
PyEval_CallObject_nr
PyEval_EvalCode_nr
PyEval_Frame_Dispatch
New stackless functions in pythonrun.c:
PyRun_SimpleFile_nr
PyRun_String_nr
PyRun_File_nr
New stackless functions in bltinmodule.c:
builtin_map_nr done
builtin_eval_nr todo
builtin_filter_nr todo
builtin_reduce_nr todo
builtin_apply_nr done
The builtin_ functions will finally replace their originals, since they are not used elsewhere. The remaining ones are straightforward to implement, just not done yet.
The major problem is posed by functions which cannot easily be converted, since they are not tail-recursive preparations of a single eval_code call; instead, repetitive calls are performed. To obtain a stackless version of these, a number of actions were necessary, and as a proof of concept I implemented this just for builtin_map:
XXX Missing:
Explaining interpreter-like functions like map_nr
New fields in the frame structure
Where to move from here
Christian Tismer, who really needs feedback :-)
© 1999 by PNS - Professional
Net Service GmbH