[Stackless] Multi core/cpu?

Allen Fowler allen.fowler at yahoo.com
Fri Oct 19 23:38:49 CEST 2007

Santiago Gala <sgala at apache.org> wrote: 

> How could you even make something simple as a c-style for loop?  The
> index variable would be clobbered by the other thread running the same
> code.

Well, there is a piece you have not considered: the stack. Each thread
shares memory, except the processor registers and the  stack. Local
variables are allocated from the stack in most if not all languages,
which means the "automatic" (non extern/static) C vars are allocated
separately for each thread.
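
A minimal Python sketch of the point above: several threads run the same loop, and the loop index is not clobbered because each call has its own locals. The names `count_up` and `run_threads` are made up for illustration. In CPython the locals live in per-call frame objects rather than directly on the C stack, but the effect for the loop variable is the same.

```python
import threading

def count_up(name, limit, results):
    # `i` and `total` are locals: every call (and so every thread)
    # gets its own copies, so concurrent runs cannot clobber them.
    total = 0
    for i in range(limit):
        total += i
    # `results` is one shared dict; each thread writes only its own key.
    results[name] = total

def run_threads():
    results = {}
    threads = [
        threading.Thread(target=count_up,
                         args=(f"t{n}", 1000 * (n + 1), results))
        for n in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each thread should end up with the same total it would have computed running alone, even though all four execute the same code at the same time.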

Ahh.  OK.  I get it.

So this, in effect, is kind of like making a process where all the code, globals, and constants exist in "non-read-only" shared libraries.

Local vars in the "magical" stack are kind of like the non-shared memory areas of the above process.

Of course, this view does not explain why a stack is called a "stack" and not "thread-private work area".  No doubt, there is a very good reason.

(Something to do with thread-specific function call/return?)

Now in Python, most code tends to avoid the use of python global variables.

I presume this coding style is not feasible in C and friends?

Any introductory book on languages or
compilers should explain how a stack-allocated program works.

Any recommendations? :)

Thread switching is way cheaper than process switching, as the OS needs
only to save/restore registers and the stack pointer (and the stack
needs to get into cache, etc.). In a process switch, typically all the
cache is stale and needs to be refreshed from the new process memory.

Is thread vs. process switching just an issue of size of data needing to be moved over the CPU-Memory bus, or is there more to it?

Does a CPU see any difference between a thread and process switch?

> I really don't want to fork/spawn/whatever a 2nd/3rd/4th 100MB+ Python
> process.

Take a look at the wide finder code from Fredrik Lundh
( http://effbot.org/zone/wide-finder.htm ) starting with my code
( http://memojo.com/~sgala/blog/2007/09/29/Python-Erlang-Map-Reduce ) ,
and you'll see how it can pay to spawn a process for a lot of programs.
The timing goes from 1.9 secs  (his optimized serial code, 1GB input) to
0.9 secs in a two core machine.

Wow. Fascinating reading.

So, each worker thread blocks on IO to its spawned process, thus releasing the Python GIL.

Did I get that right? (I did not fully understand each line of code..)
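
To make that reading concrete, here is a rough sketch of the pattern as I understand it. It is not the wide finder code or the code from the linked posts: `WORKER_CODE`, `run_worker`, and `parallel_sum` are hypothetical names, and the "work" is just summing integers. Each thread in the parent starts a child interpreter and blocks waiting for its output; while a thread is blocked on that I/O, CPython releases the GIL, so the children can run on separate cores.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker: a child interpreter that sums a range of integers.
WORKER_CODE = (
    "import sys; lo, hi = int(sys.argv[1]), int(sys.argv[2]); "
    "print(sum(range(lo, hi)))"
)

def run_worker(bounds):
    lo, hi = bounds
    # The calling thread blocks here until the child exits; the GIL is
    # released while it waits, so other threads keep running.
    out = subprocess.run(
        [sys.executable, "-c", WORKER_CODE, str(lo), str(hi)],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

def parallel_sum(n, workers=2):
    # Split [0, n) into one chunk per worker; the last chunk takes the rest.
    step = n // workers
    chunks = [
        (k * step, n if k == workers - 1 else (k + 1) * step)
        for k in range(workers)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_worker, chunks))
```

For a job this small, the cost of starting the child interpreters dominates; the approach only pays off when each chunk of work is large compared with process startup.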

I can see that in some applications, the spawned worker process could be quite small.  

Again, a lot of effort nowadays is not directed towards optimizing
"small" problems, but rather to make them scale. Things like "use 1000
machines to count hits per URL for 20000 log files totaling 3TBytes..."

Yeah, I get that feeling.

It's great for the (cluster endowed) enterprise software developer, but not for those of us trying to make fast interactive single-user/single-machine apps.

In a very broad summary, it stops using the OS stack for the interpreter
internals, so that "microthreads" that can be switched very fast are
possible. (Correct my overgeneralized sentence, please)


So, thread / stack switching is fast, but not fast enough for a "Many Threaded" app.  

How does Stackless work, then?

Where are the tasklet's local variables stored?

It's the same amount of data, right?  Why would I run out of "stack memory", but not "stackless memory"?

As a side question, can tasklets (like threads) even safely access global (thus shared) variables, given the non-guaranteed order of execution?
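
One way to picture a possible answer, using only standard Python generators as an analogy (this is not the Stackless API, and not a claim about how Stackless is implemented internally): each generator keeps its locals in its own heap-allocated frame object, and control changes hands only at an explicit `yield`. The names `worker`, `run_all`, and `counter` are made up for this sketch.

```python
from collections import deque

counter = {"value": 0}  # shared state, visible to every microthread

def worker(name, steps, log):
    # `done` is a local; it lives in this generator's own frame object
    # on the heap and survives across every `yield`.
    done = 0
    while done < steps:
        # No switch can happen between this read and write: control
        # only changes hands at the explicit `yield` below.
        counter["value"] += 1
        done += 1
        log.append((name, done))
        yield  # cooperative switch point

def run_all(tasks):
    # Round-robin scheduler: resume each task until it finishes.
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
            queue.append(task)
        except StopIteration:
            pass  # task finished; drop it
```

Under this assumption of purely cooperative switching, access to shared variables is safe between switch points, although the order in which microthreads touch them still depends on the scheduler. Preemptive scheduling would bring back the usual thread-style races.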

Thank you again,
