Last week I got to take part in the CPython Core Developer Sprint in
Cambridge, hosted by ARM and brilliantly
organized by Diego Russo
— around 50 core devs and guests were there, and I was excited to join as one
of the guests.
I had three main areas of focus:

- C API: this was a follow-up of what we discussed at the
  C API summit at EuroPython. The current C API is problematic, so we are
  exploring ideas for the development of PyNI (Python Native Interface),
  whose design will likely be heavily inspired by HPy. It’s important to
  underline that this is just the beginning and that the entire process
  will require multiple PEPs.

- fancycompleter: this is a small PR which I started months ago to enable
  colorful tab completions within the Python REPL. I wrote the original
  version of fancycompleter 15 years ago, but colorful completions work
  only in combination with PyREPL. Now that PyREPL is part of the standard
  library and enabled by default, we can finally upstream it. I hope to see
  it merged soon.

- “JIT stuff”: I spent a considerable amount of time talking to the people
  who are working on the CPython JIT (in particular Mark, Brandt, Savannah,
  Ken Jin and Diego). Knowledge transfer worked in both directions: I
  learned a lot about the internal details of CPython’s JIT, and conversely
  I shared some of the experience, pain points and gut feelings I gained
  from working on PyPy for many years.
In particular, on the first day I presented a talk titled Tracing JIT and real world Python (slides and source code).
What follows is an annotated version of the slides.
CPython’s new JIT and PyPy’s JIT share fundamental similarities, as they’re both
tracing JITs.
I spent ~7 years of my career optimizing existing code for PyPy at a
high-frequency trading firm, and I realized that I’m probably one of the few
people in the world with actual experience in optimizing real world Python
code for a tracing JIT.
I expect that some of the challenges which I faced will still be valid also
for CPython, and I wanted to share my experience to make sure that CPython
core devs are aware of them.
One lesson I learned is that the set of benchmarks in pyperformance is a
good starting point, but it is not entirely representative of what you find
in the wild.
The main goal of the talk is not to present solutions to these problems,
but to raise awareness that they exist.
Until now, CPython’s performance has been fairly predictable: there are
well-established “performance tricks” to make code faster, and generally
speaking you can mostly reason about the speed of a given piece of code
“locally”.
Adding a JIT completely changes how we reason about performance of a given
program, for two reasons:
- JITted code can be very fast if your code conforms to the heuristics
  applied by the JIT compiler, but unexpectedly slow(-ish) otherwise;

- the speed of a given piece of code might depend heavily on what happens
  elsewhere in the program, making it much harder to reason about
  performance locally.
The end result is that modifying a line of code can significantly
impact seemingly unrelated code. This effect becomes more pronounced as the
JIT becomes more sophisticated.
The CPython JIT is still pretty new and doesn’t give huge speedups yet. I
expect that as it gets faster, its performance will start looking more and
more like PyPy’s.
I delivered this talk at the Core Dev Sprint: I expected my audience to be
familiar with CPython’s JIT, and I wanted to draw parallels with PyPy’s.
Since the audience of this blog is different, let me briefly explain
CPython’s JIT first.
The explanations of both JITs are necessarily short, incomplete and highly
simplified.
CPython JIT 101
Python source code is turned into bytecode. Bytecode is a sequence of
“opcodes” (LOAD_FAST, BINARY_OP, etc.), and the CPython VM is an
interpreter for those opcodes. Historically the VM was written by hand, and the
main loop consisted of a big switch statement which executed the code
corresponding to each opcode.
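For example, you can inspect the opcodes of a function with the standard
library’s dis module (a quick illustration of mine, not code from the talk):

```python
import dis

def add(a, b):
    return a + b

# Disassemble the function: the exact opcodes vary across CPython
# versions, but you will see things like LOAD_FAST and BINARY_OP.
dis.dis(add)
```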
Nowadays things are different: the opcodes are written in a special DSL and
the main interpreter loop is generated from this
DSL. Additionally, the DSL describes how each opcode can be decomposed into
multiple “microops”.
When the interpreter detects a “hot loop”, it starts the JIT. The JIT
retroactively looks at the opcodes which were executed in the last iteration
of the loop, and creates a “linear trace” which contains the equivalent
microops. This process is called trace projection and the result is an
unoptimized trace of microops.
Then, the JIT can produce an optimized trace, by reordering and removing
redundant microops. Finally, the optimized trace is turned into executable
code using the “copy & patch” technique.
PyPy JIT 101
CPython’s Python interpreter is written in C, and then compiled into an
executable by gcc (or any other C compiler).
Similarly, PyPy’s Python interpreter is written in RPython, and then compiled
into an executable by rpython.
Under the hood, rpython applies two separate transformations to the source
code:
- it turns each function into C code, which is then fed to gcc to get the
  final executable;

- it turns each function into “jitcodes”, which are a representation of
  RPython’s IR (intermediate representation). For each RPython function,
  the final ./pypy executable contains both its compiled representation
  (generated by gcc) and its jitcode representation (embedded as static
  data into the executable).
In a way, RPython’s jitcodes are equivalent to CPython’s microops, as they are
a low-level representation of the logic of each opcode.
When the interpreter detects a hot loop, it enters trace recording mode,
which is essentially an interpreter which executes the jitcodes: the result is
a linear unoptimized trace of all the jitcodes which were actually executed.
Similarly to CPython, PyPy then produces an optimized trace, which is then
sent to the JIT backend for actual native code generation.
Tracing JITs work by recording a trace of all microops which are
executed. The optimizer can then reason about what happens in the trace and
remove unneeded operations.
However, sometimes we encounter an operation which is a black box from the
point of view of the tracer: we call these “trace blockers”, because the
tracing JIT cannot see through them. In the case of CPython, this happens,
for example, whenever we call any function implemented in C (because it
doesn’t have any corresponding “microop”).
This is a simple function that computes pi, generated by ChatGPT. Its
precise content is not important: what matters is that it’s a nice purely
numerical loop that the PyPy JIT can optimize very well.
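The code from the slides is not reproduced inline; a comparable purely
numerical loop (my own sketch, not the exact function from the talk) could
look like this:

```python
import math

def compute_pi(n):
    # Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    # A tight, purely numerical loop that a tracing JIT optimizes well.
    acc = 0.0
    for i in range(n):
        acc += (-1.0) ** i / (2 * i + 1)
    return 4.0 * acc

print(abs(compute_pi(100_000) - math.pi))  # small: the series converges to pi
```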
Same function as above, but with a call to hic_sunt_leones(). This is
actually an empty function which does absolutely nothing, but it is
annotated in a special way so that the PyPy JIT cannot “enter” it: it
effectively behaves as a trace blocker.
In this example we use the special pypyjit.residual_call to simulate a
trace blocker, but in real life we get one whenever we call any
non-traceable function, in particular C extensions.
The clean version runs 42x faster on PyPy than CPython – that’s the JIT
working perfectly. But with just one untraceable function call added to the
loop, PyPy slows down to only 1.8x faster than CPython. That single line
destroyed most of the JIT’s effectiveness!
This happens because after the call the optimizer no longer knows whether its
assumptions about the world are still true, and thus must be much more
conservative.
I fear that for CPython, this will turn out to be a much bigger problem than
for PyPy, for two reasons:
- nowadays it’s virtually impossible to run Python code without using any
  C extension, either directly or indirectly;

- by construction, PyPy’s JIT can see much more than CPython’s JIT.
  Remember the slide about “jitcodes”: any RPython function gets a
  “jitcode” equivalent, which means that the JIT can automatically trace
  inside builtins and the internals of the interpreter, whereas CPython can
  trace only inside pure Python code.
For example, PyPy’s JIT can trace through range(), zip, and enumerate()
automatically. CPython’s JIT currently cannot because they are implemented in
C. CPython could add special cases for these common functions, but the
general approach doesn’t scale.
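As an illustration (my own snippet, not from the slides), consider a loop
over enumerate(): PyPy’s JIT traces straight through the builtin, while for
CPython’s JIT the C-implemented enumerate acts as a trace blocker today:

```python
def weighted_total(items):
    # enumerate() is implemented in C: PyPy's JIT can trace inside it,
    # CPython's JIT currently cannot.
    acc = 0
    for i, x in enumerate(items):
        acc += i * x
    return acc

print(weighted_total([2, 3, 4]))  # 0*2 + 1*3 + 2*4 = 11
```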
The second big problem is what I call “data driven control flow”. This example
has been autogenerated by ChatGPT and it’s completely silly, but it’s a good
representation of what happens in real life code.
In this example, fn takes 9 parameters, each of which can be None or a
number. The function starts with a sequence of “if ... is None” checks, and
is then called repeatedly in a loop.
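The slide’s code isn’t inline; a minimal sketch of the pattern (my
reconstruction, with hypothetical names) is:

```python
def fn(a, b, c, d, e, f, g, h, i):
    # Each parameter may independently be None, so every check below can
    # go either way: up to 2**9 = 512 distinct paths through the function.
    if a is None: a = 0
    if b is None: b = 0
    if c is None: c = 0
    if d is None: d = 0
    if e is None: e = 0
    if f is None: f = 0
    if g is None: g = 0
    if h is None: h = 0
    if i is None: i = 0
    return a + b + c + d + e + f + g + h + i

print(fn(None, 1, 2, None, 3, None, 4, 5, None))  # 15
```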
One of the assumptions of tracing JITs is that control flow tends to stay on
the “hot path”, and that it’s enough to optimize that path to get good
performance.
But in a case like this, each combination of Noneness selects a different
path, and if we assume the data is evenly distributed, we find out that
there is no hot path.
Let’s see what happens when we execute on CPython and PyPy:
PyPy without JIT is “only” 2.3x slower than CPython, but when we enable the
JIT, it becomes much worse. This happens because of an exponential
explosion of code paths seen by the JIT.
In a normal compiler, an if statement is compiled as a diamond, and the
control flow merges after each if:
A tracing JIT by definition follows what’s happening during a concrete
execution, so it sees only a concrete path in the control flow, with “guards”
to ensure correctness:
When guard(a is None) fails enough times, we create a “bridge” and record
another linear trace, following again the concrete control flow that happens
now:
guard(a is None) ----> FAIL (side exit)
/ \
/ \
a = 0 pass
\ \
\ \
guard(b not None) guard(b not None)
/ /
/ /
b = 0 b = 0
\ \
\ \
... ...
Note how b = 0 is effectively duplicated now. By design, PyPy’s JIT never
merges execution flow.
Looking inside PYPYLOG confirms our theory: we get “exponential
tracing”. The JIT has to compile separate optimized code for every unique
combination of which parameters are None and which aren’t. With 9 parameters,
that could be up to 512 different combinations!
One possible mitigation is to rewrite conditional code to be “branchless” –
using arithmetic tricks instead of if statements. But this makes code ugly and
unreadable, and it’s not always possible.
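For instance (a hypothetical sketch of mine, not from the talk), a
data-dependent if can sometimes be replaced with arithmetic so that the
trace contains no guard on the data:

```python
def clamp_branchy(x):
    # the trace forks here depending on the data
    if x < 0:
        x = 0
    return x

def clamp_branchless(x):
    # (x > 0) evaluates to 0 or 1, so this computes max(x, 0)
    # with a single straight-line trace
    return (x > 0) * x

print(clamp_branchy(-5), clamp_branchless(-5))  # 0 0
```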
Despite years of working on this, I never found a really good solution. There
were cases in which we had to continue running some piece of code on CPython
because I never managed to make the PyPy version faster.
This pattern happens quite a lot, although it is often more subtle: in this
silly example all the ifs are nicely grouped together at the start, but in a
long trace they can be scattered across multiple places, and any kind of
control flow contributes to the problem, not only ifs. In Python, this
includes any kind of dynamic dispatch, exceptions, etc.
One possible solution for CPython’s JIT is to try to merge (some) traces to
avoid or limit the exponential explosion. However, it is worth underlining that
tracing JITs shine precisely when they can optimize a long linear trace: if
you try to compile shorter traces, you might quickly end up in a situation
which is equivalent to the “trace blocker” problem described earlier.
I suspect this might be a fundamental limitation of tracing JITs.
Compared to the other two problems, this one is less serious, but it’s worth
mentioning because of the prevalence of async (and thus, implicitly,
generators) in modern Python.
Here’s another silly function that counts Pythagorean triples using nested
loops. This is our baseline version using plain loops.
Here’s the same algorithm refactored to use a generator function for the
nested iteration. The “state of iteration” is implicitly stored inside the
local variables of frame object associated to the range_product generator.
Here’s the same functionality implemented as a traditional iterator class. The
“state of iteration” is explicitly stored as attributes of RangeProductIter.
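The slide code isn’t reproduced inline; the three shapes might look roughly
like this (my reconstruction, names and loop bounds are assumptions):

```python
def count_triples_loop(n):
    # baseline: explicit nested loops
    count = 0
    for a in range(1, n):
        for b in range(1, n):
            for c in range(1, n):
                if a * a + b * b == c * c:
                    count += 1
    return count

def range_product(n):
    # generator: iteration state lives implicitly in the generator's frame
    for a in range(1, n):
        for b in range(1, n):
            for c in range(1, n):
                yield a, b, c

def count_triples_gen(n):
    return sum(1 for a, b, c in range_product(n) if a * a + b * b == c * c)

class RangeProductIter:
    # iterator class: iteration state stored explicitly as attributes
    def __init__(self, n):
        self.n = n
        self.a = self.b = 1
        self.c = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.c += 1
        if self.c >= self.n:
            self.c = 1
            self.b += 1
            if self.b >= self.n:
                self.b = 1
                self.a += 1
                if self.a >= self.n:
                    raise StopIteration
        return self.a, self.b, self.c

print(count_triples_loop(6), count_triples_gen(6))  # 2 2
```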
On CPython, the generator version is ~29% slower than the explicit loops. The
iterator class is much slower, as one would intuitively expect.
However, on PyPy we see different results: RangeProductIter runs at
basically the same speed as the baseline, while the generator version is
slower. This happens because in the case of RangeProductIter the JIT is able
to see the whole lifetime of the object and optimize it away entirely:
instance variables become local variables, the call to __next__ is inlined,
and we get the equivalent of explicit nested loops.
Generators, however, are required to create a frame object, and they
represent a fundamental case which the JIT cannot trace through effectively.
In more complex real-world scenarios, we saw much worse slowdowns than these
examples show.
This is a collection of other miscellaneous problems that I had to deal
with. Generally speaking, we lack good support for tooling and profilers:
CPython needs a good story for explaining to people what is happening when
the JIT is enabled.
Warmup is another big problem: on PyPy, very short programs tend to be
slower than on CPython because of the cost of JITting. Moreover, warmup is
not an easily definable phase, as the linked paper shows. This is an area
where CPython currently shines, as its JIT is very fast. I think it will
become somewhat slower as it tries to optimize more aggressively, but
hopefully warmup will overall be a lesser problem than on PyPy.
Moreover, it’s very easy to accidentally make your code 2x, 5x or even 10x
slower by changing seemingly innocent pieces of code. This is another reason
why good tooling is essential.
Finally, the “long tail of JITting”: every loop and every guard gets a
counter, and we start JITting when it reaches a threshold. Given a
sufficiently long running program, all counters reach the threshold eventually
and we end up JITting much more than necessary, using too much memory and/or
thrashing the cache. In many cases I found it beneficial to just disable the
JIT “after a while”, with manually tuned heuristics.
These are slides which I didn’t show during the live presentation; they show
a case where a tracing JIT can shine: since the JIT sees a complete trace of
an entire loop (including nested calls), it can easily remove a lot of the
temporary objects which usually penalize Python performance.
In many cases, we can get the famous “zero-cost abstractions”.
Let’s look at a concrete example. We need to compute the barycenter of
triangles that are serialized in a binary format. Each triangle has three
points, each point has x and y coordinates. This simulates real world
protocols such as protobuf, capnproto, etc.
This is what we use as a baseline: a bare loop, using struct.unpack_from to read 6 floats at a time.
Here’s the “proper” object-oriented approach, similar to how modern
serialization libraries work. We create Triangle and Point classes that
provide a nice API for accessing the binary data. Each property access creates
new objects and calls struct.unpack_from. This is much more readable and
reusable, but creates many temporary objects.
Here’s how you’d use the object-oriented API. The code is much cleaner and
more readable than the bare loop version. But notice how many object creations
are happening: one Triangle object, six Point objects, plus all the
intermediate tuples from struct.unpack_from.
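The slide code isn’t inline; a sketch of the two versions (my
reconstruction, assuming a layout of three points of two float64 values per
triangle) could be:

```python
import struct

TRIANGLE_SIZE = 6 * 8  # six float64 values: x0, y0, x1, y1, x2, y2

def read_loop(buf):
    # baseline: unpack 6 floats at a time, no abstractions
    res = []
    for off in range(0, len(buf), TRIANGLE_SIZE):
        x0, y0, x1, y1, x2, y2 = struct.unpack_from('dddddd', buf, off)
        res.append(((x0 + x1 + x2) / 3, (y0 + y1 + y2) / 3))
    return res

class Point:
    def __init__(self, buf, offset):
        self.buf = buf
        self.offset = offset

    @property
    def x(self):
        return struct.unpack_from('d', self.buf, self.offset)[0]

    @property
    def y(self):
        return struct.unpack_from('d', self.buf, self.offset + 8)[0]

class Triangle:
    def __init__(self, buf, offset):
        self.buf = buf
        self.offset = offset

    def point(self, i):
        # each call creates a temporary Point object
        return Point(self.buf, self.offset + i * 16)

def read_proto(buf):
    # object-oriented version: many temporary Triangle/Point objects
    res = []
    for off in range(0, len(buf), TRIANGLE_SIZE):
        t = Triangle(buf, off)
        p0, p1, p2 = t.point(0), t.point(1), t.point(2)
        res.append(((p0.x + p1.x + p2.x) / 3, (p0.y + p1.y + p2.y) / 3))
    return res
```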
As expected, on CPython read_proto is much slower than the bare one,
roughly 6x slower. However, PyPy can fully optimize away all the
abstraction overhead introduced by Triangle and Point.
In PyPy jargon we call this form of allocation removal “virtuals” (because we
create “virtual objects” whose fields are represented as local variables) and
it’s probably the single most important optimization that PyPy does.
During my week in Cambridge I talked extensively with the CPython JIT devs
about this and I hope I convinced them that this is what they should aim for
😊.
Note also that read_proto is actually faster than read_loop. This
happens because in read_loop we do a single struct.unpack_from('dddddd', ...),
while in read_proto we do a succession of six individual
struct.unpack_from('d', ...). It turns out that the JIT is able to trace
into the second form but not into the first, which means that in read_loop
we actually need to allocate a pseudo-tuple at each iteration.
The funny part is that I did not expect this result: I had to take the time
to analyze the JIT traces of both versions to understand why read_loop was
slower. This is probably the best illustration of how counterintuitive it is
to reason about performance in a JITted world.
Acknowledgments
Thanks to Carl Friedrich Bolz-Tereick and
Hood Chatham for feedback on the slides and the
post.

