Last week I got to take part in the CPython Core Developer Sprint in
Cambridge, hosted by ARM and brilliantly
organized by Diego Russo
— around 50 core devs and guests were there, and I was excited to join as one
of the guests.
I had three main areas of focus:

- C API: this was a follow-up of what we discussed at the
  C API summit at EuroPython. The current C API is problematic, so we are
  exploring ideas for the development of PyNI (Python Native Interface),
  whose design will likely be heavily inspired by HPy. It’s important to
  underline that this is just the beginning and that the entire process
  will require multiple PEPs.

- fancycompleter: this is a small PR which I started months ago to enable
  colorful tab completions within the Python REPL. I wrote the original
  version of fancycompleter 15 years ago, but colorful completions work
  only in combination with PyREPL. Now that PyREPL is part of the standard
  library and enabled by default, we can finally upstream it. I hope to see
  it merged soon.

- “JIT stuff”: I spent a considerable amount of time talking to the people
  who are working on the CPython JIT (in particular Mark, Brandt, Savannah,
  Ken Jin and Diego). Knowledge transfer worked in both directions: I
  learned a lot about the internal details of CPython’s JIT, and conversely
  I shared some of the experience, pain points and gut feelings I gained
  from working on PyPy for many years.
In particular, on the first day I presented a talk titled Tracing JIT and real world Python (slides and source code).
What follows is an annotated version of the slides.
CPython’s new JIT and PyPy’s JIT share fundamental similarities, as they’re both
tracing JITs.
I spent ~7 years of my career optimizing existing code for PyPy at a
high-frequency trading firm, and I realized that I’m probably one of the few
people in the world with actual experience in optimizing real world Python
code for a tracing JIT.
I expect that some of the challenges which I faced will still be valid also
for CPython, and I wanted to share my experience to make sure that CPython
core devs are aware of them.
One lesson I learned is that the set of benchmarks in pyperformance is a
good starting point, but it is not entirely representative of what you find
in the wild.
The main goal of the talk is not to present solutions to these problems,
but to raise awareness that they exist.
Until now, CPython’s performance has been fairly predictable: there are
well-established “performance tricks” to make code faster, and generally
speaking you can mostly reason about the speed of a given piece of code
“locally”.
Adding a JIT completely changes how we reason about performance of a given
program, for two reasons:
- JITted code can be very fast if your code conforms to the heuristics
  applied by the JIT compiler, but unexpectedly slow(-ish) otherwise;

- the speed of a given piece of code might depend heavily on what happens
  elsewhere in the program, making it much harder to reason about
  performance locally.
The end result is that modifying a line of code can significantly
impact seemingly unrelated code. This effect becomes more pronounced as the
JIT becomes more sophisticated.
The CPython JIT is still pretty new and doesn’t give huge speedups yet. I
expect that as it gets faster, its performance will start looking more and
more like PyPy’s.
I delivered this talk at the Core Dev Sprint: I expected my audience to be
familiar with CPython’s JIT, and I wanted to draw parallels with PyPy’s.
Since the audience of this blog is different, let me briefly explain
CPython’s JIT first.
The explanations of both JITs are necessarily short, incomplete and highly
simplified.
CPython JIT 101
Python source code is turned into bytecode. Bytecode is a sequence of
“opcodes” (LOAD_FAST, BINARY_OP, etc.), and the CPython VM is an
interpreter for those opcodes. Historically the VM was written by hand, and the
main loop consisted of a big switch statement which executed the code
corresponding to each opcode.
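For example, you can inspect the opcodes of a function with the standard
library’s dis module (a quick illustration of mine, not code from the talk):

```python
import dis

def add(a, b):
    return a + b

# Disassemble the function: the exact opcodes vary across CPython
# versions, but you will see things like LOAD_FAST and BINARY_OP.
dis.dis(add)
```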
Nowadays things are different: the opcodes are written in a special DSL and
the main interpreter loop is generated from this
DSL. Additionally, the DSL describes how each opcode can be decomposed into
multiple “microops”.
When the interpreter detects a “hot loop”, it starts the JIT. The JIT
retroactively looks at the opcodes which were executed in the last iteration
of the loop, and creates a “linear trace” which contains the equivalent
microops. This process is called trace projection and the result is an
unoptimized trace of microops.
Then, the JIT can produce an optimized trace, by reordering and removing
redundant microops. Finally, the optimized trace is turned into executable
code using the “copy & patch” technique.
PyPy JIT 101
CPython’s Python interpreter is written in C, and then compiled into an
executable by gcc (or any other C compiler).
Similarly, PyPy’s Python interpreter is written in RPython, and then compiled
into an executable by rpython.
Under the hood, rpython applies two separate transformations to the source
code:
- it turns each function into C code, which is then fed to gcc to get the
  final executable;

- it turns each function into “jitcodes”, which are a representation of
  RPython’s IR (intermediate representation). For each RPython function,
  the final ./pypy executable contains both its compiled representation
  (generated by gcc) and its jitcode representation (embedded as static
  data into the executable).
In a way, RPython’s jitcodes are equivalent to CPython’s microops, as they are
a low-level representation of the logic of each opcode.
When the interpreter detects a hot loop, it enters trace recording mode,
which is essentially an interpreter which executes the jitcodes: the result is
a linear unoptimized trace of all the jitcodes which were actually executed.
Similarly to CPython, PyPy then produces an optimized trace, which is then
sent to the JIT backend for actual native code generation.
Tracing JITs work by recording a trace of all microops which are
executed. The optimizer can then reason about what happens in the trace and
remove unneeded operations.
However, sometimes we encounter an operation which is a black box from the
point of view of the tracer: we call these “trace blockers”, because the
tracing JIT cannot see through them. In the case of CPython, this happens,
for example, whenever we call any function implemented in C (because it
doesn’t have any corresponding “microop”).
This is a simple function that computes pi, generated by ChatGPT. Its
precise content is not important: what matters is that it’s a nice purely
numerical loop that the PyPy JIT can optimize very well.
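The code from the slides is not reproduced inline; a comparable purely
numerical loop (my own sketch, not the exact function from the talk) could
look like this:

```python
import math

def compute_pi(n):
    # Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    # A tight, purely numerical loop that a tracing JIT optimizes well.
    acc = 0.0
    for i in range(n):
        acc += (-1.0) ** i / (2 * i + 1)
    return 4.0 * acc

print(abs(compute_pi(100_000) - math.pi))  # small: the series converges to pi
```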
Same function as above, but with a call to hic_sunt_leones(). This is
actually an empty function which does absolutely nothing, but it is
annotated in a special way so that the PyPy JIT cannot “enter” it: it
effectively behaves as a trace blocker.
In this example we use the special pypyjit.residual_call to simulate a
trace blocker, but in real life we get one whenever we call any
non-traceable function, in particular C extensions.
The clean version runs 42x faster on PyPy than CPython – that’s the JIT
working perfectly. But with just one untraceable function call added to the
loop, PyPy slows down to only 1.8x faster than CPython. That single line
destroyed most of the JIT’s effectiveness!
This happens because after the call the optimizer no longer knows whether its
assumptions about the world are still true, and thus must be much more
conservative.
I fear that for CPython, this will turn out to be a much bigger problem than
for PyPy, for two reasons:
- nowadays it’s virtually impossible to run Python code without using any
  C extension, either directly or indirectly;

- by construction, PyPy’s JIT can see much more than CPython’s JIT.
  Remember the slide about “jitcodes”: any RPython function gets a
  “jitcode” equivalent, which means that the JIT can automatically trace
  inside builtins and the internals of the interpreter, whereas CPython can
  trace only inside pure Python code.
For example, PyPy’s JIT can trace through range(), zip, and enumerate()
automatically. CPython’s JIT currently cannot because they are implemented in
C. CPython could add special cases for these common functions, but the
general approach doesn’t scale.
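As an illustration (my own snippet, not from the slides), consider a loop
over enumerate(): PyPy’s JIT traces straight through the builtin, while for
CPython’s JIT the C-implemented enumerate acts as a trace blocker today:

```python
def weighted_total(items):
    # enumerate() is implemented in C: PyPy's JIT can trace inside it,
    # CPython's JIT currently cannot.
    acc = 0
    for i, x in enumerate(items):
        acc += i * x
    return acc

print(weighted_total([2, 3, 4]))  # 0*2 + 1*3 + 2*4 = 11
```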
The second big problem is what I call “data driven control flow”. This example
has been autogenerated by ChatGPT and it’s completely silly, but it’s a good
representation of what happens in real life code.
In this example, fn takes 9 parameters, each of which can be None or a
number. The function starts with a sequence of “if ... is None” checks, and
is then called repeatedly in a loop.
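The slide’s code isn’t inline; a minimal sketch of the pattern (my
reconstruction, with hypothetical names) is:

```python
def fn(a, b, c, d, e, f, g, h, i):
    # Each parameter may independently be None, so every check below can
    # go either way: up to 2**9 = 512 distinct paths through the function.
    if a is None: a = 0
    if b is None: b = 0
    if c is None: c = 0
    if d is None: d = 0
    if e is None: e = 0
    if f is None: f = 0
    if g is None: g = 0
    if h is None: h = 0
    if i is None: i = 0
    return a + b + c + d + e + f + g + h + i

print(fn(None, 1, 2, None, 3, None, 4, 5, None))  # 15
```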
One of the assumptions of tracing JITs is that control flow tends to stay on
the “hot path”, and that it’s enough to optimize that path to get good
performance.
But in a case like this, each combination of Noneness selects a different
path, and if we assume the data is evenly distributed, we find out that
there is no hot path.
Let’s see what happens when we execute on CPython and PyPy:
PyPy without JIT is “only” 2.3x slower than CPython, but when we enable the
JIT, it becomes much worse. This happens because of an exponential
explosion of code paths seen by the JIT.
In a normal compiler, an if statement is compiled as a diamond, and the
control flow merges after each if:
A tracing JIT by definition follows what’s happening during a concrete
execution, so it sees only a concrete path in the control flow, with “guards”
to ensure correctness:
When guard(a is None) fails enough times, we create a “bridge” and record
another linear trace, following again the concrete control flow that happens
now:
guard(a is None) ----> FAIL (side exit)
/ \
/ \
a = 0 pass
\ \
\ \
guard(b not None) guard(b not None)
/ /
/ /
b = 0 b = 0
\ \
\ \
... ...
Note how b = 0 is effectively duplicated now. By design, PyPy’s JIT never
merges execution flow.
Looking inside PYPYLOG confirms our theory: we get “exponential
tracing”. The JIT has to compile separate optimized code for every unique
combination of which parameters are None and which aren’t. With 9 parameters,
that could be up to 512 different combinations!
One possible mitigation is to rewrite conditional code to be “branchless” –
using arithmetic tricks instead of if statements. But this makes code ugly and
unreadable, and it’s not always possible.
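For instance (a hypothetical sketch of mine, not from the talk), a
data-dependent if can sometimes be replaced with arithmetic so that the
trace contains no guard on the data:

```python
def clamp_branchy(x):
    # the trace forks here depending on the data
    if x < 0:
        x = 0
    return x

def clamp_branchless(x):
    # (x > 0) evaluates to 0 or 1, so this computes max(x, 0)
    # with a single straight-line trace
    return (x > 0) * x

print(clamp_branchy(-5), clamp_branchless(-5))  # 0 0
```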
Despite years of working on this, I never found a really good solution. There
were cases in which we had to continue running some piece of code on CPython
because I never managed to make the PyPy version faster.
This pattern happens quite a lot, although it is often more subtle: in this
silly example all the ifs are nicely grouped together at the start, but in a
long trace they can be scattered across multiple places, and any kind of
control flow contributes to the problem, not only ifs. In Python, this
includes any kind of dynamic dispatch, exceptions, etc.
One possible solution for CPython’s JIT is to try to merge (some) traces to
avoid or limit the exponential explosion. However, it is worth underlining that
tracing JITs shine precisely when they can optimize a long linear trace: if
you try to compile shorter traces, you might quickly end up in a situation
which is equivalent to the “trace blocker” problem described earlier.
I suspect this might be a fundamental limitation of tracing JITs.
Compared to the other two problems, this one is less serious, but it’s worth
mentioning because of the prevalence of async (and thus, implicitly,
generators) in modern Python.
Here’s another silly function that counts Pythagorean triples using nested
loops. This is our baseline version using plain loops.
Here’s the same algorithm refactored to use a generator function for the
nested iteration. The “state of iteration” is implicitly stored inside the
local variables of frame object associated to the range_product generator.
Here’s the same functionality implemented as a traditional iterator class. The
“state of iteration” is explicitly stored as attributes of RangeProductIter.
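The slide code isn’t reproduced inline; the three shapes might look roughly
like this (my reconstruction, names and loop bounds are assumptions):

```python
def count_triples_loop(n):
    # baseline: explicit nested loops
    count = 0
    for a in range(1, n):
        for b in range(1, n):
            for c in range(1, n):
                if a * a + b * b == c * c:
                    count += 1
    return count

def range_product(n):
    # generator: iteration state lives implicitly in the generator's frame
    for a in range(1, n):
        for b in range(1, n):
            for c in range(1, n):
                yield a, b, c

def count_triples_gen(n):
    return sum(1 for a, b, c in range_product(n) if a * a + b * b == c * c)

class RangeProductIter:
    # iterator class: iteration state stored explicitly as attributes
    def __init__(self, n):
        self.n = n
        self.a = self.b = 1
        self.c = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.c += 1
        if self.c >= self.n:
            self.c = 1
            self.b += 1
            if self.b >= self.n:
                self.b = 1
                self.a += 1
                if self.a >= self.n:
                    raise StopIteration
        return self.a, self.b, self.c

print(count_triples_loop(6), count_triples_gen(6))  # 2 2
```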
On CPython, the generator version is ~29% slower than the explicit loops. The
iterator class is much slower, as one would intuitively expect.
However, on PyPy we see different results: RangeProductIter runs at
basically the same speed as the baseline, while the generator version is
slower. This happens because in the case of RangeProductIter the JIT is able
to see the whole lifetime of the object and optimize it away entirely:
instance variables become local variables, the call to __next__ is inlined,
and we get the equivalent of explicit nested loops.
Generators, however, are required to create a frame object, and they
represent a fundamental case which the JIT cannot trace through effectively.
In more complex real-world scenarios, we saw much worse slowdowns than these
examples show.
This is a collection of other miscellaneous problems that I had to deal
with. Generally speaking, we lack good support for tooling and profilers:
CPython needs a good story for explaining to people what is happening when
the JIT is enabled.
Warmup is another big problem: on PyPy, very short programs tend to be
slower than on CPython because of the cost of JITting. Moreover, warmup is
not an easily definable phase, as the linked paper shows. This is an area
where CPython currently shines, as its JIT is very fast. I think it will
become somewhat slower as it tries to optimize more aggressively, but
hopefully warmup will overall be a lesser problem than on PyPy.
Moreover, it’s very easy to accidentally make your code 2x, 5x or even 10x
slower by changing seemingly innocent pieces of code. This is another reason
why good tooling is essential.
Finally, the “long tail of JITting”: every loop and every guard gets a
counter, and we start JITting when it reaches a threshold. Given a
sufficiently long running program, all counters reach the threshold eventually
and we end up JITting much more than necessary, using too much memory and/or
thrashing the cache. In many cases I found it beneficial to just disable the
JIT “after a while”, with manually tuned heuristics.
These are slides which I didn’t show during the live presentation; they show
a case where a tracing JIT can shine: since the JIT sees a complete trace of
an entire loop (including nested calls), it can easily remove a lot of the
temporary objects which usually penalize Python performance.
In many cases, we can get the famous “zero-cost abstractions”.
Let’s look at a concrete example. We need to compute the barycenter of
triangles that are serialized in a binary format. Each triangle has three
points, each point has x and y coordinates. This simulates real world
protocols such as protobuf, capnproto, etc.
This is what we use as a baseline: a bare loop, using struct.unpack_from to read 6 floats at a time.
Here’s the “proper” object-oriented approach, similar to how modern
serialization libraries work. We create Triangle and Point classes that
provide a nice API for accessing the binary data. Each property access creates
new objects and calls struct.unpack_from. This is much more readable and
reusable, but creates many temporary objects.
Here’s how you’d use the object-oriented API. The code is much cleaner and
more readable than the bare loop version. But notice how many object creations
are happening: one Triangle object, six Point objects, plus all the
intermediate tuples from struct.unpack_from.
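The slide code isn’t inline; a sketch of the two versions (my
reconstruction, assuming a layout of three points of two float64 values per
triangle) could be:

```python
import struct

TRIANGLE_SIZE = 6 * 8  # six float64 values: x0, y0, x1, y1, x2, y2

def read_loop(buf):
    # baseline: unpack 6 floats at a time, no abstractions
    res = []
    for off in range(0, len(buf), TRIANGLE_SIZE):
        x0, y0, x1, y1, x2, y2 = struct.unpack_from('dddddd', buf, off)
        res.append(((x0 + x1 + x2) / 3, (y0 + y1 + y2) / 3))
    return res

class Point:
    def __init__(self, buf, offset):
        self.buf = buf
        self.offset = offset

    @property
    def x(self):
        return struct.unpack_from('d', self.buf, self.offset)[0]

    @property
    def y(self):
        return struct.unpack_from('d', self.buf, self.offset + 8)[0]

class Triangle:
    def __init__(self, buf, offset):
        self.buf = buf
        self.offset = offset

    def point(self, i):
        # each call creates a temporary Point object
        return Point(self.buf, self.offset + i * 16)

def read_proto(buf):
    # object-oriented version: many temporary Triangle/Point objects
    res = []
    for off in range(0, len(buf), TRIANGLE_SIZE):
        t = Triangle(buf, off)
        p0, p1, p2 = t.point(0), t.point(1), t.point(2)
        res.append(((p0.x + p1.x + p2.x) / 3, (p0.y + p1.y + p2.y) / 3))
    return res
```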
As expected, on CPython read_proto is much slower than the bare one,
roughly 6x slower. However, PyPy can fully optimize away all the
abstraction overhead introduced by Triangle and Point.
In PyPy jargon we call this form of allocation removal “virtuals” (because we
create “virtual objects” whose fields are represented as local variables) and
it’s probably the single most important optimization that PyPy does.
During my week in Cambridge I talked extensively with the CPython JIT devs
about this and I hope I convinced them that this is what they should aim for
😊.
Note also that read_proto is actually faster than read_loop. This
happens because in read_loop we do a single struct.unpack_from('dddddd', ...),
while in read_proto we do a succession of six individual
struct.unpack_from('d', ...). It turns out that the JIT is able to trace
into the second form but not into the first, which means that in read_loop
we actually need to allocate a pseudo-tuple at each iteration.
The funny part is that I did not expect this result: I had to take the time
to analyze the JIT traces of both versions to understand why read_loop was
slower. This is probably the best illustration of how counterintuitive it is
to reason about performance in a JITted world.
Acknowledgments
Thanks to Carl Friedrich Bolz-Tereick and
Hood Chatham for feedback on the slides and the
post.

