NB: all cited papers are linked on RelevantPapers. This document is also available in Chinese, though we can't vouch for the translation.

We want to make Python faster, but we also want to make it easy for large, well-established applications to switch to Unladen Swallow.

- Produce a version of Python at least 5x faster than CPython.
- Python application performance should be stable.
- Maintain source-level compatibility with CPython applications.
- Maintain source-level compatibility with CPython extension modules.
- We do not want to maintain a Python implementation forever; we view our work as a branch, not a fork.
In order to achieve our combination of performance and compatibility goals, we opt to modify CPython, rather than start our own implementation from scratch. In particular, we opt to start working on CPython 2.6.1: Python 2.6 nestles nicely between 2.4/2.5 (which most interesting applications are using) and 3.x (which is the eventual future). Starting from a CPython release allows us to avoid reimplementing a wealth of built-in functions, objects and standard library modules, and allows us to reuse the existing, well-used CPython C extension API. Starting from a 2.x CPython release allows us to more easily migrate existing applications; if we were to start with 3.x, and ask large application maintainers to first port their application, we feel this would be a non-starter for our intended audience.

The majority of our work will focus on speeding the execution of Python code, while spending comparatively little time on the Python runtime library. Our long-term proposal is to replace CPython's custom virtual machine with a JIT built on top of LLVM, while leaving the rest of the Python runtime relatively intact. We have observed that Python applications spend a large portion of their time in the main eval loop. In particular, even relatively minor changes to VM components such as opcode dispatch have a significant effect on Python application performance. We believe that compiling Python to machine code via LLVM's JIT engine will deliver large performance benefits.

Some of the obvious benefits:

- Moving to a JIT will also allow us to move Python from a stack-based machine to a register machine, which has been shown to improve performance in other similar languages (Ierusalimschy et al, 2005; Shi et al, 2005).
- Eliminating the need to fetch and dispatch opcodes should alone be a win, even if we do nothing else. See http://bugs.python.org/issue4753 for a discussion of CPython's current sensitivity to opcode dispatch changes.
- The current CPython VM opcode fetch/dispatch overhead makes implementing additional optimizations prohibitive. For example, we would like to implement type feedback and dynamic recompilation à la SELF-93 (Hölzle, Chambers and Ungar, 1992), but we feel that implementing the polymorphic inline caches in terms of CPython bytecode would be unacceptably slow.
- LLVM in particular is interesting because of its easy-to-use codegen available for multiple platforms and its ability to compile C and C++ to the same intermediate representation we'll be targeting with Python. This will allow us to do inlining and analysis across what is currently a Python/C language barrier.
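To make the polymorphic inline cache idea mentioned in the list above more concrete, here is a minimal sketch in modern Python. It is only an illustration of the general technique, not Unladen Swallow code: the `InlineCache` class, its `max_entries` limit and the `Square`/`Rect` example classes are all hypothetical, and a real cache would also have to be invalidated when a class is modified.

```python
class InlineCache:
    """Toy per-call-site cache: receiver type -> already-resolved method.

    Hypothetical illustration only; it never invalidates entries, so it
    assumes classes are not modified after their first use at this site.
    """

    def __init__(self, method_name, max_entries=4):
        self.method_name = method_name
        self.max_entries = max_entries  # bound on how polymorphic a site may get
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def call(self, receiver, *args):
        cls = type(receiver)
        func = self.entries.get(cls)
        if func is None:
            # Slow path: full attribute lookup on the class.
            self.misses += 1
            func = getattr(cls, self.method_name)
            if len(self.entries) < self.max_entries:
                self.entries[cls] = func
        else:
            # Fast path: the type was seen before at this call site.
            self.hits += 1
        return func(receiver, *args)


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rect:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height
```

A call site that mostly sees one or two receiver types pays for the full lookup once per type and takes the fast path afterwards; the recorded types are also the kind of feedback a recompiler could use.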
With the infrastructure to generate machine code comes the possibility of compiling Python into a much more efficient implementation. For example, take the snippet

    for i in range(3):
        foo(i)

This currently desugars to something like

    $x = range(3)
    while True:
        try:
            i = $x.next()
        except StopIteration:
            break
        foo(i)

Once we have a mechanism to know that range() means the range() builtin function, we can turn this into something more akin to

    for (i = 0; i < 3; i++)
        foo(i)

in C, possibly using unboxed types for the math. We can then unroll the loop to yield

    foo(0)
    foo(1)
    foo(2)

We intend to structure Unladen Swallow's internals to assume that multiple cores are available for our use. Servers are only going to acquire more and more cores, and we want to exploit that to do more and more work in parallel. For example, we would like to have a concurrent code optimizer that applies increasingly-expensive (and beneficial!) optimizations in parallel with code execution, using another core to do the work. We are also considering a concurrent garbage collector that would, again, utilize additional cores to offload work units. Since most production server machines are shipping with between 4 and 32 cores, we believe this avenue of optimization is potentially lucrative. However, we will have to be sensitive to the needs of highly-parallel applications and not consume extra cores blindly.

Note that many of the areas we will need to address have been considered and developed by the other dynamic language implementations like MacRuby, JRuby, Rubinius and Parrot, and in particular other Python implementations like Jython, PyPy, and IronPython. In particular, we're looking at these other implementations for ideas on debug information, regex performance ideas, and generally useful performance ideas for dynamic languages. This is all fairly well-trodden ground, and we want to avoid reinventing the wheel as much as possible.
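The range() loop example earlier in this section can be checked by hand. The sketch below is written for modern Python 3, so it uses `iter()` and `next()` where the Python 2 pseudocode calls `.next()`, and it passes an explicit `log` list to a stand-in `foo` so the three forms can be compared. All function names here are illustrative assumptions.

```python
def foo(i, log):
    # Stand-in for the foo() of the example; records each call.
    log.append(i)


def with_for_loop():
    log = []
    for i in range(3):
        foo(i, log)
    return log


def desugared():
    # What the for loop roughly expands to: explicit iterator protocol.
    log = []
    iterator = iter(range(3))
    while True:
        try:
            i = next(iterator)
        except StopIteration:
            break
        foo(i, log)
    return log


def unrolled():
    # What a compiler could emit once it knows range() is the builtin.
    log = []
    foo(0, log)
    foo(1, log)
    foo(2, log)
    return log
```

All three functions make the same calls in the same order. The unrolled form is only valid when the compiler can prove that `range` still refers to the builtin, which is why the text makes that knowledge a precondition.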
Unladen Swallow will be released every three months, with bugfix releases in between as necessary.

Q1 will be spent making relatively minor tweaks to the existing CPython implementation. We aim for a 25-35% performance improvement over our baseline. Our goals for this quarter are conservative, and are aimed at delivering tangible performance benefits to client applications as soon as possible, that is, without waiting until the completion of the project. Ideas for achieving this goal:

- Re-implement the eval loop in terms of vmgen.
- Experiment with compiler options such as 64 bits, LLVM's LTO support, and gcc 4.4's FDO support.
- Replace rarely-used opcodes with functions, saving critical code space.
- Improve GC performance (see http://bugs.python.org/issue4074).
- Improve cPickle performance. Many large websites use this heavily for interacting with memcache.
- Simplify frame objects to make frame alloc/dealloc faster.
- Implement one of the several proposed schemes for speeding lookups of globals and builtins.
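As an illustration of the last idea in the list above, one general way to speed up lookups of globals and builtins is to cache the result at the use site and revalidate it with a cheap version check. The sketch below is one such scheme under stated assumptions, not necessarily any of the proposals the plan refers to: `VersionedNamespace` and `CachedGlobalLookup` are hypothetical names, and only item assignment and deletion bump the version (methods such as `update()` or `pop()` are not covered).

```python
class VersionedNamespace(dict):
    """dict that bumps a version counter on item assignment and deletion.

    Hypothetical helper: other mutating methods (update, pop, ...) are
    deliberately left out to keep the sketch short.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 0

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.version += 1

    def __delitem__(self, key):
        super().__delitem__(key)
        self.version += 1


class CachedGlobalLookup:
    """One use site of a global name, with a version-validated cache."""

    def __init__(self, name, module_globals, builtin_namespace):
        self.name = name
        self.module_globals = module_globals
        self.builtin_namespace = builtin_namespace
        self._cached_versions = None
        self._cached_value = None
        self.slow_lookups = 0

    def load(self):
        versions = (self.module_globals.version, self.builtin_namespace.version)
        if versions != self._cached_versions:
            # Slow path: globals first, then builtins, as the language requires.
            self.slow_lookups += 1
            if self.name in self.module_globals:
                value = self.module_globals[self.name]
            else:
                value = self.builtin_namespace[self.name]
            self._cached_value = value
            self._cached_versions = versions
        return self._cached_value
```

As long as neither namespace changes, repeated loads cost one tuple comparison instead of up to two dictionary lookups; shadowing a builtin in the module namespace bumps the version and forces a fresh lookup.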
The 2009Q1 release can be found in the release-2009Q1-maint branch. See Releases for our performance relative to CPython 2.6.1.

Q2 will focus on eliminating the Python VM and replacing it with a functionally-equivalent implementation in terms of LLVM. We anticipate some performance improvement, but that is not the primary focus of the 2009Q2 release. We will focus on just getting something working on top of LLVM. Making it faster will come in subsequent quarters.

Goals:

The plan for Q3 onwards is to simply iterate over the literature. We aspire to do no original work, instead using as much of the last 30 years of research as possible. See RelevantPapers for a partial (by no means complete) list of the papers we plan to implement or draw upon.

We plan to address performance considerations in the regular expression engine, as well as any other extension modules found to be bottlenecks. However, regular expressions are already known to be a good target for our work and will be considered first for optimization.

In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM's Recycler (Bacon et al, 2001).

Our long-term goal is to make Python fast enough to start moving performance-important types and functions from C back to Python.

The exact performance targets for the 2009Q3 release will be finalized some time in Q2.

Once we get the quick improvements out of the way and convince some people to adopt our Python, we'll start switching to LLVM, probably initially in a separate branch. This can go in lots of stages:

1. Add llvmmodule and llvmfunction types in Python, which interface to the LLVM Module and Function types. Teach llvmmodule to read bitcode files and pretty-print them. (Reading the bitcode files isn't really necessary for the next step, but it'll help with testing this one.)
   llvm-py has such wrapper types, but using Python-defined types in the core interpreter is very difficult, so we'll write our own. These won't be full wrappers (leave that to llvm-py). Just what's convenient and useful for debugging.
2. Add a compile_llvm builtin that uses most of the same code as the compile builtin but generates LLVM IR (as the llvmmodule or llvmfunction type) instead. We can get this working incrementally and keep a working Python interpreter the whole time. Along the way, we'll probably take each opcode and translate its ceval implementation into an IRBuilder implementation, but for difficult opcodes we could just emit calls to an interpretation function. We may want to/have to mix this step with the next one. Several Python constructs may give us trouble here:
   - Generators: I don't have a plan for generators yet. We'd have to save the PC, registers, and locals, except that LLVM doesn't expose a PC, just labels, so we'd probably give each yield an index and switch on the current index to get back to it. And we could require that all registers have been spilled at that point, maybe. But I don't know the details of doing most of that. I'm hoping to figure it out when I get there.
   - Closures: A closure is just a function with a pre-bound parameter representing the environment.
   - ExceptionHandling
   - The FunctionCallingConvention
3. Add a way to call llvmfunctions. The plain LLVM interpreter may not be able to call external functions by the time we get here. JITting may be slightly more difficult. We'll do whatever's easier. All we want is to run stuff. Write lots of tests.
4. Replace the Python-bytecode compiler with the new LLVM compiler, possibly under a flag. Talin suggests having both run and select the one to execute based on a flag in the code/function object. The flag will let us run regrtest over all the tests in both modes.
5. Figure out the simplest possible version of .pyc persistence. I'm hoping the obvious thing -- dump the python module as an LLVM Module in bitcode to disk -- just works. We can add the fancy profiling data and pre-optimized code later (step 12).
6. Teach Python how to decide whether to fast-JIT or slow-JIT a particular function (pass fast==true to JIT::create). nlewycky says, "The codegen is roughly 3x faster in fast mode (note: measured in debug mode not release mode, so who knows)." I imagine that we'll have to tune a bit to find the right balance. At this point I expect Python to get faster than our original release.
7. Start adding optimizations.
   Talin points out that it'll be hard to apply most of the pre-defined optimizations at first because Python calls are indirect. This motivates type inference so we can make the calls direct.

8+ can be tried in parallel, probably starting toward the end of 2009Q2.

8. Generate better LLVM IR than a direct translation from CPython bytecode would imply.
9. Figure out how to specialize functions for particular types, which allows better inlining. We should be able to do as well as psyco pretty easily since we can steal their algorithms.
   a. Allow users to annotate functions to tell the interpreter which specializations to use.
   b. Profile calls and infer common argument types.
10. If the optimizations become expensive enough that the JIT hurts our startup time, we can profile for expensive functions and only optimize them. This resembles how HotSpot works.
11. 9b and 10 require us to be able to keep multiple versions of code around for any given function and backpatch existing callers. We'll probably want to GC the multiple versions, which should be interesting.
12. Persist the generated and optimized IR to disk in place of .pyc files, perhaps along with any gathered profile data. This allows subsequent runs to take advantage of the optimization work previous runs have already done. Matches LLVM's promise to be "A Compilation Framework for Lifelong Program Analysis & Transformation". :)
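The generator idea floated in the stages above (give each resume point an index, spill the locals, and switch on the saved index to get back in) can be illustrated in plain Python. This is a hand-written sketch of the general state-machine transformation, not the design the project settled on; `CountdownStateMachine` and `countdown` are hypothetical names.

```python
def countdown(n):
    # Reference generator that the class below mimics by hand.
    yield "start"
    while n > 0:
        yield n
        n -= 1
    yield "done"


class CountdownStateMachine:
    """Hand-compiled countdown(): locals live on the object ("spilled"),
    and resume_index records where execution continues on the next call."""

    def __init__(self, n):
        self.n = n
        self.resume_index = 0

    def __iter__(self):
        return self

    def __next__(self):
        # The "switch" on the saved index.
        if self.resume_index == 0:
            # Function entry: run up to the first yield.
            self.resume_index = 1
            return "start"
        if self.resume_index == 2:
            # Resuming just after `yield n`: finish the loop body,
            # then fall through to the loop head.
            self.n -= 1
            self.resume_index = 1
        if self.resume_index == 1:
            # Loop head: either yield the next value or leave the loop.
            if self.n > 0:
                self.resume_index = 2
                return self.n
            self.resume_index = 3
            return "done"
        # Index 3: the function body has finished.
        raise StopIteration
```

Iterating over `CountdownStateMachine(2)` produces the same sequence as `countdown(2)`. In an LLVM setting the `if` chain would presumably become a switch to labelled blocks, which matches the text's observation that LLVM exposes labels rather than a program counter.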
Unladen Swallow maintains a directory of interesting performance tests under the tests directory. perf.py is the main interface to the benchmarks we care about, and will take care of priming runs, clearing *.py[co] files and running interesting statistics over the results.

Unladen Swallow's benchmark suite is focused on the hot spots in major Python applications, in particular web applications. The major web applications we have surveyed have indicated that they bottleneck primarily on template systems, and hence our initial benchmark suite focuses on them:

- Django and Spitfire templates. Two very different ways of implementing a template language.
- 2to3. Translates Python 2 syntax to Python 3. Has an interesting, pure-Python kernel that makes heavy use of objects and method dispatch.
- Pickling and unpickling. Large-scale web applications rely on memcache, which in turn uses Python's Pickle format for serialization.
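The priming runs and summary statistics attributed to perf.py above can be sketched for a pickle round trip as follows. This is not perf.py itself and does not reproduce its interface: `time_benchmark`, `pickle_roundtrip`, the run counts and the payload shape are all assumptions made for illustration, and the code targets modern Python 3, which has no separate cPickle module.

```python
import pickle
import statistics
import time


def time_benchmark(func, warmup_runs=2, timed_runs=5):
    """Run func a few times untimed (priming), then time it repeatedly."""
    for _ in range(warmup_runs):
        func()
    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "runs": len(samples),
        "min": min(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # needs at least two samples
    }


def pickle_roundtrip():
    # Hypothetical memcache-style payload: many small keyed records.
    payload = {"user_%d" % i: list(range(20)) for i in range(200)}
    restored = pickle.loads(pickle.dumps(payload, protocol=2))
    assert restored == payload
    return restored
```

Calling `time_benchmark(pickle_roundtrip)` returns a small dictionary of timings. Reporting the spread alongside the mean makes it easier to tell a real change from noise between two builds.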
Apart from these, our benchmark suite includes several crap benchmarks like Richards, PyStone and PyBench; these are only included for completeness and comparison with other Python implementations, which have tended to use them. Unladen Swallow does not consider these benchmarks to be representative of real Python applications or Python implementation performance, and does not run them by default or make decisions based on them.

For charting the long-term performance trend of the project, Unladen Swallow makes use of Google's standard internal performance measurement framework. Project members will post regular performance updates to the mailing lists. For testing individual changes, however, using perf.py as described on the Benchmarks page is sufficient.

In order to ensure correctness of the implementation, Unladen Swallow uses both the standard Python test suite and a number of third-party libraries that are known-good on Python 2.6. In particular, we test third-party C extension modules, since these are the easiest to break via unwitting changes at the C level.

As work on the JIT implementation moves forward, we will incorporate a fuzzer into our regular test run. We plan to reuse Victor Stinner's Fusil Python fuzzer as much as possible, since it a) exists, and b) has been demonstrated to find real bugs in Python.

Unladen Swallow maintains a BuildBot instance that runs the above tests against every commit to trunk.

- May not be able to merge back into mainline. There are vocal, conservative senior members of the Python core development community who may oppose the merger of our work, since it will represent such a significant change. This is a good thing! Resistance to change can be very healthy in situations like this, as it will force a thorough, public examination of our patches and their possible long-term impact on the maintenance of CPython -- this is open source, and another set of eyes is always welcome.
  We believe we can justify the changes we're proposing, and by keeping in close coordination with Guido and other senior members of the community we hope to limit our work to only changes that have a good chance of being accepted. However, there is still the chance that some patches will be rejected. Accordingly, we may be stuck supporting a de facto separate implementation of Python, or as a compromise, not being as fast as we'd like. C'est la vie.
- LLVM comes with a lot of unknowns: Impact on extension modules? JIT behaviour in multithreaded apps? Impact on Python start-up time?
- Windows support: CPython currently has good Windows support, and we'll have to maintain that in order for our patches to be merged into mainline. Since none of the Unladen Swallow engineers have any/much Windows experience or even Windows machines, keeping Windows support at an acceptable level may slow down our forward progress or force us to disable some performance-beneficial code on Windows. Community contributions may be able to help with this.
All communication about Unladen Swallow should take place on the Unladen Swallow list. This is where design issues will be discussed, as well as notifications of continuous build results, performance numbers, code reviews, and all the other details of an open-source project.
Why not apply this work to Python 3? If you are successful it may encourage the adoption of Python 3 more quickly (and you obviously won't have to port this work to Python 3). --Nick
I have the same question to chinbillybilbo
I have the same question to ctismer!!!!!!!!!!one
chinbillybilbo: the applications we are trying to speed up all use Python 2.x. If we required all these applications to first port to Python 3 in order to get any performance benefit, we feel that would be a non-starter for the applications we're focusing on.
Very interesting.
W.r.t. the GC: Apple released their GC under the Apache license (http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/). Its finalizer semantics seem slightly incompatible with those of Python, in that a finalizer may not revive an object, but it might be possible to work around that.
is ctismer sarcastic? This project looks incredibly professional. Good luck guys!
Reading through all of this got me really excited about this project, and then I read the killer: "Will probably kill Python Windows support (for now).". For integration reasons, my company is currently stuck on Windows!
Anyways, I'm still excited about this project! Making Python faster can only be good for the community!
I think this is great. If it just reduces my AppEngine costs, I am happy. Please test Twisted too.
the creator of ruby is tweeting about this http://twitter.com/yukihiro_matz
Today's ars technica article http://arstechnica.com/open-source/news/2009/03/google-launches-project-to-boost-python-performance-by-5x.ars calls this a Google project. But I see no such claim, just the fact that you have access to Google's performance measurement framework. And of course, anyone can start a code.google.com project. Can you clarify for those curious?
I think a mainline branch of Python will need to continue to provide an interpreter, at least as a compile-time option. You can't beat plain old C for portability, and LLVM has a non-trivial memory footprint (if you load up all the optimization modules) which might not work so well for embedded systems.
Maintaining the C API compatibility will also be an interesting problem. Removing the GIL, creating an advanced garbage collector, and moving away from reference counting will probably go hand-in-hand, and the refcount-oriented single-threaded C API is not going to make things easy. You may want to leave open the possibility of designing a new C API (sans refcounting) while supporting the legacy API using proxy objects or object handles -- you can't have a copying/compacting GC if C code has direct pointers to Python objects.
Is it possible to have a bit of background on who is financing the project, and the history of the people working on it?
The plans look very real but it sounds strange to see a "let's make python 5x faster in less than one year", without even Guido participating.
Here at Red Hat we use Python for a lot of things. What we've observed is that execution performance is not the main issue (although improving it would be greatly appreciated), rather it's the memory footprint which is the problem we most often encounter. If anything can be done to reduce the massive amount of memory Python uses it would be a huge win. I would encourage you to consider memory usage as just as important a goal as execution speed if you're going to tackle optimizing CPython.
This project is Google-financed, but not Google-owned. The two engineers working on this are full-time Google engineers in the compiler optimization team, working on this as their main project. We have other Googlers contributing patches in their 20% time.
Despite that, this is not Google's property. We are pushing changes upstream as quickly as we can, some of which are already in CPython mainline. Google cares a great deal about performance, and because we realize that other people do too, we want to contribute our work back to the open-source world so that everyone can reap the benefit.
Sounds great, good luck!
Why this instead of contributing to Parrot? It seems like both projects have overlapping goals.
This is wonderful! I was hoping Google would start such a project, since Python is one of Google's official languages. I'll be following this over the coming months. A big thanks for sharing your work!
"Unladen Swallow"? This Monty Python reference has me thinking that this is leading up to a nice April Fools Day!
svn looks authentic, so I could be mistaken :)
neenad- "Python" is a Monty Python reference!
Awesome project. LLVM is great for VMs and Python needs this. Apple uses LLVM for their OS architecture and graphics with OpenGL. It could be that Python will replace Java in the future.
There was another "Python3k" project to use a 64-bit architecture on top of the Apache Portable Runtime for the VM called Prothon. Check out this comp.lang.python thread:
http://groups.google.com/group/comp.lang.python.announce/browse_thread/thread/1e6ebaa7b2c98994/acb0e1edb2ca449a?lnk=st&q=comp.lang.python+prothon#acb0e1edb2ca449a
(or Google search: comp.lang.python prothon hahn collins)
...also check out this follow up which includes suggestions that should have made it into python3k but didn't:
http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-03/4822.html (search: comp.lang.python prothon hahn zipher python3k)
frikker: I think I wasn't sarcastic, but upset, because the project seemed to completely ignore PyPy. But I wish them luck, too. Would love to see more collaboration between projects, actually.
This is definitely a need for Python users and developers. Really impatient to use it!