Python VM Summit - Rough Notes

In parallel with the two days of tutorials at Pycon, there are a couple of day-long meetings for invited folks active in the evolution of the language itself. Today was the VM summit, which focuses on the major Python interpreter implementations (CPython, PyPy, Jython, IronPython), the current status of each, and where things are likely to head in the near and long term. (The Thursday session focuses more on the evolution of the language itself, as well as of the wider ecosystem.)

CPython and PyPy both had multiple devs at the summit, and IronPython and Jython devs were also there (although IronPython got to share theirs with CPython). We also had some Parrot VM folks there, as well as one of the Mozilla JavaScript devs - a bunch of issues with VM development for dynamic languages apply across languages, despite differences in the surface syntax.

The notes below are probably too cryptic to make sense out of context, but will hopefully give the gist of what was discussed. These notes are my interpretation of what was said, and may or may not reflect what people actually meant. Names omitted to protect the guilty (and because I didn't write them down).

Commit rights for other VM core devs
  - good idea
  - did some of this last Pycon US
  - will look into adding more this week

Splitting out the standard library and test suite (again)
  - duplication of effort between CPython/IronPython/Jython/PyPy
  - shared commit rights intended to make it easier near term to use CPython as master, allowing bugs to be fixed "upstream"
  - hg transition should make sharing easier
  - main CPython release will stay "batteries included"
  - open to the idea of providing "CPython minimal" and "standard library" downloads (but much work to be done in defining a minimum set)
  - longer term, may want to separate pure-Python stdlib development from "C skills required" hacking on the CPython interpreter core and C accelerated implementation modules for the stdlib

Speed benchmarking
  - speed.pypy.org (very cool!)
  - benchmarks originally chosen by Unladen Swallow team
  - PSF may talk to OSU OSL about setting up speed.python.org
  - benchmark multiple versions of CPython, as well as Jython and IronPython
  - currently benchmarks are 2.x specific, may be a while before 3.x can be compared fully
  - may be GSoC projects in:
      - improving backend infrastructure to handle more interpreters
      - porting benchmarks to Python 3
  - can highlight key performance differences between the implementations (e.g. slowspitfire vs spitfire-cstringio)

Python.org download pages
  - should start recommending alternative interpreters more prominently
  - PyPy likely to be faster for pure Python on major platforms
  - IronPython/Jython/CPython still best at integration with their respective environments (Java libraries, .NET libraries, C extensions)

Cool hacks
  - Maciej: PyPy JIT viewer
  - Dave Malcolm: CPython heap viewer in gdb 7
 
Parrot VM (and JIT for dynamic languages)
  - target VM for dynamic languages (primarily Perl 6 and Tcl at the moment)
  - loadable operations, loadable object types 
  - dynamic ops were the original speed target, now moving towards dynamic types instead
  - exploring reducing number of core ops to make JIT more practical
  - looking into taking advantage of LLVM
  - Unladen Swallow blazed this trail, so LLVM now has better dynamic language support than it used to
  - PyPy has tried and failed to use LLVM as an effective backend
  - some issues may have been fixed due to Unladen Swallow's efforts, but others still exist (e.g. problems with tail recursion)
  - SpiderMonkey similarly struggles with JIT and dynamic patching issues
  - GNU Lightning and LiveJIT projects noted, but nobody really familiar with them 
  - any future Python-on-Parrot efforts likely to focus on using PyPy frontend with Parrot as a backend
  - proof-of-concept written (for a thesis?) that used .NET as a backend target for PyPy
  - original Python-on-Parrot ran into problems due to semantic mismatches between Perl 6 and Python - it reached the limits of the degree of difference the Perl 6 toolchain was willing to tolerate
 
Role of the PSF
  - supports Python the Language, not just CPython the Reference Interpreter
  - could use additional feedback on how to better fulfill that role
  - getting the "boring stuff" done?
  - project-based grants, not blanket personal funding
  - project proposals requiring more funds than the PSF can provide are still valuable, as PSF can help facilitate co-sponsorships (however, still a novel concept - only been done once so far).

2.7 to 3.2
  - PyPy just reaching feature parity with 2.7
  - PyPy now becoming far more interesting for production usage
  - treat PyPy's Python 3 dialect like a major Python library (e.g. sponsored by the PSF)

CPython warnings for reliance on implementation details
  - ResourceWarning was a nice addition (detects reliance on refcounting for resource cleanup; see the sketch after these notes)
  - non-string keys in class namespaces would be another good candidate for a warning
  - clarifying finalisation-at-shutdown semantics would be nice (but fixing those semantics in CPython first would help with that)
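
A minimal sketch of the kind of code ResourceWarning (new in Python 3.2) is meant to flag - the filename is hypothetical, and since the warning is ignored by default it has to be enabled explicitly:

import gc
import warnings

warnings.simplefilter("always", ResourceWarning)

def leaky_first_line(path):
    # Relies on refcounting to close the file promptly - fine on CPython,
    # but the file may stay open indefinitely on other implementations
    return open(path).readline()

leaky_first_line("setup.py")
gc.collect()   # CPython warns as soon as the file object is unreachable;
               # other VMs may not emit it until a collection like this runs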

What is a Python script?

This is an adaptation of a lightning talk I gave at PyconAU 2010, after realising a lot of the people there had no idea about the way CPython's concept of what could be executed had expanded over the years since version 2.4 was released. As of Python 2.7, there are actually 4 things that the reference interpreter will accept as a main module.

Ordinary scripts: the classic main module identified by filesystem path, available for as long as Python has been around. Can be executed without naming the interpreter through the use of file associations (Windows) or shebang lines (pretty much everywhere else).
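
For example, a classic script might look like this (the contents are hypothetical):

#!/usr/bin/env python
# hello.py: run as "python hello.py", or directly as "./hello.py" once
# marked executable (the shebang line above tells the OS which
# interpreter to use)
print("Hello, World!")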

Module name: By using the -m switch, a user can tell the interpreter to locate the main module based on its position in the module hierarchy rather than by its location on the filesystem. This has been supported for top level modules since Python 2.4, and for all modules since Python 2.5 (via PEP 338). Correctly handles explicit relative imports since Python 2.6 (via PEP 366 and the __package__ attribute). The classic example of this usage is the practice of invoking "python -m timeit 'snippet'" when discussing the relative performance of various Python expressions and statements.

Valid sys.path entry: If a valid sys.path entry (e.g. the name of a directory or a zipfile) is passed as the script argument, CPython will automatically insert that location at the beginning of sys.path, then use the module name execution mechanism to look for a __main__ module with the updated sys.path. Supported since Python 2.6, this system allows quick and easy bundling of a script with its dependencies for internal distribution within a company or organisation (external distribution should still use proper packaging and installer development practices). When using zipfiles, you can even add a shebang line to the zip header or use a file association for a custom extension like .pyz and the interpreter will still process the file correctly.
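
As a rough illustration (all filenames are hypothetical), the standard library's zipfile module is enough to build such a bundle, using the context manager support added in 2.7:

import zipfile

# The entry point must be stored in the archive as __main__.py
with zipfile.ZipFile("app.pyz", "w") as bundle:
    bundle.write("main_logic.py", arcname="__main__.py")
    bundle.write("helper.py")   # a bundled dependency

# The result can then be executed directly:  python app.pyz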

Package name: If a package name is passed as the value for the -m switch, the Python interpreter will reinterpret the command as referring to a __main__ submodule within that package. This version of the feature was added in Python 2.7, after some users objected to the removal in Python 2.6 of the original (broken) code that incorrectly allowed a package's __init__.py to be executed as the main module. Starting in Python 3.2, CPython's own test suite supports this feature, allowing it to be executed as "python -m test".
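
For example, a hypothetical package laid out as follows can be executed with "python -m mypkg":

mypkg/
    __init__.py     # marks the directory as a package
    __main__.py     # located and run by "python -m mypkg"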

The above functionality is exposed via the runpy module, as runpy.run_module() and runpy.run_path().
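
Here's a rough sketch of driving those APIs directly (the module and file names are hypothetical):

import runpy

# Roughly what "python -m mypkg" does (minus command line handling):
main_globals = runpy.run_module("mypkg", run_name="__main__", alter_sys=True)

# Roughly what "python app.pyz" or "python some_dir" does:
main_globals = runpy.run_path("app.pyz", run_name="__main__")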

If anyone ever sees me (metaphorically) jumping up and down about making sure things get mentioned in the What's New document for a new Python version, this is why. Python 2.6 was released in October 2008, but we didn't get the note about the zipfile and directory execution trick into the What's New until February 2010. It is described in the documentation, but really, who reads the command line documentation, or is likely to be casually browsing the runpy docs? This post turning up on Planet Python will probably do more to get the word out about the functionality than anything we've done before now :)

Status quo wins a stalemate

Sometimes language design arguments can reach a point of stalemate. The status quo is only arguably flawed, and there are also perceived flaws in any or all of the proposed alternatives. An appropriate shared design principle can help identify when this point has been reached, and let the discussion die a natural death rather than endlessly rehashing the same points without anyone changing their opinion.

Every time we (python-dev) change anything significant, no matter how positive the end result, it can create a lot of churn in the community. Books need to be rewritten, other implementations modified, advice, recipes and examples updated, questions clarified as to which version they relate to, and version compatibility issues monitored closely for projects that need to cope with older execution environments.

So, before any significant changes are made, we want to be fairly certain that the gain in clarity for future Python programs is worth the inevitable near term costs as the update ripples across the Python ecosystem. Sometimes newcomers have some interesting ideas, but still fail to clear this hurdle. The simple "it's not worth the hassle" response they're likely to receive may then come across as stodgy developers rejecting an outsider's ideas without adequate consideration.

This was something that came up fairly often during the Python 3000 mailing list discussions, to the point where I posted a message explaining why the principle of "Status quo wins a stalemate" is a very practical way to avoid meaningless churn in the language design and to cut short design discussions that obviously aren't going anywhere productive.

Python 3000 was already going to have a lot of major changes (most notably, finally improving the non-ASCII text handling story, in a way that means most Python 3 libraries and applications will be more likely to get it right). We needed to ride close herd on the design discussions to try to make sure that gratuitous changes with insufficient long term benefits were avoided.

So, lambda eventually stayed and map() and filter() were retained as builtins, while the attractive nuisance that is reduce() was merely banished to the functools module rather than being dropped entirely as originally proposed. PEP 348 was rejected and replaced by the far less ambitious PEP 352. str.format() was still added, but as a complement to the legacy percent formatting mechanism rather than as a wholesale replacement.

Untold numbers of ideas on the mailing lists and the tracker were dropped with "too much pain for not enough benefit" as the rationale. More recently, PEP 3003 was instituted to enforce a moratorium on core language changes for Python 3.2 in order to give the rest of the community more time to catch up to Python 2.7 and the 3.x series, even though we knew it meant delaying good ideas like the improved generator refactoring capabilities provided by PEP 380.

The fact that Python 3 migration support tools like 2to3, 3to2 and the six module work as well as they do is probably due to this principle of language design as much as it is to any other factor (not to take anything away from the fine work that has gone into implementing them, of course!).

Posting code and syntax highlighting

Before publishing the previous post, I looked into recommendations for syntax highlighting in coding-oriented blogs. In a quick search, syntaxhighlighter showed up repeatedly as the preferred choice, so that's what I went with.

It looks like I'm not the only one that isn't entirely happy with that solution (although by using the "pre" tags rather than "script", my code should at least appear in the RSS feed).

Working with ReST would certainly be easier than the semi-HTML I'm currently using. Still, I think I have plenty to learn about Blogger's formatting tools before I abandon them entirely in favour of preformatted posts (which have their own drawbacks).

Justifying Python language changes

A few years back, I chipped in on python-dev with a review of syntax change proposals that had made it into the language over the years. With Python 3.3 development starting and the language moratorium being lifted, I thought it would be a good time to tidy that up and republish it as a blog post.

Generally speaking, syntactic sugar (or new builtins) need to take a construct in idiomatic Python that is fairly obvious to an experienced Python user and make it obvious to even new users, or else take an idiom that is easy to get wrong when writing (or miss when reading) and make it trivial to use correctly.

Providing significant performance improvements (usually in the form of reduced memory usage or increased speed) also counts heavily in favour of new constructs.

I strongly suggest browsing through past PEPs (both accepted and rejected ones) before proposing syntax changes, but here are some examples of syntactic sugar proposals that were accepted.

List/set/dict comprehensions
(and the reduction builtins any(), all(), min(), max(), sum())

target = [op(x) for x in source]
instead of:
target = []
for x in source:
    target.append(op(x))
The transformation (`op(x)`) is far more prominent in the comprehension version, as is the fact that all the loop does is produce a new list. I include the various reduction builtins here, since they serve exactly the same purpose of taking an idiomatic looping construct and turning it into a single expression.
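
The reduction builtins collapse the corresponding accumulation loops in the same way (using the same placeholder names as the example above):

total = 0
for x in source:
    total += x
# becomes the single expression:
total = sum(source)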

Generator expressions
total = sum(x*x for x in source)
instead of:
def _g(source):
    for x in source:
        yield x*x
total = sum(_g(source))
or:
total = sum([x*x for x in source])
Here, the GE version has obvious readability gains over the generator function version (as with comprehensions, it brings the operation being applied to each element front and centre instead of burying it in the middle of the code, and it allows reduction operations like sum() to retain their prominence), but it doesn't actually improve readability significantly over the second LC-based version. The gain over the latter, of course, is that the GE-based version needs a lot less memory than the LC version. Since it consumes the source data incrementally, it can work on source iterators of arbitrary (even infinite) length, and it can also cope with source iterators with large time gaps between items (e.g. reading from a socket), as each item is processed as it becomes available. (Obviously, the latter two features aren't useful in conjunction with reduction operations like sum(), but they can be helpful in other contexts.)

With statements
with lock:
    # perform synchronised operations
instead of:
lock.acquire()
try:
    # perform synchronised operations
finally:
    lock.release()
This change was a gain for both readability and writability - there were plenty of ways to get this kind of code wrong (e.g. leave out the try-finally altogether, acquire the resource inside the try block instead of before it, call the wrong method or spell the variable name wrong when attempting to release the resource in the finally block), and it wasn't easy to audit because the resource acquisition and release could be separated by an arbitrary number of lines of code. By combining all of that into a single line of code at the beginning of the block, the with statement eliminated a lot of those issues, making the code much easier to write correctly in the first place, and also easier to audit for correctness later (just make sure the code is using the correct context manager for the task at hand).

Function decorators
@classmethod
def f(cls):
    # Method body
instead of:
def f(cls):
    # Method body
f = classmethod(f)
Easier to write (function name only written once instead of three times), and easier to read (decorator names up top with the function signature instead of buried after the function body). Some folks still dislike the use of the @ symbol, but compared to the drawbacks of the old approach, the dedicated function decorator syntax is a huge improvement.

Conditional expressions
x = A if C else B
instead of:
x = C and A or B
The addition of conditional expressions arguably wasn't a particularly big win for readability, but it was a big win for correctness. The and/or based workaround for the lack of a true conditional expression was not only hard to read if you weren't already familiar with the construct, but using it was also a potential source of bugs if A could ever be False while C was True (in such cases, B would be returned from the expression instead of A).
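
A concrete example of the pitfall (with hypothetical values):

A, B, C = "", "fallback", True
x = C and A or B    # evaluates to "fallback", because A ("") is false
x = A if C else B   # evaluates to "" as intended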

Except clause
except Exception as ex:
instead of:
except Exception, ex:
Another example of changing the syntax to reduce the potential for non-obvious bugs (in this case, except clauses like `except TypeError, AttributeError:` that would never actually catch AttributeError, and would instead locally rebind the name AttributeError to the caught TypeError instance).
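
A short Python 2 illustration of that trap:

try:
    len(None)                         # raises TypeError
except TypeError, AttributeError:     # looks like it catches both types...
    # ...but actually catches only TypeError, binding the caught
    # instance to the name AttributeError in this scope
    pass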

Bye-bye Blogilo

OK, when a blogging app can't figure out my blog identity automatically and crashes every time I submit a post (but after submitting the post to blogger), 'tis clearly not the app for me.

I'm just happy the first 3 posts didn't properly include the 'python' tag, so at least Planet Python shouldn't have been spammed with any noise.

Back to the in-browser editor for now...

To Pycon and beyond...

All these Planet Python posts about interesting talks and info at Pycon finally tipped me over the edge into making the trek across the Pacific to meet some of these people I've been working with online for the past half-dozen years or so.

With 3.3 still 18-24 months away, we should be able to get a pretty good road map thrashed out for ideas we want to explore for possible inclusion. Some face-to-face discussions will be especially handy for me, given the things I'd like to see sorted out: module aliasing to clean up __main__ handling once and for all, bringing back implicit context managers now that we have more collective experience with explicit ones, an alternative to PEP 377 that will allow context managers to do some additional setup inside the scope of the try block, and clarifying the semantic questions raised by discrepancies between the PEP 3118 buffer API spec and its implementation.

I still have some paperwork to sort out once my renewed passport arrives, but aside from that, the trip is good to go. I did stuff my travel dates up a bit and will have a day to kill in Atlanta on the 9th, but I'm sure I'll be able to figure out something interesting to do :)

Linking sites in blog posts

Call me paranoid, but the idea of trusting a blogging app with my Google account details really doesn't appeal to me. So, "BlogThis!" on the links bar it is.

It would be nice if BlogThis! popped up the full Blogger editor instead of a partial one (missing features like editing the post tags), but using it to save pre-linked drafts should be more than adequate for those occasions when I'm commenting on a link rather than writing something from scratch.

Test: editing an existing post...

Comments update and site to-do list

Having used DISQUS elsewhere as a commenter and not being a great fan of the default Blogger comment system, I've configured the blog to use DISQUS instead.

I've also asked to have the blog's python related posts added to the Planet Python feed, so we'll see how that pans out.

The main thing I'm still not entirely happy with is the site colour scheme - while I'm still a fan of the whole light-text-on-dark-background style, the contrast between clicked links and the current black background really isn't significant enough. I'll probably tinker with that a bit over the next few days.

Some goals for Python 3.3

With 3.2 nearly out the door, it's time to think seriously about goals for Python 3.3 and anything else I'd like to get done on the Python front this year. This post will serve as a to-do list of sorts.

PEP 1 Update
When cleaning up PEP 0 to clear out some of the accumulated cruft in the lists of Meta and Informational PEPs, I ran into a problem where the API specification PEPs use the "Final" state to indicate when consensus has been reached and the API has been locked in. This conflicts with the normal use of the Final state to indicate that a PEP is over and done with, and is only being kept around for historical reasons.

A brief discussion on python-dev suggested "Consensus" as a new end state for these PEPs. I like that solution, but adopting it requires an update to PEP 1. I'd like to get to that sometime this year.

PEP 343 and 377, redux
There are a couple of rough edges on the with statement and the associated context management protocol that still bother me.

Firstly, the fact that there is no way for a context manager to skip the body of a with statement means certain constructs simply can't be factored out properly. I previously tried to address this with PEP 377, but that solution was rightly rejected as having too great an impact on common cases which didn't need the extra complexity. I have since thought of an alternative approach that is both more flexible and has a much lower impact on ordinary cases, so it has a higher chance of acceptance.

Secondly, I'd like to revisit the idea of implicit context managers. These were dropped from PEP 343 due largely to terminology problems - we weren't sure whether the term "context manager" referred to objects with enter and exit methods, or to the objects that were able to create such an object on demand. With the meaning of context manager now well established, I believe it should be possible to implement and document this in a way that makes intuitive sense, while making it significantly easier to write context managers that are both stateful and reusable.
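
As a rough sketch of that distinction (all names here are hypothetical): SharedCounter below is not itself a context manager, but it can create a fresh, single-use one on demand, which is what makes a stateful yet reusable design straightforward:

import threading
from contextlib import contextmanager

class SharedCounter(object):
    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    @contextmanager
    def incrementing(self):
        # Each call creates a fresh, single-use context manager,
        # so the object as a whole stays safely reusable
        with self._lock:
            yield self
            self.count += 1

counter = SharedCounter()
with counter.incrementing():
    pass   # count is updated on exit, under the lock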

That's my __name__, don't wear it out
As per a recent python-ideas discussion, __name__ currently serves multiple masters, which leads to conflicts in certain situations (with __name__ set to a value that is correct for some purposes, but wrong for others). This is especially prevalent with the __main__ module, but can also apply to pseudo-packages, where something is documented as a single unified namespace, but is actually implemented as multiple files combined into a package.

For Python 3.3, I'd like to have a mechanism in place to start sorting this out without breaking every Python script on the planet that relies on the "if __name__ == '__main__':" idiom.

Other PEPs (e.g. PEP 380, 393)
There are a few other PEPs that will hopefully be landing for 3.3, including the subgenerator and memory efficient string PEPs. While I probably won't take much of an active hand in implementing those, there will still be plenty of related python-dev discussion and checkins to review.

And on a completely non-code-related front... with any luck I'll be able to find myself a more directly open source focused job this year. I have the luxury of being fussy in my choice of employment though, so I can happily sit back and relax while waiting to see how that pans out :)