Of Python and Road Maps (or the lack thereof)

I gave my first ever full conference talk at PyCon AU this year, on the topic of how Python evolves over time.

One notable absence from that talk is a discussion of any kind of specific 'road map' for the language as a whole. There's a reason for that: no such document exists. Guido van Rossum (as the language creator and Benevolent Dictator for Life) is the only person who could write an authoritative one, and he's generally been happy to sit back and let things evolve organically since guiding the initial phases of the Py3k transition (which is now ticking along quite nicely, with a promising outlook for 2.x being relegated to 'supported-but-now-legacy' status within the 2013/14 time frame). PEP 398, the release PEP for Python 3.3, comes close to qualifying, but is really just a list of PEPs and features, with no context for why those ideas are even on the list.

However, despite that, it seems worthwhile to make the effort to try to write a descriptive road map for where I personally see current efforts going, and where I think the Python core has a useful role to play. Looking at the Python Enhancement Proposal index for the lists of Accepted and Open PEPs can be informative, but it requires an experienced eye to know which proposals are currently being actively championed and have a decent chance of being accepted and included in a CPython release.

When reading the following, please keep in mind that the overall Python community is much bigger than the core development team, so there are plenty of other things going on out there. Others may also have differing opinions on which changes are most likely to come to fruition within the next few years.

Always in Motion is the Future

Unicode is coming. Are you ready?

The Python 2.x series lets many developers sit comfortably in a pure ASCII world until the day they have to debug a bizarre error caused by their data stream being polluted with non-ASCII data.

That is only going to be a comfortable place to be for so long. More and more components of the computing environment are migrating from ASCII to Unicode, including filesystems, some network protocols, user interface elements, other languages and various other operating system interfaces. With the modern ease of international travel and the rise of the internet, even comfortable "we only speak ASCII in this office, dammit!" environments are eventually going to have to face employees and customers who want to use their own names in their own native, non-Latin character sets rather than being constrained by an encoding invented specifically to handle English.

The Python 3.x series was created largely in response to that environmental transition - in many parts of the world where English isn't the primary spoken language, needing to deal with text in multiple disparate character encodings is already the norm, and having to deal regularly with non-ASCII data is only going to become more common even in English dominant areas.

However, supporting Unicode correctly on this scale is something of a learning experience for the core development team as well. Several parts of the standard library (most notably the email package and wsgiref module) really only adjusted to the new Unicode order of things in the recent 3.2 release, and (in my opinion) we've already made at least one design mistake in leaving various methods on bytes and bytearray objects that assume the underlying data is text encoded as ASCII.
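To illustrate the kind of trap those methods create, here's a minimal sketch: the text-style methods on bytes apply ASCII-only rules regardless of what the bytes actually contain.

    # bytes/bytearray retain text-style methods that quietly assume the
    # underlying data is ASCII-encoded text
    data = "café".encode("utf-8")   # b'caf\xc3\xa9'
    print(data.upper())             # b'CAF\xc3\xa9' - the ASCII letters are
                                    # uppercased, the UTF-8 bytes for 'é'
                                    # are silently left untouched
    print(data.isalpha())           # False, even though the decoded text
                                    # is entirely alphabetic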

The conversion to Unicode everywhere also came at a measurable cost in reduced speed and increased memory usage within the CPython interpreter. PEP 393 ("Flexible String Representation") is an attempt to recover some of that ground by using a smarter internal representation for string objects.
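On an interpreter that implements the PEP, the effect should be easy to observe: per-character storage will depend on the widest code point a string contains, rather than always paying for the worst case.

    import sys

    # Under PEP 393, ASCII/latin-1 text needs 1 byte per character,
    # most other text 2, and astral (non-BMP) text 4
    for s in ("a" * 100, "\u20ac" * 100, "\U0001f600" * 100):
        print(len(s), sys.getsizeof(s))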

Easier and more reliable distribution of Python code

Tools to package and distribute Python code have been around in various stages of usability since before the creation of distutils. Currently, there is a concerted effort by the developers of tools like pip, virtualenv, distutils2, distribute and the Python Package Index to make it easier to create and distribute packages, to provide better metadata about packages (without requiring execution of arbitrary Python code), to install packages published by others, and to play nicely with distribution level package management techniques.

The new 'packaging' module that will arrive in 3.3 (and will be backported to 3.2 and 2.x under the 'distutils2' name) is a visible sign of that, as are several Python Enhancement Proposals related to packaging metadata standards. These days, the CPython interpreter also supports direct execution of code from a variety of locations: individual scripts, modules and packages identified via the -m switch, and directories or zip archives containing a __main__.py file.
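As a sketch of the last of those options, the runpy module exposes the same machinery programmatically (the file and archive names here are purely illustrative):

    import os
    import runpy
    import tempfile
    import zipfile

    # Build a tiny executable archive (equivalent to "python app.zip")
    tmpdir = tempfile.mkdtemp()
    archive = os.path.join(tmpdir, "app.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("__main__.py", "print('Hello from a zip archive')")

    # runpy.run_path handles scripts, directories and zip archives alike
    runpy.run_path(archive, run_name="__main__")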

There's a significant social and educational component to this effort in addition to the technical underpinnings, so this is going to take some time to resolve itself. However, the eventual end goal of a Python packaging ecosystem that integrates nicely with the various operating system software distribution mechanisms without excessive duplication of effort by developers and software packagers is a worthy one.

Enhanced support for various concurrency mechanisms

The addition of the 'concurrent' package, and its sole member 'concurrent.futures', to the Python 3.2 standard library can be seen as a statement of intent to provide improved support for a variety of concurrency models within the standard library.
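As a minimal sketch of the new API (the URLs are purely illustrative), the executor interface makes it easy to push blocking calls out to a pool of worker threads:

    from concurrent.futures import ThreadPoolExecutor, as_completed
    import urllib.request

    URLS = ["http://www.python.org/", "http://pypi.python.org/"]

    def fetch(url):
        # Blocking I/O call, run in a worker thread by the executor
        return url, len(urllib.request.urlopen(url).read())

    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(fetch, url) for url in URLS]
        for future in as_completed(futures):
            url, nbytes = future.result()
            print(url, nbytes)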

PEP 3153 ("Asynchronous IO support") is an early draft of an API standardisation PEP that will hopefully provide a basis for interoperability between components of the various Python asynchronous frameworks. It's a little hard to follow for non-Twisted developers at this stage, since there's no specific proposed stdlib API, nor any examples showing how it would allow things like plugging a Twisted protocol or transport into the gevent event loop, or vice-versa.

While developers are still ultimately responsible for understanding the interactions between threading and multiprocessing on their platform when using those modules, there are things CPython could be doing to make it easier on them. Jesse Noller, the current maintainer of the multiprocessing package, is looking for someone to take over the responsibility for coordinating the related development efforts (between cross-platform issues and the vagaries of os.fork interactions with threading, this is a non-trivial task, even with the expertise already available amongst the existing core developers).

PEP 3143 ("Standard daemon process library") may be given another look for 3.3, although it isn't clear if sufficient benefit would be gained from having that module in the stdlib rather than published via PyPI as it is currently.

Providing a better native Python experience on Windows

Windows is designed and built around a mindset that expects most end users to be good little consumers who buy software from developers rather than writing it themselves. Anyone writing software is thus assumed to be doing so for a living, and hence paying for professional development tools. Getting a decent development environment set up for free can be so tedious and painful that many developers (like me!) would rather just install and use Linux for development on our own time, even if we're happy enough to use Windows as a consumer (again, like me - mostly for games in my case). The "BUNDLE ALL THE THINGS!" approach that stands in for real dependency management and the lack of any kind of consistent software updating interface just make this worse, as does the fact that any platform API handling almost always has to special-case Windows, since its libc implementation is one of the more practically useless things on the face of this Earth.

This creates a vicious cycle in open source, where most of the active developers are using either Linux or Mac OS X, so their tools and development processes are focused on those platforms. They're also typically very command line oriented, since the diverse range of graphical environments and the widespread usage of remote terminal sessions make command line utilities the most reliable and portable option. Quite a few new developers coming into the fold run into the Windows roadblocks and either give up or switch to Linux for development (e.g. by grabbing VirtualBox and running a preconfigured Linux VM to avoid any hardware driver issues), thus perpetuating the cycle.

Personally, I blame the platform's hostility towards cross-platform hobbyist developers on the Windows ecosystem itself, and generally refuse to write code on it unless someone's paying me to do so. Fortunately for single-platform Windows users and developers, though, there are other open source devs that don't feel the same way. Some of them are working on tools (such as the script launcher described in PEP 397) that will make it easier to get usable development environments set up for new Python users.

Miscellaneous API improvements in the standard library

There are some areas of the standard library that are notoriously clumsy to use. HTTP/HTTPS comes to mind, as does date/time manipulation and filesystem access.

It's hard to find modules that strike the sweet spot of being well tested, well documented and proven in real world usage, with mature, stable APIs that aren't overly baroque and complicated (yes, yes, I realise the irony of that criticism given the APIs I'm talking about replacing or supplementing). It's even harder to find such modules in the hands of maintainers that are willing to consider supporting them as part of Python's standard library (potentially with backports to older versions of Python) rather than as standalone modules.

PEP 3151 ("Reworking the OS and IO exception hierarchy") is an effort to make filesystem and OS related exceptions easier to generate and handle correctly that is almost certain to appear in Python 3.3 in one form or another.

In the HTTP/HTTPS space, I have some hopes regarding Kenneth Reitz's 'requests' module, but it's going to take more real world experience with it and stabilisation of the API before that idea can be given any serious consideration.
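To show why it's appealing, compare the stdlib incantation with the requests equivalent (the requests API shown here is as published at the time of writing, so treat it as a sketch rather than gospel):

    # Standard library version
    import urllib.request
    response = urllib.request.urlopen("http://example.com/")
    body = response.read()

    # Third-party 'requests' version
    import requests
    response = requests.get("http://example.com/")
    body = response.content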

The regex module (available on PyPI) is also often kicked around as a possible candidate for addition. Given the benefits it offers over and above the existing re module, it wouldn't surprise me to see a genuine push for its inclusion by default in 3.3 or some later version.
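For instance, regex supports Unicode character properties, which the re module can't express directly (a minimal sketch, assuming the module is installed from PyPI):

    import regex

    # \p{Greek} matches characters with the Unicode 'Greek' script property
    text = "alpha \u03b1\u03b2\u03b3 beta"
    print(regex.findall(r"\p{Greek}+", text))   # ['αβγ']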

A basic stats library in the stdlib (to help keep people from implementing their own variants, usually badly) is also a possibility, but again, depends on a suitable candidate being identified and put forward by people willing to maintain it.
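"Usually badly" isn't an idle jab, either; for example, the textbook one-pass variance formula falls apart on floating point data with a large common offset:

    # The "textbook" one-pass variance formula is numerically unstable:
    def naive_variance(xs):
        n = len(xs)
        return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / n

    # The true population variance of this data is 22.5, but catastrophic
    # cancellation between the two large sums gives a wildly wrong answer
    data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
    print(naive_variance(data))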

The RFC treadmill

Many parts of the Python standard library implement various RFCs and other standards. While our release cycle isn't fast enough to track the newer, still evolving protocols, it's good to provide native support for the latest versions of mature, widely supported ones.

For example, support for sendmsg(), recvmsg() and recvmsg_into() methods will be available on socket objects in 3.3 (at least on POSIX-compatible platforms). Additions like PEP 3144 ("IP Address Manipulation Library for the Python Standard Library") are also a possibility (although that particular example currently appears to be lacking a champion).
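Once 3.3 is out, sendmsg() will allow things like a gather write from multiple buffers in a single call; a minimal sketch using a local socket pair (POSIX only):

    import socket

    # sendmsg()/recvmsg() support scatter/gather I/O and ancillary data;
    # here sendmsg() gathers two buffers into a single write
    a, b = socket.socketpair()
    a.sendmsg([b"Hello, ", b"world"])
    data, ancdata, msg_flags, address = b.recvmsg(64)
    print(data)   # b'Hello, world'
    a.close()
    b.close()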

Defining Python-specific interoperability standards

Courtesy of the PEP process, python-dev often plays a role in defining and documenting interfaces that allow various Python frameworks to play nicely together, even if third party libraries are needed to make full use of them. WSGI, the Web Server Gateway Interface, is probably the most well-known example of this (the current version of that standard is documented in PEP 3333). Other examples include the database API, cryptographic algorithm APIs, and, of course, the assorted standards relating to packaging and distribution.

PEP 3333 was a minimalist update to WSGI to make it Python 3 compatible, so, as of 3.2, it's feasible for web frameworks to start considering Python 3 releases (previously, such releases would have been rather challenging, since it was unclear how they should talk to the underlying web server). It likely isn't the final word on the topic though, as the web-sig folks still kick around ideas like PEP 444 and WSGI Lite. Whether anything actually happens on that front, or we keep chugging along with an attitude of "eh, WSGI gets the job done, quit messing with it", is an open question (and far from my area of expertise).
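For reference, a complete PEP 3333 application is tiny - native strings for the status and headers, bytes for the body:

    from wsgiref.simple_server import make_server

    def application(environ, start_response):
        # PEP 3333: status and headers are native strings, the body is bytes
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Hello from a PEP 3333 app\n"]

    if __name__ == "__main__":
        # wsgiref's reference server is fine for experimentation
        make_server("localhost", 8000, application).serve_forever()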

One area that is being actively worked on, and will hopefully improve significantly in Python 3.3, is the low level buffer interface protocol defined by PEP 3118. It turned out that there were a few flaws in the way that PEP was implemented in CPython, so even though it did achieve the basic goal of allowing projects like NumPy and PIL to interoperate without needing to copy large data buffers around, there are still some rather rough edges in the way the protocol is exposed to Python code via memoryview objects, as well as in the respective obligations of code that produces and consumes these buffers. The relevant discussions can be found on the issue tracker if you're interested in the gory details of defining conventions for reliable management of shared dynamically allocated memory in C and C++ code :)
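At the Python level, memoryview objects are the visible face of the protocol; the key property is that slicing and mutation operate directly on the producer's memory, without copying:

    # memoryview exposes a PEP 3118 buffer to Python code without copying
    buf = bytearray(b"hello world")
    view = memoryview(buf)

    view[0:5] = b"HELLO"        # writes through to the underlying bytearray
    print(buf)                  # bytearray(b'HELLO world')

    # Slicing a memoryview still doesn't copy the underlying data
    tail = view[6:]
    print(bytes(tail))          # b'world'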

Enhancing and refining the import system

The importer protocol defined by PEP 302 was an ambitious attempt to decouple Python's import semantics from the underlying filesystem. It never fully succeeded - there's still a lot of Python code (including one or two components in the standard library!) that assumes the classical directories-on-a-disk layout for Python packages. That package layout, with the requirement for __init__.py files and restriction to a single directory for each package, is itself surprising and unintuitive for many Python novices.
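For the curious, the finder half of the PEP 302 protocol is small; a do-nothing meta path hook that merely logs import attempts before declining them looks like this (a sketch, not a recommendation):

    import sys

    class ImportLogger:
        """PEP 302 meta path hook that logs lookups, then declines them."""
        def find_module(self, fullname, path=None):
            print("import attempt:", fullname)
            return None   # returning None defers to the rest of the machinery

    sys.meta_path.insert(0, ImportLogger())
    import json   # triggers the hook before the normal filesystem search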

There are a few suggestions in the works ultimately aimed not only at cleaning up how all this is implemented, but also at further decoupling the module hierarchy from the on-disk file layout. There are significant challenges in doing this in a way that makes life easier for developers writing new packages, while also allowing those writing tools that manipulate packages to more easily do the right thing, but it's an area that can definitely do with some rationalisation.

Improved tools for factoring out and reusing code

Python 3.3 will at least bring with it PEP 380's "yield from" expression, which makes it easy to split part of a generator out into a subgenerator without affecting the overall semantics of the original generator (doing this correctly in the general case is currently significantly harder than you might think - see the PEP for details).
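A small sketch of the new expression: the delegating generator transparently forwards iteration to the subgenerator and captures its return value.

    # Requires Python 3.3's "yield from" (PEP 380)
    def subgen():
        yield 1
        yield 2
        return "subgen result"   # returned to the delegator via StopIteration

    def delegator():
        result = yield from subgen()   # forwards next()/send()/throw()/close()
        yield "got: " + result

    print(list(delegator()))   # [1, 2, 'got: subgen result']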

I suspect that the next few years may bring some more tweaks to generators, generator expressions and context managers to make for loops and with statements even more powerful utilities for factoring out code. However, any adjustments in this area will be carefully balanced against the need for language stability and keeping things reasonably easy to learn.

Not Even on the Radar

In addition to the various suggestions mentioned in PEP 3099 ("Things that Will Not Change in Python 3000"), there are a couple of specific items I think are worth calling out as explicitly not being part of any realistic road map:

Major internal architectural changes for CPython

Questions like 'Why not remove the GIL?', 'Why not switch to a register based VM?', 'Why not add a JIT to CPython?' and 'Why not make PyPy the reference interpreter?' don't really have straightforward obvious answers other than "that's harder than you might think and probably less beneficial than you might hope", so I expect people to continue asking them indefinitely. However, I also don't see any significant changes coming on any of these fronts any time soon.

One of the key advantages of the CPython interpreter as a reference implementation is that it is, fundamentally, quite a simple beast (despite a few highly sophisticated corners). If we can do something in CPython, chances are that the other implementations are also going to be able to support it. By contrast, the translation stage in their toolchain means that PyPy can contemplate features, like the existing JIT or their current exploration of Software Transactional Memory as a way of usefully removing their Global Interpreter Lock, that simply aren't feasible with less sophisticated tools (at least, not without a much bigger development team).

Personally, I think the status quo in this space is in a pretty good place: python-dev and CPython handle the evolution of the language specification itself, as well as providing an implementation that will work reasonably well on almost any platform with a C compiler (and preferably some level of POSIX compliance), while the PyPy crew focus on providing a fast, customisable implementation for the major platforms, without getting distracted by arguments about possible new language features.

More feature backports to the 2.x series

Aside from the serious backwards compatibility problems accompanying the Unicode transition, the Py3k transition was also strongly influenced by the concept of paying down technical debt. Having legacy cruft lying around made it harder to introduce language improvements, since the additional interactions created more complexity to deal with, to the point where people just didn't want to bother any more.

By clearing out a bunch of that legacy baggage, we created a better platform for new improvements, like more pervasive (and efficient!) Unicode support, even richer metaclass capabilities, exception chaining, more intuitive division semantics and so on.

The cost, borne mostly by existing Python developers, is a long, slow transition over several years as people deal with the process of not only checking whether the automated tools can correctly handle their own code, but also waiting for all of their dependencies to be available on the updated platform. This actually seems to be going fairly well so far, even though people can be quite vocal in expressing their impatience with the current rate of progress.

Now, all of the major Python implementations are open source, so it's certainly possible for motivated developers to fork one of those implementations and start backporting features of interest from 3.x that aren't already available as downloads from PyPI (e.g. it's easy to download unittest2 to get access to 3.x unittest enhancements, so there's no reason to fork just for that kind of backport). However, anyone doing so will be taking up the task in full knowledge of the fact that the existing CPython development team found that process annoying and tedious enough that we got tired of doing it after two releases (2.6 and 2.7).

So, while I definitely expect a certain level of ongoing griping about this point, I don't expect it to rise to the level of anyone actually doing the work to backport things like function annotations or exception chaining to the 2.x series.

Update 1: Added missing section about RFC support
Update 2: Fix sentence about role of technical debt in Py3k transition
Update 3: Added a link to a relevant post from Guido regarding the challenges of maintaining official road maps vs one-off "where are we now?" snapshots in time (like this post)
Update 4: Added missing section on defining Python-specific standards
