
Open Source, Windows and Teaching Python to New Developers

A few questions and incidents recently prompted me to reflect on why I don't help with CPython support on Windows, even though I use Windows happily enough on my gaming system. Since this ended up being a rather pro-Linux article and upfront disclosure is a good thing, I'll note that while I do work for Red Hat now, that's a very recent thing - my adoption of Linux as my preferred development platform dates back to 2004 or so. I work for Red Hat because I like Linux, not the other way around :)

The Availability of Professional Development Tools

I don't make a secret of my dislike of Windows as a hobbyist development platform. While Microsoft have improved things in recent years (primarily by releasing the Express editions of Visual Studio), there's still a huge difference between an operating system like GNU/Linux, which was built by developers for developers based on a foundation that was built by academics for academics, and Windows, which was built by a company that used deals with computer manufacturers to get it into end users' hands regardless of technical merit. Developers were forced to follow in order to reach that large installed user base. Those different histories are reflected in the different development cultures that surround the respective platforms.

To get the same tool chain that professional Linux companies use, you don't need to do anything special - Linux distributions include the tools used to create them. If you have a distribution, you have everything you need to build applications for that distribution, including documentation. With the open source nature of the platform and almost all of the software (the occasional binary driver notwithstanding), there's a vast range of tools out there to help you get things done (although sorting through the mass can be a little tricky sometimes, since it can be hard to tell the difference between stuff that doesn't exist and stuff that exists, but hasn't been uncovered by your research).

As far as I'm aware, Mac OS X isn't quite as generous with freely available development utilities, but isn't all that far off the Linux approach (I'm not a Mac user or developer though, so there may be more hurdles than I am aware of - I recall some muttering about Apple beginning to charge a small fee for Xcode. My opinion is based mostly on the fact that it seems pretty easy to find open source devs that use Macs). With the POSIX-ish underpinnings, many of the utilities from the *nix world also work in this environment.

The minimum realistic standard for professional Windows development, though, is an MSDN subscription (to get full access to the OS documentation and various utilities), along with a professional copy of Visual Studio. The tools available for free (including the Express editions of Visual Studio) are clearly second rate. Even when the tools themselves are OK, the licensing restrictions on the applications they create may make them practically useless (and MS have the gall to call the GPL viral - at least the gcc team don't restrict how you license and distribute the binaries it creates). So why should a hobbyist develop for a system that thinks they should pay substantial sums for the privilege of developing for it, instead of one that welcomes all contributors, providing not only the end product, but the ingredients and recipes all for free?

At the recent PyConAU sprints, one of the contributors (an existing Linux user that happened to have a Windows only laptop with them) became frustrated with getting all the necessary tools set up to work properly on Windows (configuring git+ssh for read/write access to a GitHub repo was one key point of irritation), and decided to dual boot Ubuntu on the machine instead. Twenty minutes later, she was up and running and hacking on the project she originally wanted to hack on. Granted, she already knew how to use Linux, but seriously, there's something fundamentally wrong with a platform when installing and dual-booting to a different OS is the easiest way to get a decent development environment up and running.

All that ends up putting cross-platform languages like Python in an interesting position: when developing with Python, you can often get away with not understanding the underlying details of your operating system, because the language runtime tries to provide a largely standardised interface on all platforms. However, many open source developers either don't use Windows at all, or genuinely dislike programming for it, so the burden of making things work properly on Windows falls on the shoulders of a comparatively small number of people, either those who genuinely like programming for the platform (yes, such people exist, I'm just not one of them), or those that are looking for any niche where they can usefully contribute and are happy enough to take on the task of improving Windows compatibility and support.

I don't have particularly hard numbers to back this up (other than the skew in core developer numbers vs overall OS popularity), but my intuition is that, at least for CPython, the user:core developer ratio is orders of magnitude higher for Windows than it is for Linux or Mac OS X.

The Implications for Teaching Python on Windows

Something cool that is going on at the moment is that a lot of folks are interested in the idea of teaching more people how to program with Python as the language used. However, the potential students (young and old) that they are wanting to teach often don't have any development experience at all and are using the most common consumer operating system (i.e. Windows). So good Windows support and an easy installation experience are important considerations for these instructors. A request that is frequently made (with varying levels of respect and politeness) is that the official python.org Windows installer be updated to automatically adjust the PATH (or at least provide the option to do so), so that Python can be launched from the command line by typing "python" instead of something like "C:\Python27\python".

If educators want that right now, their best bet is actually to direct their students towards the Windows versions of ActiveState's ActivePython Community Edition. ActiveState add a few things to the standard installer, like PATH manipulation and additional packages (such as pywin32). They also bundle PyPM, which is a decent tool for getting PyPI packages on to Windows machines (at least, I've heard good things about it - I haven't actually used it myself). (That said, I believe I may need to caveat that recommendation a bit: as near as I can tell from their website, PyPM has been deliberately disabled for their 64-bit Windows Community Edition installer. Still, even in that case, you can easily grab additional packages direct from PyPI via "pip install" on the command line)

Brian Curtin is working on adding optional PATH manipulation to the python.org installer for 3.3, and there's a chance such a change might be backported to the next maintenance releases for 3.2 and 2.7 (no promises, though). Even if it does make it in, it will still be a while before the change is part of a binary release (especially given that Brian has only just started tinkering with it).

This is clearly a nice thing for beginners, especially those that aren't in the habit of tinkering with their OS settings, but I do honestly wonder how much of a difference it will make in the long run. In many ways, software development is one long exercise in frustration. You decide you want to fix bug X. But it turns out bug X is really due to bug Y. You could work around Y just to fix X, but the bigger bug would still be there. But then you discover that fixing bug Y properly requires feature Z, which doesn't exist yet, so a workaround (even an ugly one) starts to sound pretty attractive. "Yak shaving" (the highly technical term for things that you're working on solely because they're a prerequisite for what you actually want to be working on) is so common it's almost the norm rather than the exception. The many and varied frustrations of trying to use Windows as a hobbyist open source developer also won't magically go away just because the python.org installer starts automating one environment variable update - as soon as people are introduced to sites like GitHub and BitBucket, they'll get to discover the joy that is SSH and source control on Windows. If they get past that hurdle, they'll likely start to encounter the multitude of open source projects that don't even offer Windows installers (if the project supports Windows at all), because their Windows developer count stands at a grand total of zero.

Final Thoughts

I hope the people teaching Python to beginners on Windows and the folks working on improving Windows support don't take this article as an attack on their efforts. I find both goals to be quite admirable, and wish those involved all the success they can find. But there are reasons I abandoned Windows as a personal development platform ~7 years ago and taught myself to use Linux instead. As far as I can tell, most of those reasons remain valid today, even after Microsoft started releasing the Express versions of Visual Studio in an attempt to stem the flood of hobbyist developers jumping ship.

The other day I called the relative lack of Windows developers in open source a vicious cycle and I stand by that. If someone can learn to program, mastering Linux is going to be comparatively easy. For anyone seriously interested in open source development, using Linux (even in a virtual machine, the way I do on my gaming laptop) is by far the path of least resistance. Getting more Windows developers in open source requires that people care sufficiently about Windows as a platform that they don't just switch to Linux, but care about open source enough to start contributing at all, and that seems to be a genuinely rare combination.

Scripting languages and suitable complexity

Steven Lott is a Python developer and blogger that I first came across via his prolific contributions to answering questions on Stack Overflow, and then later by reading his blog posts that appeared on Planet Python. His Building Skills in Python book is a resource I've recently started suggesting newcomers to Python take a look at to see if his style works for them.

Some time ago, he posted about what he called the "Curse of Procedural Design": the fact that, beyond a certain point, purely procedural code typically starts drowning in complexity and becomes an unmaintainable mess. Based on that, he then started questioning whether or not he was doing the right thing by teaching the procedural aspects of Python first and leaving the introduction of object oriented concepts until later in the book.

Personally, I think starting with procedural programming is still the right thing to do. However, one of the key motivations for object-oriented design is precisely that purely procedural programming doesn't scale well. Even in a purely procedural language, large programs almost always define essential data structures, along with functions that work on those data structures, effectively writing object oriented code by convention. Object support built into the language makes this easier, but it isn't essential (as large C projects like the Linux kernel and CPython itself demonstrate).
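
As a quick illustration of that convention (the names here are mine, not from Steven's book), here's procedural Python that is object oriented in everything but syntax:

    # A data structure, plus functions that operate on it: object
    # oriented programming by convention, with no class statement.
    def make_point(x, y):
        return {'x': x, 'y': y}

    def move(point, dx, dy):
        point['x'] += dx
        point['y'] += dy

    p = make_point(0, 0)
    move(p, 3, 4)
    print(p)  # {'x': 3, 'y': 4}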

Where languages like Java can be an issue as beginner languages is that, by requiring all code to be compiled in advance and written in an object oriented style, they set a minimum level of complexity for every program written in the language. Understanding even the most trivial program in Java requires that you grasp the concepts of a module, a class, an instance, a method and an expression. Using a compiled procedural language instead at least lets you simplify that a bit, as you only need to understand modules, functions and expressions.

But the scripting languages? By means of a read-eval-print loop in an interactive interpreter, they let you do things in the right order, starting with the minimum level of complexity possible: a single expression.
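
For instance, a complete first "program" can be a single expression typed at the interactive prompt (a trivial illustration of my own):

    >>> 6 * 7
    42
    >>> len('Hello, world!')
    13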

For convenience, you may then introduce the concept of a 'script' early, but with an appropriate editor, scripts can be run directly from within that application (giving an experience very similar to a REPL prompt) rather than worrying about command line invocation.

More sophisticated algorithms can then be introduced by discussing conditional execution and repetition (if statements and loops), but still without any need to make a distinction between "definition time" and "execution time".

Then, once the concept of algorithms has been covered, we can start to modularise blocks of execution as functions and introduce the idea that algorithms can be stored for use in multiple places so that "definition time" and "execution time" may be separated.
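
A sketch of how that separation might first be shown to students (the example itself is mine, not from any particular curriculum):

    # Definition time: the def statement runs here, creating the function.
    def greet(name):
        # Execution time: this body only runs when the function is called.
        return 'Hello, ' + name

    print(greet('world'))     # the stored algorithm, reused...
    print(greet('PyCon AU'))  # ...with different data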

Then we start to modularise data and the associated operations on that data as classes, and explore the ways that instances allow the same operations to readily be performed on different data sets.

Then we start to modularise collections of classes (and potentially data and standalone functions) as separate modules (and, in the case of Python, this can be a good time to introduce the idea of "compilation time" as separate from both "definition time" and "execution time").

Continuing up the complexity scale, modules may then be bundled into packages, and packages into frameworks and applications (introducing "build time" and "installation time" as two new potentially important phases in program execution).

A key part of the art of software design is learning how to choose an appropriate level of complexity for the problem at hand - when a problem calls for a simple script, throwing an entire custom application at it would be overkill. On the other hand, trying to write complex applications using only scripts and no higher level constructs will typically lead to an unmaintainable mess.

In my opinion, the primary reason that scripting languages are easier to learn for many people is that they permit you to start immediately with code that "does things", allowing the introduction of the "function" and "class" abstractions to be deferred until later.

Starting with C and Java, on the other hand, always requires instructors to say "Oh, don't worry about that boilerplate, you'll learn what it means later" before starting in with the explanation of what can go inside a main() function or method. The "compilation time" vs "execution time" distinction also has to be introduced immediately, rather than being deferred until some later point in the material. There's also the fact that such languages are usually at least two languages in one: the top level "compile time" language that you use to define your data structures, functions and modules and the "run time" language that you actually use inside functions and methods to get work done. Scripting languages don't generally have that distinction - the top level language is the same as the language used inside functions (in fact, that's my main criterion for whether or not I consider a language to be a scripting language in the first place).
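
That "one language" property is easy to demonstrate: in Python, def is just an ordinary runtime statement, so the following (contrived, with illustrative names) example works exactly as you'd guess, while the equivalent conditional definition isn't even expressible in the top level language of C or Java:

    import os

    # def can appear anywhere any other statement can - the top level
    # language and the language inside functions are one and the same.
    if os.name == 'posix':
        def config_dir():
            return os.path.expanduser('~/.myapp')  # 'myapp' is illustrative
    else:
        def config_dir():
            return os.path.join(os.environ.get('APPDATA', ''), 'myapp')

    print(config_dir())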

Of Python and Road Maps (or the lack thereof)

I gave my first ever full conference talk at PyCon AU this year, on the topic of how Python evolves over time.

One notable absence from that talk is a discussion of any kind of specific 'road map' for the language as a whole. There's a reason for that: no such document exists. Guido van Rossum (as the language creator and Benevolent Dictator for Life) is the only person who could write an authoritative one, and he's generally been happy to sit back and let things evolve organically since guiding the initial phases of the Py3k transition (which is now ticking along quite nicely, with a promising outlook for 2.x being relegated to 'supported-but-now-legacy' status within the 2013/14 time frame). PEP 398, the release PEP for Python 3.3, comes close to qualifying, but is really just a list of PEPs and features, with no context for why those ideas are even on the list.

However, despite that, it seems worthwhile to attempt a descriptive road map covering where I personally see current efforts going, and where I think the Python core has a useful role to play. Looking at the Python Enhancement Proposal index for the lists of Accepted and Open PEPs can be informative, but it requires an experienced eye to know which proposals are currently being actively championed and have a decent chance of being accepted and included in a CPython release.

When reading the following, please keep in mind that the overall Python community is much bigger than the core development team, so there are plenty of other things going on out there. Others may also have differing opinions on which changes are most likely to come to fruition within the next few years.

Always in Motion is the Future

Unicode is coming. Are you ready?

The Python 2.x series lets many developers sit comfortably in a pure ASCII world until the day when they have to debug a bizarre error caused by their data stream being polluted with non-ASCII data.
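
For anyone who hasn't been bitten by it themselves, that bizarre error usually looks something like this Python 2 transcript (the particular byte values are illustrative):

    >>> data = 'caf\xc3\xa9'   # UTF-8 encoded bytes sneaking in as a 2.x str
    >>> data + u' au lait'     # mixing with unicode triggers an implicit ASCII decode
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)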

That is only going to be a comfortable place to be for so long. More and more components of the computing environment are migrating from ASCII to Unicode, including filesystems, some network protocols, user interface elements, other languages and various other operating system interfaces. With the modern ease of international travel and the rise of the internet, even comfortable "we only speak ASCII in this office, dammit!" environments are eventually going to have to face employees and customers that want to use their own names in their own native, non-Latin character sets rather than being constrained by a protocol invented specifically to handle English.

The Python 3.x series was created largely in response to that environmental transition - in many parts of the world where English isn't the primary spoken language, needing to deal with text in multiple disparate character encodings is already the norm, and having to deal regularly with non-ASCII data is only going to become more common even in English dominant areas.

However, supporting Unicode correctly on this scale is something of a learning experience for the core development team as well. Several parts of the standard library (most notably the email package and wsgiref module) really only adjusted to the new Unicode order of things in the recent 3.2 release, and (in my opinion) we've already made at least one design mistake in leaving various methods on bytes and bytearray objects that assume the underlying data is text encoded as ASCII.
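
To see the design wart in question, note that these methods treat binary data as ASCII text regardless of what the data actually contains:

    # Python 3: text-oriented methods on binary data quietly assume the
    # underlying bytes are ASCII encoded text.
    data = 'café'.encode('utf-8')
    print(data)          # b'caf\xc3\xa9'
    print(data.upper())  # b'CAF\xc3\xa9' - the non-ASCII bytes are left alone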

The conversion to Unicode everywhere also came at a measurable cost in reduced speed and increased memory usage within the CPython interpreter. PEP 393 ("Flexible String Representation") is an attempt to recover some of that ground by using a smarter internal representation for string objects.
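
A rough sketch of what the PEP 393 approach means in practice (the exact byte counts will vary by platform, and this behaviour only exists once the PEP is actually implemented, i.e. on CPython 3.3 or later):

    import sys

    # Under PEP 393, per-character storage depends on the widest code
    # point present in the string, rather than on a build-time compromise.
    print(sys.getsizeof('a' * 100))           # 1 byte per character
    print(sys.getsizeof('\u20ac' * 100))      # 2 bytes per character
    print(sys.getsizeof('\U0001F600' * 100))  # 4 bytes per character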

Easier and more reliable distribution of Python code

Tools to package and distribute Python code have been around in various stages of usability since before the creation of distutils. Currently, there is a concerted effort by the developers of tools like pip, virtualenv, distutils2, distribute and the Python Package Index to make it easier to create and distribute packages, to provide better metadata about packages (without requiring execution of arbitrary Python code), to install packages published by others, and to play nicely with distribution level package management techniques.

The new 'packaging' module that will arrive in 3.3 (and will be backported to 3.2 and 2.x under the 'distutils2' name) is a visible sign of that, as are several Python Enhancement Proposals related to packaging metadata standards. These days, the CPython interpreter also supports direct execution of code from a variety of locations, not just individual source files (named modules, as well as directories and zip archives containing a __main__.py, can all be executed directly).
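
The runpy module exposes the machinery behind most of those execution paths; here's a small demonstration of directory execution of my own devising:

    import os
    import runpy
    import shutil
    import tempfile

    # "python some_dir" looks for __main__.py inside the directory;
    # runpy.run_path provides the same behaviour programmatically.
    tmp = tempfile.mkdtemp()
    with open(os.path.join(tmp, '__main__.py'), 'w') as f:
        f.write('print("Hello from " + __name__)')
    runpy.run_path(tmp, run_name='__main__')  # Hello from __main__
    shutil.rmtree(tmp)  # tidy up the scratch directory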

There's a significant social and educational component to this effort in addition to the technical underpinnings, so this is going to take some time to resolve itself. However, the eventual end goal of a Python packaging ecosystem that integrates nicely with the various operating system software distribution mechanisms without excessive duplication of effort by developers and software packagers is a worthy one.

Enhanced support for various concurrency mechanisms

The addition of the 'concurrent' package, and its sole member, 'concurrent.futures' to the Python 3.2 standard library can be seen as a statement of intent to provide improved support for a variety of concurrency models within the standard library.
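
A minimal sketch of the new API (the task function is just a stand-in for real work):

    # concurrent.futures (Python 3.2+): a common interface over thread
    # and process based execution.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def task(n):
        return n * n

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(task, i) for i in range(10)]
        for future in as_completed(futures):
            print(future.result())

The same submit/result interface works with ProcessPoolExecutor, which is much of the point: callers can switch between threads and processes without rewriting their coordination code.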

PEP 3153 ("Asynchronous IO support") is an early draft of an API standardisation PEP that will hopefully provide a basis for interoperability between components of the various Python asynchronous frameworks. It's a little hard to follow for non-Twisted developers at this stage, since there's no specific proposed stdlib API, nor any examples showing how it would allow things like plugging a Twisted protocol or transport into the gevent event loop, or vice-versa.

While developers are still ultimately responsible for understanding the interactions between threading and multiprocessing on their platform when using those modules, there are things CPython could be doing to make it easier on them. Jesse Noller, the current maintainer of the multiprocessing package, is looking for someone to take over the responsibility for coordinating the related development efforts (between cross-platform issues and the vagaries of os.fork interactions with threading, this is a non-trivial task, even with the expertise already available amongst the existing core developers).

PEP 3143 ("Standard daemon process library") may be given another look for 3.3, although it isn't clear if sufficient benefit would be gained from having that module in the stdlib rather than published via PyPI as it is currently.

Providing a better native Python experience on Windows

Windows is designed and built on the basis of a mindset that expects most end users to be good little consumers that buy software from developers rather than writing it themselves. Anyone writing software is thus assumed to be doing so for a living, and hence paying for professional development tools. Getting a decent development environment set up for free can be so tedious and painful that many developers (like me!) would rather just install and use Linux for development on our own time, even if we're happy enough to use Windows as a consumer (again, like me - mostly for games in my case). The non-existent dependency management (which forces a "BUNDLE ALL THE THINGS!" approach on application developers) and the lack of any kind of consistent software updating interface just make this worse, as does the fact that any platform API handling almost always has to special-case Windows, since their libc implementation is one of the more practically useless things on the face of this Earth.

This creates a vicious cycle in open source, where most of the active developers are using either Linux or Mac OS X, so their tools and development processes are focused on those platforms. They're also typically very command line oriented, since the diverse range of graphical environments and the widespread usage of remote terminal sessions makes command line utilities the most reliable and portable. Quite a few new developers coming into the fold run into the Windows roadblocks and either give up or switch to Linux for development (e.g. by grabbing VirtualBox and running a preconfigured Linux VM to avoid any hardware driver issues), thus perpetuating the cycle.

Personally, I blame the platform's hostility towards cross-platform hobbyist developers on the Windows ecosystem itself and generally refuse to write code on it unless someone's paying me to do so. Fortunately for single-platform Windows users and developers, though, there are other open source devs that don't feel the same way. Some of them are working on tools (such as the script launcher described in PEP 397) that will make it easier to get usable development environments set up for new Python users.

Miscellaneous API improvements in the standard library

There are some areas of the standard library that are notoriously clumsy to use. HTTP/HTTPS comes to mind, as does date/time manipulation and filesystem access.

It's hard to find modules that strike the sweet spot of being well tested, well documented and proven in real world usage, with mature, stable APIs that aren't overly baroque and complicated (yes, yes, I realise the irony of that criticism given the APIs I'm talking about replacing or supplementing). It's even harder to find such modules in the hands of maintainers that are willing to consider supporting them as part of Python's standard library (potentially with backports to older versions of Python) rather than as standalone modules.

PEP 3151 ("Reworking the OS and IO exception hierarchy") is an effort to make filesystem and OS related exceptions easier to generate and handle correctly that is almost certain to appear in Python 3.3 in one form or another.

In the HTTP/HTTPS space, I have some hopes regarding Kenneth Reitz's 'requests' module, but it's going to take more real world experience with it and stabilisation of the API before that idea can be given any serious consideration.
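For a taste of why requests generates that kind of hope, here's its basic usage (assuming a recent release installed from PyPI) - the urllib2 equivalent takes noticeably more ceremony:

    # Third party: pip install requests
    import requests

    response = requests.get('http://python.org', timeout=10)
    print(response.status_code)
    print(response.headers['content-type'])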

The regex module (available on PyPI) is also often kicked around as a possible candidate for addition. Given the benefits it offers over and above the existing re module, it wouldn't surprise me to see a genuine push for its inclusion by default in 3.3 or some later version.
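One small example of those benefits (assuming the PyPI regex module is installed): overlapped matching simply doesn't exist in re:

    # Third party: pip install regex
    import regex

    # re.findall only ever returns non-overlapping matches; regex can do both.
    print(regex.findall(r'\d{2}', '12345'))                   # ['12', '34']
    print(regex.findall(r'\d{2}', '12345', overlapped=True))  # ['12', '23', '34', '45']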

A basic stats library in the stdlib (to help keep people from implementing their own variants, usually badly) is also a possibility, but again, depends on a suitable candidate being identified and put forward by people willing to maintain it.

The RFC treadmill

Many parts of the Python standard library implement various RFCs and other standards. While our release cycle isn't fast enough to track the newer, still evolving protocols, it's good to provide native support for the latest versions of mature, widely supported ones.

For example, support for sendmsg(), recvmsg() and recvmsg_into() methods will be available on socket objects in 3.3 (at least on POSIX-compatible platforms). Additions like PEP 3144 ("IP Address Manipulation Library for the Python Standard Library") also remain a possibility (although that particular example currently appears to be lacking a champion).
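
A minimal sketch of what the new methods look like (POSIX only, and assuming a 3.3 interpreter; the classic use case is passing file descriptors between processes via SCM_RIGHTS ancillary data, which is rather more involved than this):

    import socket

    # Scatter-gather IO via the new sendmsg()/recvmsg() methods.
    a, b = socket.socketpair()
    a.sendmsg([b'hello, ', b'world'])
    data, ancdata, msg_flags, address = b.recvmsg(1024)
    print(data)  # b'hello, world'
    a.close()
    b.close()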

Defining Python-specific interoperability standards

Courtesy of the PEP process, python-dev often plays a role in defining and documenting interfaces that allow various Python frameworks to play nicely together, even if third party libraries are needed to make full use of them. WSGI, the Web Server Gateway Interface, is probably the most well-known example of this (the current version of that standard is documented in PEP 3333). Other examples include the database API, cryptographic algorithm APIs, and, of course, the assorted standards relating to packaging and distribution.

PEP 3333 was a minimalist update to WSGI to make it Python 3 compatible, so, as of 3.2, it's feasible for web frameworks to start considering Python 3 releases (previously, such releases would have been rather challenging, since it was unclear how they should talk to the underlying web server). It likely isn't the final word on the topic though, as the web-sig folks still kick around ideas like PEP 444 and WSGI Lite. Whether anything actually happens on that front or if we keep chugging along with an attitude of "eh, WSGI gets the job done, quit messing with it" is an open question (and far from my area of expertise).
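
For reference, the whole point of the standard is that something this small is a complete web application under any compliant server (this is the stock PEP 3333 style "hello world"):

    # A complete PEP 3333 (WSGI) application, served by the stdlib.
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
        return [b'Hello from WSGI\n']  # response bodies are bytes under PEP 3333

    if __name__ == '__main__':
        make_server('localhost', 8000, app).serve_forever()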

One area that is being actively worked on, and will hopefully improve significantly in Python 3.3, is the low level buffer interface protocol defined by PEP 3118. It turned out that there were a few flaws in the way that PEP was implemented in CPython, so even though it did achieve the basic goal of allowing projects like NumPy and PIL to interoperate without needing to copy large data buffers around, there are still some rather rough edges in the way the protocol is exposed to Python code via memoryview objects, as well as in the respective obligations of code that produces and consumes these buffers. More information can be found on the issue tracker if you're interested in the gory details of defining conventions for reliable management of shared dynamically allocated memory in C and C++ code :)
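
The Python-level face of that protocol is the memoryview object; a quick sketch of the zero-copy behaviour it enables:

    # PEP 3118 in action: a memoryview slices a buffer without copying it.
    data = bytearray(b'0123456789')
    view = memoryview(data)
    chunk = view[2:5]     # no copy - still backed by 'data'
    data[2:5] = b'abc'
    print(bytes(chunk))   # b'abc' - the mutation shows through the view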

Enhancing and refining the import system

The importer protocol defined by PEP 302 was an ambitious attempt to decouple Python's import semantics from the underlying filesystem. It never fully succeeded - there's still a lot of Python code (including one or two components in the standard library!) that assumes the classical directories-on-a-disk layout for Python packages. That package layout, with the requirement for __init__.py files and restriction to a single directory for each package, is itself surprising and unintuitive for many Python novices.
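
For those unfamiliar with PEP 302, the protocol itself is quite small; a minimal (deliberately do-nothing) meta path hook of my own looks like this:

    import sys

    # A PEP 302 meta_path finder that merely observes import requests.
    class ImportLogger(object):
        def find_module(self, fullname, path=None):
            print('import requested: ' + fullname)
            return None  # defer to the normal import machinery

    sys.meta_path.insert(0, ImportLogger())
    import json  # triggers the hook before the standard lookup runs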

There are a few suggestions in the works ultimately aimed not only at cleaning up how all this is implemented, but also at further decoupling the module hierarchy from the on-disk file layout. There are significant challenges in doing this in a way that makes life easier for developers writing new packages, while also allowing those writing tools that manipulate packages to more easily do the right thing, but it's an area that can definitely do with some rationalisation.

Improved tools for factoring out and reusing code

Python 3.3 will at least bring with it PEP 380's "yield from" expression which makes it easy to take part of a generator and split it out into a subgenerator without affecting the overall semantics of the original generator (doing this correctly in the general case is currently significantly harder than you might think - see the PEP for details).
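
A sketch of what that will look like (this needs a 3.3 interpreter, since "return value" inside a generator is a syntax error before PEP 380):

    # PEP 380: factoring part of a generator out into a subgenerator.
    def subgen():
        yield 1
        yield 2
        return 'done'  # becomes the value of the yield from expression

    def gen():
        result = yield from subgen()  # send()/throw() are delegated too
        yield 'subgen returned: ' + result

    print(list(gen()))  # [1, 2, 'subgen returned: done']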

I suspect that the next few years may bring some more tweaks to generators, generator expressions and context managers to make for loops and with statements even more powerful utilities for factoring out code. However, any adjustments in this area will be carefully balanced against the need for language stability and keeping things reasonably easy to learn.

Not Even on the Radar

In addition to the various suggestions mentioned in PEP 3099 ("Things that Will Not Change in Python 3000"), there are a couple of specific items I think are worth calling out as explicitly not being part of any realistic road map:

Major internal architectural changes for CPython

Questions like 'Why not remove the GIL?', 'Why not switch to a register based VM?', 'Why not add a JIT to CPython?' and 'Why not make PyPy the reference interpreter?' don't really have straightforward obvious answers other than "that's harder than you might think and probably less beneficial than you might hope", so I expect people to continue asking them indefinitely. However, I also don't see any significant changes coming on any of these fronts any time soon.

One of the key advantages of the CPython interpreter as a reference implementation is that it is, fundamentally, quite a simple beast (despite a few highly sophisticated corners). If we can do things, chances are that the other implementations are also going to be able to support them. By contrast, the translation stage in their toolchain means that PyPy can contemplate features - like the existing JIT, or their current exploration of Software Transactional Memory as a way to usefully remove their Global Interpreter Lock - that simply aren't feasible with less sophisticated tools (at least, not without a much bigger development team).

Personally, I think the status quo in this space is in a pretty good place, with python-dev and CPython handling the evolution of the language specification itself, as well as providing an implementation that will work reasonably well on almost any platform with a C compiler (and preferably some level of POSIX compliance), while the PyPy crew focus on providing a fast, customisable implementation for the major platforms without getting distracted by arguments about possible new language features.

More feature backports to the 2.x series

Aside from the serious backwards compatibility problems accompanying the Unicode transition, the Py3k transition was also strongly influenced by the concept of paying down technical debt. Having legacy cruft lying around made it harder to introduce language improvements, since the additional interactions created more complexity to deal with, to the point where people just didn't want to bother any more.

By clearing out a bunch of that legacy baggage, we created a better platform for new improvements, like more pervasive (and efficient!) Unicode support, even richer metaclass capabilities, exception chaining, more intuitive division semantics and so on.

The cost, borne mostly by existing Python developers, is a long, slow transition over several years as people deal with the process of not only checking whether the automated tools can correctly handle their own code, but also waiting for all of their dependencies to be available on the updated platform. This actually seems to be going fairly well so far, even though people can be quite vocal in expressing their impatience with the current rate of progress.

Now, all of the major Python implementations are open source, so it's certainly possible for motivated developers to fork one of those implementations and start backporting features of interest from 3.x that aren't already available as downloads from PyPI (e.g. it's easy to download unittest2 to get access to 3.x unittest enhancements, so there's no reason to fork just for that kind of backport). However, anyone doing so will be taking up the task in full knowledge of the fact that the existing CPython development team found that process annoying and tedious enough that we got tired of doing it after two releases (2.6 and 2.7).
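
The usual idiom for consuming such backports keeps the burden on individual projects rather than on the interpreter (unittest2 is the real backport mentioned above; the pattern generalises to most of them):

    # Prefer the PyPI backport when running on an older interpreter.
    try:
        import unittest2 as unittest  # 2.x: provides the 3.x unittest features
    except ImportError:
        import unittest               # newer interpreters: already in the stdlib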

So, while I definitely expect a certain level of ongoing griping about this point, I don't expect it to rise to the level of anyone actually doing the work to backport things like function annotations or exception chaining to the 2.x series.

Update 1: Added missing section about RFC support
Update 2: Fix sentence about role of technical debt in Py3k transition
Update 3: Added a link to a relevant post from Guido regarding the challenges of maintaining official road maps vs one-off "where are we now?" snapshots in time (like this post)
Update 4: Added missing section on defining Python-specific standards