The Python Packaging Ecosystem

There have been a few recent articles reflecting on the current status of the Python packaging ecosystem from an end user perspective, so it seems worthwhile for me to write up my perspective as one of the lead architects for that ecosystem on how I characterise the overall problem space of software publication and distribution, where I think we are at the moment, and where I'd like to see us go in the future.

For context, the specific articles I'm replying to are:

These are all excellent pieces considering the problem space from different perspectives, so if you'd like to learn more about the topics I cover here, I highly recommend reading them.

My core software ecosystem design philosophy

Since it heavily influences the way I think about packaging system design in general, it's worth stating my core design philosophy explicitly:

  • As a software consumer, I should be able to consume libraries, frameworks, and applications in the binary format of my choice, regardless of whether or not the relevant software publishers directly publish in that format
  • As a software publisher working in the Python ecosystem, I should be able to publish my software once, in a single source-based format, and have it be automatically consumable in any binary format my users care to use

This is emphatically not the way many software packaging systems work - for a great many systems, the publication format and the consumption format are tightly coupled, and the folks managing the publication format or the consumption format actively seek to use it as a lever of control over a commercial market (think operating system vendor controlled application stores, especially for mobile devices).

The "Development, Distribution, and Deployment of Python Software" section of PEP 426 provides additional details on how this philosophy applies in practice, even though we're unlikely to ever pursue the specific design documented in the rest of that PEP (hence its "Deferred" status).

I'll also note that while I now work on software supply chain management tooling at Red Hat, that wasn't the case when I first started actively participating in the upstream Python packaging ecosystem design process. Back then I was working on Beaker, Red Hat's main hardware integration testing system, and growing increasingly frustrated with the level of effort involved in integrating new Python level dependencies into Beaker's RPM based development and deployment model. Getting actively involved in tackling these problems on the Python upstream side of things then led to also getting more actively involved in addressing them on the Red Hat downstream side.

The key conundrum

When talking about the design of software packaging ecosystems, it's very easy to fall into the trap of only considering the "direct to peer developers" use case, where the software consumer we're attempting to reach is another developer working in the same problem domain that we are, using a similar set of development tools. Common examples of this include:

  • Linux distro developers publishing software for use by other contributors to the same Linux distro ecosystem
  • Web service developers publishing software for use by other web service developers
  • Data scientists publishing software for use by other data scientists

In these more constrained contexts, you can frequently get away with using a single toolchain for both publication and consumption:

  • Linux: just use the system package manager for the relevant distro
  • Web services: just use the Python Packaging Authority's twine for publication and pip for consumption
  • Data science: just use conda for everything

For newer languages that start in one particular domain with a preferred package manager and expand outwards from there, the apparent simplicity arising from this homogeneity of use cases is frequently attributed to the design of the package manager itself, as if it were an essential property. That perception of inherent simplicity will typically fade if the language successfully expands beyond the original niche its default package manager was designed to handle.

In the case of Python, for example, distutils was designed as a consistent build interface for Linux distro package management, setuptools for plugin management in the Open Source Applications Foundation's Chandler project, pip for dependency management in web service development, and conda for local language-independent environment management in data science. distutils and setuptools haven't fared especially well from a usability perspective when pushed beyond their original design parameters (hence the current efforts to make it easier to use full-fledged build systems like SCons and Meson as an alternative when publishing Python packages), while pip and conda both seem to be doing a better job of accommodating increases in their scope of application.

This history helps illustrate that where things really have the potential to get complicated (even beyond the inherent challenges of domain-specific software distribution) is when you start needing to cross domain boundaries. For example, as the lead maintainer of contextlib in the Python standard library, I'm also the maintainer of the contextlib2 backport project on PyPI. That's not a domain specific utility - folks may need it regardless of whether they're using a self-built Python runtime, a pre-built Windows or Mac OS X binary they downloaded from python.org, a pre-built binary from a Linux distribution, a CPython runtime from some other redistributor (homebrew, pyenv, Enthought Canopy, ActiveState, Continuum Analytics, AWS Lambda, Azure Machine Learning, etc), or perhaps even a different Python runtime entirely (PyPy, PyPy.js, Jython, IronPython, MicroPython, VOC, Batavia, etc).

Fortunately for me, I don't need to worry about all that complexity in the wider ecosystem when I'm specifically wearing my contextlib2 maintainer hat - I just publish an sdist and a universal wheel file to PyPI, and the rest of the ecosystem has everything it needs to take care of redistribution and end user consumption without any further input from me.
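For the curious, that workflow can be sketched with a minimal setuptools based setup.py; the project name, version and module name below are illustrative placeholders rather than contextlib2's actual metadata:

# setup.py - minimal sketch for a pure Python, single-module project
from setuptools import setup

setup(
    name="example-pure-python-project",   # placeholder metadata
    version="1.0.0",
    py_modules=["example_module"],
)

# Building and uploading the sdist and universal wheel then looks something like:
#
#   python setup.py sdist bdist_wheel --universal
#   twine upload dist/*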

However, contextlib2 is a pure Python project that only depends on the standard library, so it's pretty much the simplest possible case from a tooling perspective (the only reason I needed to upgrade from distutils to setuptools was so I could publish my own wheel files, and the only reason I haven't switched to the much simpler pure-Python-only flit instead of either of them is that flit doesn't yet easily support publishing backwards compatible setup.py based sdists).

This means that things get significantly more complex once we start wanting to use and depend on components written in languages other than Python, so that's the broader context I'll consider next.

Platform management or plugin management?

When it comes to handling the software distribution problem in general, there are two main ways of approaching it:

  • design a plugin management system that doesn't concern itself with the management of the application framework that runs the plugins
  • design a platform component manager that not only manages the plugins themselves, but also the application frameworks that run them

This "plugin manager or platform component manager?" question shows up over and over again in software distribution architecture designs, but the case of most relevance to Python developers is in the contrasting approaches that pip and conda have adopted to handling the problem of external dependencies for Python projects:

  • pip is a plugin manager for Python runtimes. Once you have a Python runtime (any Python runtime), pip can help you add pieces to it. However, by design, it won't help you manage the underlying Python runtime (just as it wouldn't make any sense to try to install Mozilla Firefox as a Firefox Add-On, or Google Chrome as a Chrome Extension)
  • conda, by contrast, is a component manager for a cross-platform platform that provides its own Python runtimes (as well as runtimes for other languages). This means that you can get pre-integrated components, rather than having to do your own integration between plugins obtained via pip and language runtimes obtained via other means

What this means is that pip, on its own, is not in any way a direct alternative to conda. To get comparable capabilities to those offered by conda, you have to add in a mechanism for obtaining the underlying language runtimes, which means the alternatives are combinations like:

  • apt-get + pip
  • dnf + pip
  • yum + pip
  • pyenv + pip
  • homebrew (Mac OS X) + pip
  • python.org Windows installer + pip
  • Enthought Canopy
  • ActiveState's Python runtime + PyPM

This is the main reason why "just use conda" is excellent advice to any prospective Pythonista that isn't already using one of the platform component managers mentioned above: giving that answer replaces an otherwise operating system dependent or Python specific answer to the runtime management problem with a cross-platform and (at least somewhat) language neutral one.

It's an especially good answer for Windows users, as Chocolatey/OneGet/Windows Package Management isn't remotely comparable to pyenv or homebrew at this point in time, other runtime managers don't work on Windows, and getting folks bootstrapped with MinGW, Cygwin or the new (still experimental) Windows Subsystem for Linux is just another hurdle to place between them and whatever goal they're learning Python for in the first place.

However, conda's pre-integration based approach to tackling the external dependency problem is also why "just use conda for everything" isn't a sufficient answer for the Python software ecosystem as a whole.

If you're working on an operating system component for Fedora, Debian, or any other distro, you actually want to be using the system provided Python runtime, and hence need to be able to readily convert your upstream Python dependencies into policy compliant system dependencies.

Similarly, if you're wanting to support folks that deploy to a preconfigured Python environment in services like AWS Lambda, Azure Cloud Functions, Heroku, OpenShift or Cloud Foundry, or that use alternative Python runtimes like PyPy or MicroPython, then you need a publication technology that doesn't tightly couple your releases to a specific version of the underlying language runtime.

As a result, pip and conda end up existing at slightly different points in the system integration pipeline:

  • Publishing and consuming Python software with pip is a matter of "bring your own Python runtime". This has the benefit that you can readily bring your own runtime (and manage it using whichever tools make sense for your use case), but also has the downside that you must supply your own runtime (which can sometimes prove to be a significant barrier to entry for new Python users, as well as being a pain for cross-platform environment management).
  • Like Linux system package managers before it, conda takes away the requirement to supply your own Python runtime by providing one for you. This is great if you don't have any particular preference as to which runtime you want to use, but if you do need to use a different runtime for some reason, you're likely to end up fighting against the tooling, rather than having it help you. (If you're tempted to answer "Just add another interpreter to the pre-integrated set!" here, keep in mind that doing so without the aid of a runtime independent plugin manager like pip acts as a multiplier on the platform level integration testing needed, which can be a significant cost even when it's automated)

Where do we go next?

In case it isn't already clear from the above, I'm largely happy with the respective niches that pip and conda are carving out for themselves as a plugin manager for Python runtimes and as a cross-platform platform focused on (but not limited to) data analysis use cases.

However, there's still plenty of scope to improve the effectiveness of the collaboration between the upstream Python Packaging Authority and downstream Python redistributors, as well as to reduce barriers to entry for participation in the ecosystem in general, so I'll go over some of the key areas I see for potential improvement.

Sustainability and the bystander effect

It's not a secret that the core PyPA infrastructure (PyPI, pip, twine, setuptools) is nowhere near as well-funded as you might expect given its criticality to the operations of some truly enormous organisations.

The biggest impact of this is that even when volunteers show up ready and willing to work, there may not be anybody in a position to effectively wrangle those volunteers, and help keep them collaborating effectively and moving in a productive direction.

To secure long term sustainability for the core Python packaging infrastructure, we're only talking amounts on the order of a few hundred thousand dollars a year - enough to cover some dedicated operations and publisher support staff for PyPI (freeing up the volunteers currently handling those tasks to help work on ecosystem improvements), as well as to fund targeted development directed at some of the other problems described below.

However, rather than being a true "tragedy of the commons", I personally chalk this situation up to a different human cognitive bias: the bystander effect.

The reason I think that is that we have so many potential sources of the necessary funding that even folks that agree there's a problem that needs to be solved are assuming that someone else will take care of it, without actually checking whether or not that assumption is entirely valid.

The primary responsibility for correcting that oversight falls squarely on the Python Software Foundation, which is why the Packaging Working Group was formed in order to investigate possible sources of additional funding, as well as to determine how any such funding can be spent most effectively.

However, a secondary responsibility also falls on customers and staff of commercial Python redistributors, as this is exactly the kind of ecosystem level risk that commercial redistributors are being paid to manage on behalf of their customers, and they're currently not handling this particular situation very well. Accordingly, anyone that's actually paying for CPython, pip, and related tools (either directly or as a component of a larger offering), and expecting them to be supported properly as a result, really needs to be asking some very pointed questions of their suppliers right about now. (Here's a sample question: "We pay you X dollars a year, and the upstream Python ecosystem is one of the things we expect you to support with that revenue. How much of what we pay you goes towards maintenance of the upstream Python packaging infrastructure that we rely on every day?").

One key point to note about the current situation is that as a 501(c)(3) public interest charity, any work the PSF funds will be directed towards better fulfilling that public interest mission, and that means focusing primarily on the needs of educators and non-profit organisations, rather than those of private for-profit entities.

Commercial redistributors are thus far better positioned to properly represent their customers' interests in areas where their priorities may diverge from those of the wider community (closing the "insider threat" loophole in PyPI's current security model is a particular case that comes to mind - see Making PyPI security independent of SSL/TLS).

Migrating PyPI to pypi.org

An instance of the new PyPI implementation (Warehouse) is up and running at https://pypi.org/ and connected directly to the production PyPI database, so folks can already explicitly opt-in to using it over the legacy implementation if they prefer to do so.

However, there's still a non-trivial amount of design, development and QA work needed on the new version before all existing traffic can be transparently switched over to using it.

Getting at least this step appropriately funded and a clear project management plan in place is the main current focus of the PSF's Packaging Working Group.

Making the presence of a compiler on end user systems optional

Between the wheel format and the manylinux1 usefully-distro-independent ABI definition, this is largely handled now, with conda available as an option to handle the relatively small number of cases that are still a problem for pip.

The main unsolved problem is to allow projects to properly express the constraints they place on target environments so that issues can be detected at install time or repackaging time, rather than only being detected as runtime failures. Such a feature will also greatly expand the ability to correctly generate platform level dependencies when converting Python projects to downstream package formats like those used by conda and Linux system package managers.

Bootstrapping dependency management tools on end user systems

With pip being bundled with recent versions of CPython (including CPython 2.7 maintenance releases), and pip (or a variant like upip) also being bundled with most other Python runtimes, the ecosystem bootstrapping problem has largely been addressed for new Python users.

There are still a few usability challenges to be addressed (like defaulting to per-user installations when outside a virtual environment, interoperating more effectively with platform component managers like conda, and providing an officially supported installation interface that works at the Python prompt rather than via the operating system command line), but those don't require the same level of political coordination across multiple groups that was needed to establish pip as the lowest common denominator approach to dependency management for Python applications.

Making the use of distutils and setuptools optional

As mentioned above, distutils was designed ~18 years ago as a common interface for Linux distributions to build Python projects, while setuptools was designed ~12 years ago as a plugin management system for an open source Microsoft Exchange replacement. While both projects have given admirable service in their original target niches, and quite a few more besides, their age and original purpose mean they're significantly more complex than what a user needs if all they want to do is to publish their pure Python library or framework to the Python Package Index.

Their underlying complexity also makes it incredibly difficult to improve the problematic state of their documentation, which is split between the legacy distutils documentation in the CPython standard library and the additional setuptools specific documentation in the setuptools project.

Accordingly, what we want to do is to change the way build toolchains for Python projects are organised to have 3 clearly distinct tiers:

  • toolchains for pure Python projects
  • toolchains for Python projects with simple C extensions
  • toolchains for C/C++/other projects with Python bindings

This allows folks to be introduced to simpler tools like flit first, better enables the development of potential alternatives to setuptools at the second tier, and supports the use of full-fledged pip-installable build systems like SCons and Meson at the third tier.

The first step in this project, defining the pyproject.toml format to allow declarative specification of the dependencies needed to launch setup.py, has been implemented, and Daniel Holth's enscons project demonstrates that this is already sufficient to bootstrap an external build system even without the later stages of the project.

Future steps include providing native support for pyproject.toml in pip and easy_install, as well as defining a declarative approach to invoking the build system rather than having to run setup.py with the relevant distutils & setuptools flags.

Making PyPI security independent of SSL/TLS

PyPI currently relies entirely on SSL/TLS to protect the integrity of the link between software publishers and PyPI, and between PyPI and software consumers. The only protections against insider threats from within the PyPI administration team are ad hoc usage of GPG artifact signing by some projects, personal vetting of new team members by existing team members and 3rd party checks against previously published artifact hashes unexpectedly changing.

A credible design for end-to-end package signing that adequately accounts for the significant usability issues that can arise around publisher and consumer key management has been available for almost 3 years at this point (see Surviving a Compromise of PyPI and Surviving a Compromise of PyPI: the Maximum Security Edition).

However, implementing that solution has been gated not only on being able to first retire the legacy infrastructure, but also on the PyPI administrators being able to credibly commit to the key management obligations of operating the signing system, as well as to ensuring that the system-as-implemented actually provides the security guarantees of the system-as-designed.

Accordingly, this isn't a project that can realistically be pursued until the underlying sustainability problems have been suitably addressed.

Automating wheel creation

While redistributors will generally take care of converting upstream Python packages into their own preferred formats, the Python-specific wheel format is currently one where it is left up to publishers to decide whether or not to create wheels at all, and, if they do, how to automate that process.

Having PyPI take care of this process automatically is an obviously desirable feature, but it's also an incredibly expensive one to build and operate.

Thus, it currently makes sense to defer this cost to individual projects, as there are quite a few commercial continuous integration and continuous deployment service providers willing to offer free accounts to open source projects, and these can also be used for the task of producing release artifacts. Projects also remain free to only publish source artifacts, relying on pip's implicit wheel creation and caching and the appropriate use of private PyPI mirrors and caches to meet the needs of end users.

For downstream platform communities already offering shared build infrastructure to their members (such as Linux distributions and conda-forge), it may make sense to offer Python wheel generation as a supported output option for cross-platform development use cases, in addition to the platform's native binary packaging format.

What problem does it solve?

One of the more puzzling aspects of Python for newcomers to the language is the stark usability differences between the standard library's urllib module and the popular (and well-recommended) third party module, requests, when it comes to writing HTTP(S) protocol clients. When your problem is "talk to a HTTP server", the difference in usability isn't immediately obvious, but it becomes clear as soon as additional requirements like SSL/TLS, authentication, redirect handling, session management, and JSON request and response bodies enter the picture.
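As a rough illustration of the gap, here's a hedged sketch of fetching and decoding a JSON response both ways (the URL is a placeholder, and error handling beyond raise_for_status is omitted):

import json
from urllib.request import urlopen

import requests  # third party

# Standard library approach: open the URL, read the raw bytes, decode, then parse
with urlopen("https://api.example.com/items") as response:
    items = json.loads(response.read().decode("utf-8"))

# requests approach: redirect handling, sessions and JSON decoding come built in
response = requests.get("https://api.example.com/items")
response.raise_for_status()
items = response.json()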

It's tempting, and entirely understandable, to want to chalk this difference in ease of use up to requests being "Pythonic" (in 2016 terms), while urllib has now become un-Pythonic (despite being included in the standard library).

While there are certainly a few elements of that (e.g. the property builtin was only added in Python 2.2, while urllib2 was included in the original Python 2.0 release and hence couldn't take that into account in its API design), the vast majority of the usability difference relates to an entirely different question we often forget to ask about the software we use: What problem does it solve?

That is, many otherwise surprising discrepancies between urllib/urllib2 and requests are best explained by the fact that they solve different problems, and the problems most HTTP client developers have today are closer to those Kenneth Reitz designed requests to solve in 2010/2011, than they are to the problems that Jeremy Hylton was aiming to solve more than a decade earlier.

It's all in the name

To quote the current Python 3 urllib package documentation: "urllib is a package that collects several modules for working with URLs".

And the docstring from Jeremy's original commit message adding urllib2 to CPython: "An extensible library for opening URLs using a variety [of] protocols".

Wait, what? We're just trying to write a HTTP client, so why is the documentation talking about working with URLs in general?

While it may seem strange to developers accustomed to the modern HTTPS+JSON powered interactive web, it wasn't always clear that that was how things were going to turn out.

At the turn of the century, the expectation was instead that we'd retain a rich variety of data transfer protocols with different characteristics optimised for different purposes, and that the most useful client to have in the standard library would be one that could be used to talk to multiple different kinds of servers (like HTTP, FTP, NFS, etc), without client developers needing to worry too much about the specific protocol used (as indicated by the URL schema).

In practice, things didn't work out that way (mostly due to restrictive institutional firewalls meaning HTTP servers were the only remote services that could be accessed reliably), so folks in 2016 are now regularly comparing the usability of a dedicated HTTP(S)-only client library with a general purpose URL handling library that needs to be configured to specifically be using HTTP(S) before you gain access to most HTTP(S) features.

When it was written, urllib2 was a square peg that was designed to fit into the square hole of "generic URL processing". By contrast, most modern client developers are looking for a round peg to fit into the round hole that is HTTPS+JSON processing - urllib/urllib2 will fit if you shave the corners off first, but requests comes pre-rounded.

So why not add requests to the standard library?

Answering the not-so-obvious question of "What problem does it solve?" then leads to a more obvious follow-up question: if the problems that urllib/urllib2 were designed to solve are no longer common, while the problems that requests solves are common, why not add requests to the standard library?

If I recall correctly, Guido gave in-principle approval to this idea at a language summit back in 2013 or so (after the requests 1.0 release), and it's a fairly common assumption amongst the core development team that either requests itself (perhaps as a bundled snapshot of an independently upgradable component) or a compatible subset of the API with a different implementation will eventually end up in the standard library.

However, even putting aside the misgivings of the requests developers about the idea, there are still some non-trivial system integration problems to solve in getting requests to a point where it would be acceptable as a standard library component.

In particular, one of the things that requests does to more reliably handle SSL/TLS certificates in a cross-platform way is to bundle the Mozilla Certificate Bundle included in the certifi project. This is a sensible thing to do by default (due to the difficulties of obtaining reliable access to system security certificates in a cross-platform way), but it conflicts with the security policy of the standard library, which specifically aims to delegate certificate management to the underlying operating system. That policy aims to address two needs: allowing Python applications access to custom institutional certificates added to the system certificate store (most notably, private CA certificates for large organisations), and avoiding adding an additional certificate store to end user systems that needs to be updated when the root certificate bundle changes for any other reason.
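The difference between the two policies can be sketched in a few lines of Python (assuming the certifi package is installed; both calls are part of the public ssl and certifi APIs):

import ssl
import certifi

# requests-style default: trust the Mozilla CA bundle shipped with certifi
bundled_context = ssl.create_default_context(cafile=certifi.where())

# standard library default: trust whatever certificates the operating system
# provides, including any custom institutional CAs added to the system store
system_context = ssl.create_default_context()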

These kinds of problems are technically solvable, but they're not fun to solve, and the folks in a position to help solve them already have a great many other demands on their time. This means we're not likely to see much in the way of progress in this area as long as most of the CPython and requests developers are pursuing their upstream contributions as a spare time activity, rather than as something they're specifically employed to do.

Propose a talk for the PyCon Australia Education Seminar!

The pitch

Involved in Australian education, whether formally or informally? Making use of Python in your classes, workshops or other activities? Interested in sharing your efforts with other Australian educators, and with the developers that create the tools you use? Able to get to the Melbourne Convention & Exhibition Centre on Friday August 12th, 2016?

Then please consider submitting a proposal to speak at the Python in Australian Education seminar at PyCon Australia 2016! More information about the seminar can be found here, while details of the submission process are on the main Call for Proposals page.

Submissions close on Sunday May 8th, but may be edited further after submission (including during the proposal review process based on feedback from reviewers).

PyCon Australia is a community-run conference, so everyone involved is a volunteer (organisers, reviewers, and speakers alike), but accepted speakers are eligible for discounted (or even free) registration, and assistance with other costs is also available to help ensure the conference doesn't miss out on excellent presentations due to financial need (for teachers needing to persuade skeptical school administrators, this assistance may extend to contributing towards the costs of engaging a substitute teacher for the day).

The background

At PyCon Australia 2014, James Curran presented an excellent keynote on "Python for Every Child in Australia", covering some of the history of the National Computer Science School, the development of Australia's National Digital Curriculum (finally approved in September 2015), and the opportunity this represented to introduce the next generation of students to computational thinking in general, and Python in particular.

Encouraged by both Dr Curran's keynote at PyCon Australia, and Professor Lorena Barba's "If There's Computational Thinking, There's Computational Learning" keynote at SciPy 2014, it was my honour and privilege in 2015 not only to invite Carrie Anne Philbin, Education Pioneer at the UK's Raspberry Pi Foundation, to speak at the main conference (on "Designed for Education: a Python Solution"), but also to invite her to keynote the inaugural Python in Australian Education seminar. With the support of the Python Software Foundation and Code Club Australia, Carrie Anne joined QSITE's Peter Whitehouse, Code Club Australia's Kelly Tagalan, and several other local educators, authors and community workshop organisers to present an informative, inspirational and sometimes challenging series of talks.

For 2016, we have a new location in Melbourne (PyCon Australia has a two year rotation in each city, and the Education seminar was launched during the second year in Brisbane), a new co-organiser (Katie Bell of Grok Learning and the National Computer Science School), and a Call for Proposals and financial assistance program that are fully integrated with those for the main conference.

As with the main conference, however, the Python in Australian Education seminar is designed around the idea of real world practitioners sharing information with each other about their day to day experiences, what has worked well for them, and what hasn't, and creating personal connections that can help facilitate additional collaboration throughout the year.

So, in addition to encouraging people to submit their own proposals, I'd also encourage folks to talk to their friends and peers that they'd like to see presenting, and see if they're interested in participating.

27 languages to improve your Python

As a co-designer of one of the world's most popular programming languages, one of the more frustrating behaviours I regularly see (both in the Python community and in others) is influential people trying to tap into fears of "losing" to other open source communities as a motivating force for community contributions. (I'm occasionally guilty of this misbehaviour myself, which makes it even easier to spot when others are falling into the same trap).

While learning from the experiences of other programming language communities is a good thing, fear based approaches to motivating action are seriously problematic, as they encourage community members to see members of those other communities as enemies in a competition for contributor attention, rather than as potential allies in the larger challenge of advancing the state of the art in software development. It also has the effect of telling folks that enjoy those other languages that they're not welcome in a community that views them and their peers as "hostile competitors".

In truth, we want there to be a rich smorgasbord of cross platform open source programming languages to choose from, as programming languages are first and foremost tools for thinking - they make it possible for us to convey our ideas in terms so explicit that even a computer can understand them. If someone has found a language to use that fits their brain and solves their immediate problems, that's great, regardless of the specific language (or languages) they choose.

So I have three specific requests for the Python community, and one broader suggestion. First, the specific requests:

  1. If we find it necessary to appeal to tribal instincts to motivate action, we should avoid using tribal fear, and instead aim to use tribal pride. When we use fear as a motivator, as in phrasings like "If we don't do X, we're going to lose developer mindshare to language Y", we're deliberately creating negative emotions in folks freely contributing the results of their work to the world at large. Relying on tribal pride instead leads to phrasings like "It's currently really unclear how to solve problem X in Python. If we look to ecosystem Y, we can see they have a really nice approach to solving problem X that we can potentially adapt to provide a similarly nice user experience in Python". Actively emphasising taking pride in our own efforts, rather than denigrating the efforts of others, helps promote a culture of continuous learning within the Python community and also encourages the development of ever improving collaborative relationships with other communities.
  2. Refrain from adopting attitudes of contempt towards other open source programming language communities, especially if those communities have empowered people to solve their own problems rather than having to wait for commercial software vendors to deign to address them. Most of the important problems in the world aren't profitable to solve (as the folks afflicted by them aren't personally wealthy and don't control institutional funding decisions), so we should be encouraging and applauding the folks stepping up to try to solve them, regardless of what we may think of their technology choices.
  3. If someone we know is learning to program for the first time, and they choose to learn a language we don't personally like, we should support them in their choice anyway. They know what fits their brain better than we do, so the right language for us may not be the right language for them. If they start getting frustrated with their original choice, to the point where it's demotivating them from learning to program at all, then it makes sense to start recommending alternatives. This advice applies even for those of us involved in improving the tragically bad state of network security: the way we solve the problem with inherently insecure languages is by improving operating system sandboxing capabilities, progressively knocking down barriers to adoption for languages with better native security properties, and improving the default behaviours of existing languages, not by confusing beginners with arguments about why their chosen language is a poor choice from an application security perspective. (If folks are deploying unaudited software written by beginners to handle security sensitive tasks, it isn't the folks writing the software that are the problem, it's the folks deploying it without performing appropriate due diligence on the provenance and security properties of that software)

My broader suggestion is aimed at folks that are starting to encounter the limits of the core procedural subset of Python and would hence like to start exploring more of Python's own available "tools for thinking".

Broadening our horizons

One of the things we do as part of the Python core development process is to look at features we appreciate having available in other languages we have experience with, and see whether or not there is a way to adapt them to be useful in making Python code easier to both read and write. This means that learning another programming language that focuses more specifically on a given style of software development can help improve anyone's understanding of that style of programming in the context of Python.

To aid in such efforts, I've provided a list below of some possible areas for exploration, and other languages which may provide additional insight into those areas. Where possible, I've linked to Wikipedia pages rather than directly to the relevant home pages, as Wikipedia often provides interesting historical context that's worth exploring when picking up a new programming language as an educational exercise rather than for immediate practical use.

While I do know many of these languages personally (and have used several of them in developing production systems), the full list of recommendations includes additional languages that I only know indirectly (usually by either reading tutorials and design documentation, or by talking to folks that I trust to provide good insight into a language's strengths and weaknesses).

There are a lot of other languages that could have gone on this list, so the specific ones listed are a somewhat arbitrary subset based on my own interests (for example, I'm mainly interested in the dominant Linux, Android and Windows ecosystems, so I left out the niche-but-profitable Apple-centric Objective-C and Swift programming languages, and I'm not familiar enough with art-focused environments like Processing to even guess at what learning them might teach a Python developer). For a more complete list that takes into account factors beyond what a language might teach you as a developer, IEEE Spectrum's annual ranking of programming language popularity and growth is well worth a look.

Procedural programming: C, Rust, Cython

Python's default execution model is procedural: we start at the top of the main module and execute it statement by statement. All of Python's support for the other approaches to data and computational modelling covered below is built on this procedural foundation.

The C programming language is still the unchallenged ruler of low level procedural programming. It's the core implementation language for the reference Python interpreter, and also for the Linux operating system kernel. As a software developer, learning C is one of the best ways to start learning more about the underlying hardware that executes software applications - C is often described as "portable assembly language", and one of the first applications cross-compiled for any new CPU architecture will be a C compiler.

Rust, by contrast, is a relatively new programming language created by Mozilla. The reason it makes this list is because Rust aims to take all of the lessons we've learned as an industry regarding what not to do in C, and design a new language that is interoperable with C libraries, offers the same precise control over hardware usage that is needed in a low level systems programming language, but uses a different compile time approach to data modelling and memory management to structurally eliminate many of the common flaws afflicting C programs (such as buffer overflows, double free errors, null pointer access, and thread synchronisation problems). I'm an embedded systems engineer by training and initial professional experience, and Rust is the first new language I've seen that looks like it may have the potential to scale down to all of the niches currently dominated by C and custom assembly code.

Cython is also a lower level procedural-by-default language, but unlike general purpose languages like C and Rust, Cython is aimed specifically at writing CPython extension modules. To support that goal, Cython is designed as a Python superset, allowing the programmer to choose when to favour the pure Python syntax for flexibility, and when to favour Cython's syntax extensions that make it possible to generate code that is equivalent to native C code in terms of speed and memory efficiency.

Learning one of these languages is likely to provide insight into memory management, algorithmic efficiency, binary interface compatibility, software portability, and other practical aspects of turning source code into running systems.

Object-oriented data modelling: Java, C#, Eiffel

One of the main things we need to do in programming is to model the state of the real world, and offering native syntactic support for object-oriented programming is one of the most popular approaches for doing that: structurally grouping data structures and the methods for operating on them into classes.

Python itself is deliberately designed so that it is possible to use the object-oriented features without first needing to learn to write your own classes. Not every language adopts that approach - those listed in this section are ones that consider learning object-oriented design to be a requirement for using the language at all.

After a major marketing push by Sun Microsystems in the mid-to-late 1990s, Java became the default language for teaching introductory computer science in many tertiary institutions. While it is now being displaced by Python for many educational use cases, it remains one of the most popular languages for the development of business applications. There are a range of other languages that target the common JVM (Java Virtual Machine) runtime, including the Jython implementation of Python. The Dalvik and ART environments for Android systems are based on a reimplementation of the Java programming APIs.

C# is similar in many ways to Java, and emerged as an alternative after Sun and Microsoft failed to work out their business differences around Microsoft's Java implementation, J++. Like Java, it's a popular language for the development of business applications, and there are a range of other languages that target the shared .NET CLR (Common Language Runtime), including the IronPython implementation of Python (the core components of the original IronPython 1.0 implementation were extracted to create the language neutral .NET Dynamic Language Runtime). For a long time, .NET was a proprietary Windows specific technology, with mono as a cross-platform open source reimplementation, but Microsoft shifted to an open source ecosystem strategy in early 2015.

Unlike most of the languages in this list, Eiffel isn't one I'd recommend for practical day-to-day use. Rather, it's one I recommend because learning it taught me an incredible amount about good object-oriented design where "verifiably correct" is a design goal for the application. (Learning Eiffel also taught me a lot about why "verifiably correct" isn't actually a design goal in most software development, as verifiably correct software really doesn't cope well with ambiguity and is entirely unsuitable for cases where you genuinely don't know the relevant constraints yet and need to leave yourself enough wiggle room to be able to figure out the finer details through iterative development).

Learning one of these languages is likely to provide insight into inheritance models, design-by-contract, class invariants, pre-conditions, post-conditions, covariance, contravariance, method resolution order, generic programming, and various other notions that also apply to Python's type system. There are also a number of standard library modules and third party frameworks that use this "visibly object-oriented" design style, such as the unittest and logging modules, and class-based views in the Django web framework.
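For instance, even a minimal unittest based test suite uses that visibly object-oriented style (a small sketch with an arbitrary assertion):

import unittest

class TestAddition(unittest.TestCase):
    """Test cases are methods defined on a subclass of unittest.TestCase"""

    def test_integers(self):
        self.assertEqual(1 + 2, 3)

if __name__ == "__main__":
    unittest.main()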

Object-oriented C derivatives: C++, D

One way of using the CPython runtime is as a "C with objects" programming environment - at its core, CPython is implemented using C's approach to object-oriented programming, which is to define C structs to hold the data of interest, and to pass in instances of the struct as the first argument to functions that then manipulate that data (these are the omnipresent PyObject* pointers in the CPython C API). This design pattern is deliberately mirrored at the Python level in the form of the explicit self and cls arguments to instance methods and class methods.
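That mirroring means any bound method call in Python can be rewritten as an explicit "pass the instance as the first argument" call, much as you would pass a struct pointer to a C function (a small illustrative sketch):

class Counter:
    def __init__(self):
        self.count = 0

    def increment(self, amount=1):
        self.count += amount

c = Counter()
c.increment(2)           # the usual bound method call
Counter.increment(c, 2)  # equivalent explicit call, mirroring the C convention
print(c.count)           # -> 4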

C++ is a programming language that aimed to retain full source compatibility with C, while adding higher level features like native object-oriented programming support and template based metaprogramming. It's notoriously verbose and hard to program in (although the 2011 update to the language standard addressed many of the worst problems), but it's also the language of choice in many contexts, including 3D modelling graphics engines and cross-platform application development frameworks like Qt.

The D programming language is also interesting, as it has a similar relationship to C++ as Rust has to C: it aims to keep most of the desirable characteristics of C++, while also avoiding many of its problems (like the lack of memory safety). Unlike Rust, D was not a ground up design of a new programming language from scratch - instead, D is a close derivative of C++, and while it isn't a strict C superset as C++ is, it does follow the design principle that any code that falls into the common subset of C and D must behave the same way in both languages.

Learning one of these languages is likely to provide insight into the complexities of combining higher level language features with the underlying C runtime model. Learning C++ is also likely to be useful when using Python to manipulate existing libraries and toolkits written in C++.

Array-oriented data processing: MATLAB/Octave, Julia

Array oriented programming is designed to support numerical programming models: those based on matrix algebra and related numerical methods.

While Python's standard library doesn't support this directly, array oriented programming is taken into account in the language design, with a range of syntactic and semantic features being added specifically for the benefit of the third party NumPy library and similarly array-oriented tools.
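The matrix multiplication operator added in Python 3.5 through PEP 465 is a recent example of that kind of support; here's a minimal sketch using NumPy (assuming a NumPy release new enough to implement the operator):

import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(a @ b)    # matrix multiplication via the dedicated binary operator
print(a * b)    # elementwise multiplication, by contrast
print(a[:, 0])  # multidimensional slicing, another NumPy-driven language feature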

In many cases, the Scientific Python stack is adopted as an alternative to the proprietary MATLAB programming environment, which is used extensively for modelling, simulation and numerical data analysis in science and engineering. GNU Octave is an open source alternative that aims to be syntactically compatible with MATLAB code, allowing folks to compare and contrast the two approaches to array-oriented programming.

Julia is another relatively new language, which focuses heavily on array oriented programming and type-based function overloading.

Learning one of these languages is likely to provide insight into the capabilities of the Scientific Python stack, as well as providing opportunities to explore hardware level parallel execution through technologies like OpenCL and Nvidia's CUDA, and distributed data processing through ecosystems like Apache Spark and the Python-specific Blaze.

Statistical data analysis: R

As access to large data sets has grown, so has demand for capable freely available analytical tools for processing those data sets. One such tool is the R programming language, which focuses specifically on statistical data analysis and visualisation.

Learning R is likely to provide insight into the statistical analysis capabilities of the Scientific Python stack, especially the pandas data manipulation library and the seaborn statistical visualisation library.
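As a small taste of those capabilities on the Python side, grouped summary statistics in pandas take only a few lines (illustrative data, and assuming pandas is installed):

import pandas as pd

observations = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.2, 3.4, 2.1, 4.3],
})
print(observations.groupby("group")["value"].describe())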

Computational pipeline modelling: Haskell, Scala, Clojure, F#

Object-oriented data modelling and array-oriented data processing focus a lot of attention on modelling data at rest, either in the form of collections of named attributes or as arrays of structured data.

By contrast, functional programming languages emphasise the modelling of data in motion, in the form of computational flows. Learning at least the basics of functional programming can help greatly improve the structure of data transformation operations even in otherwise procedural, object-oriented or array-oriented applications.

Haskell is a functional programming language that has had a significant influence on the design of Python, most notably through the introduction of list comprehensions in Python 2.0.

Scala is an (arguably) functional programming language for the JVM that, together with Java, Python and R, is one of the four primary programming languages for the Apache Spark data analysis platform. While being designed to encourage functional programming approaches, Scala's syntax, data model, and execution model are also designed to minimise barriers to adoption for current Java programmers (hence the "arguably" - the case can be made that Scala is better categorised as an object-oriented programming language with strong functional programming support).

Clojure is another functional programming language for the JVM that is designed as a dialect of Lisp. It earns its place in this list by being the inspiration for the toolz functional programming toolkit for Python.

F# isn't a language I'm particularly familiar with myself, but seems worth noting as the preferred functional programming language for the .NET CLR.

Learning one of these languages is likely to provide insight into Python's own computational pipeline modelling tools, including container comprehensions, generators, generator expressions, the functools and itertools standard library modules, and third party functional Python toolkits like toolz.
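For a sense of what those tools look like in Python itself, here's a small sketch that chains generator expressions with itertools to lazily process an effectively unbounded input (arbitrary example data):

from itertools import islice

def squares_of_evens(numbers):
    """A lazy pipeline: filter, then transform, without building intermediate lists"""
    evens = (n for n in numbers if n % 2 == 0)
    return (n * n for n in evens)

# Take just the first three results, so only a handful of items are ever computed
print(list(islice(squares_of_evens(range(10**9)), 3)))  # -> [0, 4, 16]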

Event driven programming: JavaScript, Go, Erlang, Elixir

Computational pipelines are an excellent way to handle data transformation and analysis problems, but many problems require that an application run as a persistent service that waits for events to occur, and then handles those events. In these kinds of services, it is usually essential to be able to handle multiple events concurrently in order to be able to accommodate multiple users (or at least multiple actions) at the same time.

JavaScript was originally developed as an event handling language for web browsers, permitting website developers to respond locally to client side actions (such as mouse clicks and key presses) and events (such as the page rendering being completed). It is supported in all modern browsers, and together with the HTML5 Document Object Model, has become a de facto standard for defining the appearance and behaviour of user interfaces.

Go was designed by Google as a purpose built language for creating highly scalable web services, and has also proven to be a very capable language for developing command line applications. The most interesting aspect of Go from a programming language design perspective is its use of Communicating Sequential Processes concepts in its core concurrency model.

Erlang was designed by Ericsson as a purpose built language for creating highly reliable telephony switches and similar devices, and is the language powering the popular RabbitMQ message broker. Erlang uses the Actor model as its core concurrency primitive, passing messages between threads of execution, rather than allowing them to share data directly. While I've never programmed in Erlang myself, my first full-time job involved working with (and on) an Actor-based concurrency framework for C++ developed by an ex-Ericsson engineer, as well as developing such a framework myself based on the TSK (Task) and MBX (Mailbox) primitives in Texas Instruments' lightweight DSP/BIOS runtime (now known as TI-RTOS).

Elixir earns an entry on the list by being a language designed to run on the Erlang VM that exposes the same concurrency semantics as Erlang, while also providing a range of additional language level features to help provide a more well-rounded environment that is more likely to appeal to developers migrating from other languages like Python, Java, or Ruby.

Learning one of these languages is likely to provide insight into Python's own concurrency and parallelism support, including native coroutines, generator based coroutines, the concurrent.futures and asyncio standard library modules, third party network service development frameworks like Twisted and Tornado, the channels concept being introduced to Django, and the event handling loops in GUI frameworks.
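As a minimal sketch of the native coroutine syntax that underpins this style in Python 3.5+ (a toy example rather than a realistic service):

import asyncio

async def handle_event(name, delay):
    """Wait for a simulated event, then report that it was handled"""
    await asyncio.sleep(delay)
    print("handled", name)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(
    handle_event("first", 0.1),
    handle_event("second", 0.2),
))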

Gradual typing: TypeScript

One of the more controversial features that landed in Python 3.5 was the new typing module, which brings a standard lexicon for gradual typing support to the Python ecosystem.

For folks whose primary exposure to static typing is in languages like C, C++ and Java, this seems like an astoundingly terrible idea (hence the controversy).

Microsoft's TypeScript, which adds gradual typing to JavaScript applications, provides a better illustration of the concept. TypeScript code compiles to JavaScript code (which then doesn't include any runtime type checking), and TypeScript annotations for popular JavaScript libraries are maintained in the dedicated DefinitelyTyped repository.

As Chris Neugebauer pointed out in his PyCon Australia presentation, this is very similar to the proposed relationship between Python, the typeshed type hint repository, and type inference and analysis tools like mypy.

In essence, both TypeScript and type hinting in Python are ways of writing particular kinds of tests, either as separate files (just like normal tests), or inline with the main body of the code (just like type declarations in statically typed languages). In either case, you run a separate command to actually check that the rest of the code is consistent with the available type assertions (this occurs implicitly as part of the compilation to JavaScript for TypeScript, and as an entirely optional static analysis task for Python's type hinting).
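A small sketch of what that looks like on the Python side (the annotations have no effect at runtime; a separate checker such as mypy reads them):

from typing import List

def mean(values: List[float]) -> float:
    """Type hints document the expected types without enforcing them at runtime"""
    return sum(values) / len(values)

mean([1.0, 2.0, 3.0])  # fine, and a static checker agrees
mean("not a list")     # Python happily makes the call (and fails inside it), but mypy flags it statically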

Dynamic metaprogramming: Hy, Ruby

A feature folks coming to Python from languages like C, C++, C# and Java often find disconcerting is the notion that "code is data": the fact that things like functions and classes are runtime objects that can be manipulated like any other object.

Hy is a Lisp dialect that runs on both the CPython VM and the PyPy VM. Lisp dialects take the "code as data" concept to extremes, as Lisp code consists of nested lists describing the operations to be performed (the name of the language itself stands for "LISt Processor"). The great strength of Lisp-style languages is that they make it incredibly easy to write your own domain specific languages. The great weakness of Lisp-style languages is that they make it incredibly easy to write your own domain specific languages, which can sometimes make it difficult to read other people's code.

Ruby is a language that is similar to Python in many respects, but as a community is far more open to making use of dynamic metaprogramming features that are "supported, but not encouraged" in Python. This includes things like reopening class definitions to add additional methods, and using closures to implement core language constructs like iteration.

Learning one of these languages is likely to provide insight into Python's own dynamic metaprogramming support, including function and class decorators, monkeypatching, the unittest.mock standard library module, and third party object proxying modules like wrapt. (I'm not aware of any languages to learn that are likely to provide insight into Python's metaclass system, so if anyone has any suggestions on that front, please mention them in the comments. Metaclasses power features like the core type system, abstract base classes, enumeration types and runtime evaluation of gradual typing expressions)
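A quick sketch of two of those features, a function decorator and test-style monkeypatching via unittest.mock (a toy example):

import functools
from unittest import mock

def log_calls(func):
    """A decorator: functions are objects, so they can be wrapped and replaced"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper

@log_calls
def greet(name):
    return "Hello, " + name

print(greet("world"))

# Monkeypatching: temporarily swap out an attribute on a live object or module
with mock.patch("builtins.print") as fake_print:
    greet("again")  # the print call inside the decorator is intercepted
fake_print.assert_called_once_with("calling", "greet")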

Pragmatic problem solving: Lua, PHP, Perl

Popular programming languages don't exist in isolation - they exist as part of larger ecosystems of redistributors (both commercial and community focused), end users, framework developers, tool developers, educators and more.

Lua is a popular programming language for embedding in larger applications as a scripting engine. Significant examples include its use as the add-on scripting language for the World of Warcraft game client, and its embedding in the RPM utility used by many Linux distributions. Compared to CPython, a Lua runtime will generally be a tenth of the size, and its weaker introspection capabilities generally make it easier to isolate from the rest of the application and the host operating system. A notable contribution from the Lua community to the Python ecosystem is the adoption of the LuaJIT FFI (Foreign Function Interface) as the basis of the JIT-friendly cffi interface library for CPython and PyPy.

PHP is another popular programming language that rose to prominence as the original "P" in the Linux-Apache-MySQL-PHP LAMP stack, due to its focus on producing HTML pages, and its broad availability on early Virtual Private Server hosting providers. For all the handwringing about conceptual flaws in various aspects of its design, it's now the basis of several widely popular open source web services, including the Drupal content management system, the Wordpress blogging engine, and the MediaWiki engine that powers Wikipedia. PHP also powers important services like the Ushahidi platform for crowdsourced community reporting on distributed events.

Like PHP, Perl rose to popularity on the back of Linux. Unlike PHP, which grew specifically as a web development platform, Perl rose to prominence as a system administrator's tool, using regular expressions to string together and manipulate the output of text-based Linux operating system commands. When sh, awk and sed were no longer up to handling a task, Perl was there to take over.

Learning one of these languages isn't likely to provide any great insight into aesthetically beautiful or conceptually elegant programming language design. What it is likely to do is provide some insight into how programming language distribution and adoption works in practice, and how much that has to do with fortuitous opportunities, accidents of history, and lowering barriers to adoption by working with redistributors so the language is available by default, rather than with the inherent capabilities of the languages themselves.

In particular, it may provide insight into the significance of projects like CKAN, OpenStack NFV, Blender, SciPy, OpenMDAO, PyGMO, PyCUDA, the Raspberry Pi Foundation and Python's adoption by a wide range of commercial organisations, for securing ongoing institutional investment in the Python ecosystem.

TCP echo client and server in Python 3.5

This is a follow-on from my previous post on Python 3.5's new async/await syntax. Rather than the simple background timers used in the original post, this one will look at the impact native coroutine support has on the TCP echo client and server examples from the asyncio documentation.

First, we'll recreate the run_in_foreground helper defined in the previous post. This helper function makes it easier to work with coroutines from otherwise synchronous code (like the interactive prompt):

import asyncio

def run_in_foreground(task, *, loop=None):
    """Runs event loop in current thread until the given task completes

    Returns the result of the task.
    For more complex conditions, combine with asyncio.wait()
    To include a timeout, combine with asyncio.wait_for()
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    return loop.run_until_complete(asyncio.ensure_future(task, loop=loop))

Next we'll define the coroutine for our TCP echo server implementation, which simply waits to receive up to 100 bytes on each new client connection, and then sends that data back to the client:

async def handle_tcp_echo(reader, writer):
    data = await reader.read(100)
    message = data.decode()
    addr = writer.get_extra_info('peername')
    print("-> Server received %r from %r" % (message, addr))
    print("<- Server sending: %r" % message)
    writer.write(data)
    await writer.drain()
    print("-- Terminating connection on server")
    writer.close()

And then the client coroutine we'll use to send a message and wait for a response:

async def tcp_echo_client(message, port, loop=None):
    reader, writer = await asyncio.open_connection('127.0.0.1', port,
                                                        loop=loop)
    print('-> Client sending: %r' % message)
    writer.write(message.encode())
    data = (await reader.read(100)).decode()
    print('<- Client received: %r' % data)
    print('-- Terminating connection on client')
    writer.close()
    return data

We then use our run_in_foreground helper to interact with these coroutines from the interactive prompt. First, we start the echo server:

>>> make_server = asyncio.start_server(handle_tcp_echo, '127.0.0.1')
>>> server = run_in_foreground(make_server)

Conveniently, since this is a coroutine running in the current thread, rather than in a different thread, we can retrieve the details of the listening socket immediately, including the automatically assigned port number:

>>> server.sockets[0]
<socket.socket fd=6, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 40796)>
>>> port = server.sockets[0].getsockname()[1]

Since we haven't needed to hardcode the port number, if we want to define a second server, we can easily do that as well:

>>> make_server2 = asyncio.start_server(handle_tcp_echo, '127.0.0.1')
>>> server2 = run_in_foreground(make_server2)
>>> server2.sockets[0]
<socket.socket fd=7, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 41200)>
>>> port2 = server2.sockets[0].getsockname()[1]

Now, both of these servers are configured to run directly in the main thread's event loop, so trying to talk to them using a synchronous client wouldn't work. The client would block the main thread, and the servers wouldn't be able to process incoming connections. That's where our asynchronous client coroutine comes in: if we use that to send messages to the server, then it doesn't block the main thread either, and both the client and server coroutines can process incoming events of interest. That gives the following results:

>>> print(run_in_foreground(tcp_echo_client('Hello World!', port)))
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44386)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
Hello World!

Note something important here: you will get exactly that sequence of output messages, as this is all running in the interpreter's main thread, in a deterministic order. If the servers were running in their own threads, we wouldn't have that property (and reliably getting access to the port numbers the server components were assigned by the underlying operating system would also have been far more difficult).

And to demonstrate both servers are up and running:

>>> print(run_in_foreground(tcp_echo_client('Hello World!', port2)))
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44419)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
Hello World!

That then raises an interesting question: how would we send messages to the two servers in parallel, while still only using a single thread to manage the client and server coroutines? For that, we'll need another of our helper functions from the previous post, schedule_coroutine:

def schedule_coroutine(target, *, loop=None):
    """Schedules target coroutine in the given event loop

    If not given, *loop* defaults to the current thread's event loop

    Returns the scheduled task.
    """
    if asyncio.iscoroutine(target):
        return asyncio.ensure_future(target, loop=loop)
    raise TypeError("target must be a coroutine, "
                    "not {!r}".format(type(target)))

Update: As with the previous post, this post originally suggested a combined "run_in_background" helper function that handled both scheduling coroutines and calling arbitrary callables in a background thread or process. On further reflection, I decided that was unhelpfully conflating two different concepts, so I replaced it with separate "schedule_coroutine" and "call_in_background" helpers.

First, we set up the two client operations we want to run in parallel:

>>> echo1 = schedule_coroutine(tcp_echo_client('Hello World!', port))
>>> echo2 = schedule_coroutine(tcp_echo_client('Hello World!', port2))

Then we use the asyncio.wait function in combination with run_in_foreground to run the event loop until both operations are complete:

>>> run_in_foreground(asyncio.wait([echo1, echo2]))
-> Client sending: 'Hello World!'
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44461)
<- Server sending: 'Hello World!'
-- Terminating connection on server
-> Server received 'Hello World!' from ('127.0.0.1', 44462)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
<- Client received: 'Hello World!'
-- Terminating connection on client
({<Task finished coro=<tcp_echo_client() done, defined at <stdin>:1> result='Hello World!'>, <Task finished coro=<tcp_echo_client() done, defined at <stdin>:1> result='Hello World!'>}, set())

And finally, we retrieve our results using the result method of the task objects returned by schedule_coroutine:

>>> echo1.result()
'Hello World!'
>>> echo2.result()
'Hello World!'

We can set up as many concurrent background tasks as we like, and then use asyncio.wait as the foreground task to wait for them all to complete.
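
As a sketch of that pattern using the helpers and servers defined above (the message text is arbitrary), scheduling a batch of clients and then waiting for all of them looks like:

requests = [schedule_coroutine(tcp_echo_client('Ping %d' % i, port))
            for i in range(3)]
run_in_foreground(asyncio.wait(requests))
results = [request.result() for request in requests]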

But what if we had an existing blocking client function that we wanted or needed to use (e.g. we're using an asyncio server to test a synchronous client API)? To handle that case, we use our third helper function from the previous post:

def call_in_background(target, *, loop=None, executor=None):
    """Schedules and starts target callable as a background task

    If not given, *loop* defaults to the current thread's event loop
    If not given, *executor* defaults to the loop's default executor

    Returns the scheduled task.
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    if callable(target):
        return loop.run_in_executor(executor, target)
    raise TypeError("target must be a callable, "
                    "not {!r}".format(type(target)))

To explore this, we'll need a blocking client, which we can build based on Python's existing socket programming HOWTO guide:

import socket
def tcp_echo_client_sync(message, port):
    conn = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print('-> Client connecting to port: %r' % port)
    conn.connect(('127.0.0.1', port))
    print('-> Client sending: %r' % message)
    conn.send(message.encode())
    data = conn.recv(100).decode()
    print('<- Client received: %r' % data)
    print('-- Terminating connection on client')
    conn.close()
    return data

We can then use functools.partial in combination with call_in_background to start client requests in multiple operating system level threads:

>>> from functools import partial
>>> query_server = partial(tcp_echo_client_sync, "Hello World!", port)
>>> query_server2 = partial(tcp_echo_client_sync, "Hello World!", port2)
>>> bg_call = call_in_background(query_server)
-> Client connecting to port: 35876
-> Client sending: 'Hello World!'
>>> bg_call2 = call_in_background(query_server2)
-> Client connecting to port: 41672
-> Client sending: 'Hello World!'

Here we see that, unlike our coroutine clients, the synchronous clients have started running immediately in a separate thread. However, because the event loop isn't currently running in the main thread, they've blocked waiting for a response from the TCP echo servers. As with the coroutine clients, we address that by running the event loop in the main thread until our clients have both received responses:

>>> run_in_foreground(asyncio.wait([bg_call, bg_call2]))
-> Server received 'Hello World!' from ('127.0.0.1', 52585)
<- Server sending: 'Hello World!'
-- Terminating connection on server
-> Server received 'Hello World!' from ('127.0.0.1', 34399)
<- Server sending: 'Hello World!'
<- Client received: 'Hello World!'
-- Terminating connection on server
-- Terminating connection on client
<- Client received: 'Hello World!'
-- Terminating connection on client
({<Future finished result='Hello World!'>, <Future finished result='Hello World!'>}, set())
>>> bg_call.result()
'Hello World!'
>>> bg_call2.result()
'Hello World!'

Background tasks in Python 3.5

One of the recurring questions with asyncio is "How do I execute one or two operations asynchronously in an otherwise synchronous application?"

Say, for example, I have the following code:

>>> import itertools, time
>>> def ticker():
...     for i in itertools.count():
...         print(i)
...         time.sleep(1)
...
>>> ticker()
0
1
2
3
^CTraceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "<stdin>", line 4, in ticker
KeyboardInterrupt

With the native coroutine syntax coming in Python 3.5, I can change that synchronous code into event-driven asynchronous code easily enough:

import asyncio, itertools
async def ticker():
    for i in itertools.count():
        print(i)
        await asyncio.sleep(1)

But how do I arrange for that ticker to start running in the background? What's the coroutine equivalent of appending & to a shell command?

It turns out it looks something like this:

import asyncio
def schedule_coroutine(target, *, loop=None):
    """Schedules target coroutine in the given event loop

    If not given, *loop* defaults to the current thread's event loop

    Returns the scheduled task.
    """
    if asyncio.iscoroutine(target):
        return asyncio.ensure_future(target, loop=loop)
    raise TypeError("target must be a coroutine, "
                    "not {!r}".format(type(target)))

Update: This post originally suggested a combined "run_in_background" helper function that handled both scheduling coroutines and calling arbitrary callables in a background thread or process. On further reflection, I decided that was unhelpfully conflating two different concepts, so I replaced it with separate "schedule_coroutine" and "call_in_background" helpers.

So now I can do:

>>> import itertools
>>> async def ticker():
...     for i in itertools.count():
...         print(i)
...         await asyncio.sleep(1)
...
>>> ticker1 = schedule_coroutine(ticker())
>>> ticker1
<Task pending coro=<ticker() running at <stdin>:1>>

But how do I run that for a while? The event loop won't run unless the current thread starts it, and it then keeps running until either a particular event occurs or it is explicitly stopped. Another helper function covers that:

def run_in_foreground(task, *, loop=None):
    """Runs event loop in current thread until the given task completes

    Returns the result of the task.
    For more complex conditions, combine with asyncio.wait()
    To include a timeout, combine with asyncio.wait_for()
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    return loop.run_until_complete(asyncio.ensure_future(task, loop=loop))

And then I can do:

>>> run_in_foreground(asyncio.sleep(5))
0
1
2
3
4

Here we can see the background task running while we wait for the foreground task to complete. And if I do it again with a different timeout:

>>> run_in_foreground(asyncio.sleep(3))
5
6
7

We see that the background task picked up again right where it left off the first time.

We can also single step the event loop with a zero second sleep (the ticks reflect the fact there was more than a second delay between running each command):

>>> run_in_foreground(asyncio.sleep(0))
8
>>> run_in_foreground(asyncio.sleep(0))
9

And start a second ticker to run concurrently with the first one:

>>> ticker2 = schedule_coroutine(ticker())
>>> ticker2
<Task pending coro=<ticker() running at <stdin>:1>>
>>> run_in_foreground(asyncio.sleep(0))
0
10

The asynchronous tickers will happily hang around in the background, ready to resume operation whenever I give them the opportunity. If I decide I want to stop one of them, I can cancel the corresponding task:

>>> ticker1.cancel()
True
>>> run_in_foreground(asyncio.sleep(0))
1
>>> ticker2.cancel()
True
>>> run_in_foreground(asyncio.sleep(0))

But what about our original synchronous ticker? Can I run that as a background task? It turns out I can, with the aid of another helper function:

def call_in_background(target, *, loop=None, executor=None):
    """Schedules and starts target callable as a background task

    If not given, *loop* defaults to the current thread's event loop
    If not given, *executor* defaults to the loop's default executor

    Returns the scheduled task.
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    if callable(target):
        return loop.run_in_executor(executor, target)
    raise TypeError("target must be a callable, "
                    "not {!r}".format(type(target)))

However, I haven't figured out how to reliably cancel a task running in a separate thread or process, so for demonstration purposes, we'll define a variant of the synchronous version that stops automatically after 5 ticks rather than ticking indefinitely:

import time
def tick_5_sync():
    for i in range(5):
        print(i)
        time.sleep(1)
    print("Finishing")

The key difference between scheduling a callable in a background thread and scheduling a coroutine in the current thread is that the callable will start executing immediately, rather than waiting for the current thread to run the event loop:

>>> threaded_ticker = call_in_background(tick_5_sync); print("Starts immediately!")
0
Starts immediately!
>>> 1
2
3
4
Finishing

That's both a strength (as you can run multiple blocking IO operations in parallel) and a significant weakness - one of the benefits of explicit coroutines is their predictability, as you know none of them will start doing anything until you start running the event loop.

Inaugural PyCon Australia Education Miniconf

PyCon Australia launched its Call for Papers just over a month ago, and it closes in a little over a week on Friday the 8th of May.

A new addition to PyCon Australia this year, and one I'm particularly excited about co-organising following Dr James Curran's "Python for Every Child in Australia" keynote last year, is the inaugural Python in Education miniconf as a 4th specialist track on the Friday of the conference, before we move into the main program over the weekend.

From the CFP announcement: "The Python in Education Miniconf aims to bring together community workshop organisers, professional Python instructors and professional educators across primary, secondary and tertiary levels to share their experiences and requirements, and identify areas of potential collaboration with each other and also with the broader Python community."

If that sounds like you, then I'd love to invite you to head over to the conference website and make your submission to the Call for Papers!

This year, all 4 miniconfs (Education, Science & Data Analysis, OpenStack and DjangoCon AU) are running our calls for proposals as part of the main conference CFP - every proposal submitted will be considered for both the main conference and the miniconfs.

I'm also pleased to announce two pre-arranged sessions at the Education Miniconf:

I'm genuinely looking forward to chairing this event, as I see tremendous potential in forging stronger connections between Australian educators (both formal and informal) and the broader Python and open source communities.

Accessing TrueCrypt Encrypted Files on Fedora 22

I recently got a new ultrabook (a HP Spectre 360), which means I finally have enough space to transfer my music files back to the laptop from the external drive where they've been stored for the past few years. There really wasn't enough space for them on my previous laptop (a first generation ASUS Zenbook), but even with the Windows partition still around, the extra storage space on the new device leaves plenty of room for my music collection.

Just one small problem: the bulk of the storage on that drive was in a TrueCrypt encrypted file, and the Dolphin file browser in KDE doesn't support mounting those as volumes through the GUI (at least, it doesn't as far as I could see).

So, off to the command line we go. While TrueCrypt itself isn't readily available for Fedora due to problems with its licensing terms, the standard cryptsetup utility supports accessing existing TrueCrypt volumes, and the tcplay package also supports creation of new volumes.

In my case, I just wanted to read the music files, so it turns out that cryptsetup was all I needed, but I didn't figure that out until after I'd already installed tcplay as well.

For both cryptsetup and tcplay, one of the things you need to set up in order to access a TrueCrypt encrypted file (as opposed to a fully encrypted volume) is a loopback device - these let you map a filesystem block device back to a file living on another filesystem. The examples in the tcplay manual page (man tcplay) indicated the command I needed to set that up was losetup.

However, the losetup instructions gave me trouble, as they appeared to be telling me I didn't have any loopback devices:

[ncoghlan@thechalk ~]$ losetup -f
losetup: cannot find an unused loop device: No such file or directory

Searching on Google for "fedora create a loop device" brought me to this Unix & Linux Stack Exchange question as the first result, but the answer there struck me as being far too low level to be reasonable as a prerequisite for accessing encrypted files as volumes.

So I scanned further down through the list of search results, with this Fedora bug report about difficulty accessing TrueCrypt volumes catching my eye. As with the Stack Exchange answer, most of the comments there seemed to be about reverting the effect of a change to Fedora's default behaviour - a change which meant that Fedora no longer came with any loop devices preconfigured.

However, looking more closely at Kay's original request to trim back the list of default devices revealed an interesting statement: "Loop devices can and should be created on-demand, and only when needed, losetup has been updated since Fedora 17 to do that just fine."

That didn't match my own experience with the losetup command, so I wondered what might be going on to explain the discrepancy, which is when it occurred to me that running losetup with root access might solve the problem. Generally speaking, ordinary users aren't going to have the permissions needed to create new devices, and I'd been running the losetup command using my normal user permissions rather than running it as root. That was a fairly straightforward theory to test, and sure enough, that worked:

[ncoghlan@thechalk ~]$ sudo losetup -f
/dev/loop0

Armed with my new loop device, I was then able to open the TrueCrypt encrypted file on the external GoFlex drive as a decrypted volume:

[ncoghlan@thechalk ~]$ sudo cryptsetup open --type tcrypt /dev/loop0 flexdecrypted

Actually supplying the password to decrypt the volume wasn't a problem, as I use a password manager to limit the number of passwords I need to actually remember, while still being able to use strong passwords for different services and devices.
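
For future reference, the overall sequence ends up looking something like the following sketch (the container path and mount point here are placeholders rather than the real ones I used):

[ncoghlan@thechalk ~]$ sudo losetup /dev/loop0 /path/to/container.tc
[ncoghlan@thechalk ~]$ sudo cryptsetup open --type tcrypt /dev/loop0 flexdecrypted
[ncoghlan@thechalk ~]$ sudo mkdir -p /mnt/flexdecrypted
[ncoghlan@thechalk ~]$ sudo mount /dev/mapper/flexdecrypted /mnt/flexdecrypted

With the decrypted volume mounted, the files can be browsed and copied like any other filesystem.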

However, even with my music files in the process of copying over to my laptop, this all still seemed a bit cryptic to me, even for the Linux command line. It would have saved me a lot of time if I'd been nudged in the direction of "sudo losetup -f" much sooner, rather than having to decide to ignore some bad advice I found on the internet and instead figure out a better answer by way of the Fedora issue tracker.

So I took four additional steps:

  • First, I filed a new issue against losetup, suggesting that it nudge the user in the direction of running it with root privileges if they first run it as a normal user and don't find any devices
  • Secondly, I followed up on the previous issue I had found in order to explain my findings
  • Thirdly, I added a new answer to the Stack Exchange question I had found, suggesting the use of the higher level losetup command over the lower level mknod command
  • Finally, I wrote this post recounting the tale of figuring this out from a combination of local system manual pages and online searches

Adding a right-click option to Dolphin to be able to automatically mount TrueCrypt encrypted files as volumes and open them would be an even nicer solution, but also a whole lot more work. The only actual change suggested in my above set of additional steps is tweaking a particular error message in one particular situation, which should be far more attainable than a new Dolphin feature or addon.

Stop Supporting Python 2.6 (For Free)

(Note: I'm speaking with my "CPython core developer" hat on in this article, rather than my "Red Hat employee" one, although it's the latter role that gave me broad visibility into the Fedora/RHEL/CentOS Python ecosystem)

Alex Gaynor recently raised some significant concerns in relation to his perception that Red Hat expects the upstream community to support our long term support releases for as long as we do, only without getting paid for it.

That's not true, so I'm going to say it explicitly: if you're currently supporting Python 2.6 for free because folks using RHEL 6 or one of its derivatives say they need it, and this is proving to be a hassle for you, then stop. If they complain, then point them at this post, as providing an easily linkable reference for that purpose is one of the main reasons I'm writing it. If they still don't like it, then you may want to further suggest that they come argue with me about it, and leave you alone.

The affected users have more options than they may realise, and upstream open source developers shouldn't feel obliged to donate their own time to help end users cope with organisations that aren't yet able to upgrade their internal infrastructure in a more timely fashion.

Red Hat Supported Python Upgrade Paths

Since September 2013, Red Hat Enterprise Linux subscriptions have included access to an additional component called Red Hat Software Collections. You can think of Software Collections roughly as "virtualenv for the system package manager", providing access to newer language runtimes (including Python 2.7 and 3.3), database runtimes, and web server runtimes, all without interfering with the versions of those integrated with (and used by) the operating system layer itself.

This model (and the fact they're included with the base Red Hat Enterprise Linux subscription) means that Red Hat subscribers are able to install and use these newer runtimes without needing to upgrade the entire operating system.
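
As an illustrative sketch of what that looks like in practice (assuming the python27 collection name used in the initial Software Collections releases), the newer runtime is installed alongside the system Python and explicitly enabled per command or per shell session:

$ sudo yum install python27              # install the Python 2.7 collection
$ scl enable python27 'python --version' # run a single command against it
$ scl enable python27 bash               # or start a subshell with it enabled

The system Python used by the operating system tooling remains untouched either way.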

Since June 2014, Red Hat Enterprise Linux 7 has also been available, including an upgrade of the system Python to Python 2.7. The latest release of that is Red Hat Enterprise Linux 7.1.

As Red Hat subscriptions all include free upgrades to new releases, the main barrier to deployment of these newer Python releases is institutional inertia. While it's entirely admirable that many upstream developers are generous enough to help their end users work around this inertia, in the long run doing so is detrimental for everyone concerned, as long term sustaining engineering for old releases is genuinely demotivating for upstream developers (it's a good job, but a lousy way to spend your free time), and for end users, working around institutional inertia this way reduces the pressure to actually get the situation addressed properly.

Beyond Red Hat Enterprise Linux, Red Hat's portfolio also includes both the OpenShift Platform-as-a-Service offering and the FeedHenry Mobile Application Platform. For many organisations looking to adopt an iterative approach to web service development, those are going to be a better fit than deploying directly to Red Hat Enterprise Linux and building a custom web service management system around that.

Third Party Supported Python Upgrade Paths

The current Red Hat supported upgrade paths require administrative access to a system. While it's aimed primarily at scientific users, the comprehensive Anaconda Python distribution from Continuum Analytics is a good way to obtain prebuilt versions of Python for Red Hat Enterprise Linux that can be installed by individual users without administrative access.

At Strata 2015, Continuum's Python distribution featured not only in a combined announcement regarding on-premise deployment of Anaconda Cluster together with Red Hat Storage, but also in Microsoft's announcement of Python support in the Azure Machine Learning service.

For users that don't need a full scientific Python distribution, Continuum Analytics also offer miniconda which just provides a Python runtime and the conda package manager, providing end users with a cross-platform way to obtain and manage multiple Python runtimes without needing administrative access to their systems.
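
As a quick sketch of that workflow (the environment name is arbitrary), once miniconda is installed under a user's home directory, creating and activating an additional Python runtime doesn't require administrative access at all:

$ conda create -n py27 python=2.7   # fetch and install a Python 2.7 environment
$ source activate py27              # put that environment first on PATH
(py27)$ python --version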

Community Supported Python Upgrade Paths

The question of providing upgrade paths for folks without an active Red Hat subscription has also been taken into consideration.

In January 2014 Red Hat became an official sponsor of the long established CentOS project, with the aim of providing a stable base for community open source innovation above the operating system layer (this aim contrasts with the aims of the Fedora project, which is intended primarily to drive innovation within the operating system layer itself).

CentOS 7 was originally released in July 2014, and the latest release, CentOS 7 (1503), was just published (as the name suggests) in March 2015.

For CentOS and other RHEL derivatives, the upstream project for Red Hat Software Collections is hosted at softwarecollections.org, making these collections available to the whole Fedora/RHEL/CentOS ecosystem, rather than only being available to Red Hat subscribers.

Folks running Fedora Rawhide that are particularly keen to be on the cutting edge of Python can even obtain prerelease Python 3.5 nightly builds as a software collection from Miro Hrončok's Fedora COPR repository.

In addition to maintaining Fedora itself, the Fedora community also maintains the Extra Packages for Enterprise Linux (EPEL) repositories, providing ready access to non-conflicting packages beyond the set included in the base Red Hat Enterprise Linux and CentOS releases.

Providing Commercial Red Hat Enterprise Linux Support

If projects are regularly receiving requests for support on Red Hat Enterprise Linux and derived platforms, and the developers involved are actively looking to build a sustainable business around their software, then a steady stream of these requests may represent an opportunity worth exploring. After all, Red Hat's subscribers all appreciate the value that deploying commercially supported open source software can bring to an organisation, and are necessarily familiar with the use of software subscriptions to support sustaining engineering and the ongoing development of new features for the open source software that they deploy.

One of the key things that many customers are looking for is pre-release integration testing to provide some level of assurance that the software they deploy will work in their environment, while another is a secure development and distribution pipeline that ensures that the software they install is coming from organisations that they trust.

One of Red Hat's essential tools for managing this distributed integration testing and content assurance effort is its Partner Program. This program is designed to not only assist Red Hat subscribers in finding supported software that meets their needs, but also to provide Red Hat, partners, and customers with confidence that the components of deployed solutions will work well together in target deployment environments.

Specifically for web service developers that would like to provide a supported on-premise offering, last year's announcement of a Container Certification Program (with Red Hat Enterprise Linux 7 and the OpenShift Platform-as-a-Service offering as certified container hosts) extended Red Hat's certification programs to cover the certification of Docker containers in addition to other forms of Linux application deployment.

Even more recently, the Red Hat Container Development Kit was introduced to help streamline that certification process for Red Hat Independent Software Vendor Partners.

But these are all things that folks should only explore if they're specifically interested in building a commercial support business around their software. If users are trying to get long term maintenance support for a community project for free, then upstream developers should be sending a single unified message in response: don't assume you'll be able to run new versions of open source software on old platforms unless you're specifically paying someone to ensure that happens.

I'm not saying this because I work for a platform vendor that gets paid (at least in part) to do this, I'm saying it because most open source projects are maintained by innovators that are upgrading their technology stacks regularly, where versions of components are already old after 2 years and truly ancient after 5. Expecting open source innovators to provide long term maintenance for free is simply unreasonable - regularly upgrading their own stacks means they don't need this long term platform support for themselves, so the folks that are seeking it should be expected to pay for it to happen. (Apparent exceptions like CentOS aren't exceptions at all: sustaining engineering on CentOS is instead a beneficial community byproduct of the paid sustaining engineering that goes into Red Hat Enterprise Linux).

Abusing Contributors is not OK

As reported in Ars Technica, the ongoing efforts to promote diversity in open source communities came up once more during the plenary Q&A session with Linus Torvalds, Andrew Tridgell, Bdale Garbee and Rusty Russell.

I was there for that session, and found that Linus's response appeared to betray a fundamental misunderstanding of the motives of many of the folks pushing for increased diversity in the open source community, as well as a lack of awareness of the terrible situations that can arise when leaders in a community regularly demonstrate abusive behaviour without suffering any significant consequences (just ask folks like Kathy Sierra, Zoe Quinn, Anita Sarkeesian and Brianna Wu that have been subjected to sustained campaigns of harassment largely for being women that dared to have and express an opinion on the internet).

As the coordinator of the Python Software Foundation's contribution to the linux.conf.au 2015 financial assistance program, and as someone with a deep personal interest in the overall success of the open source community, I feel it is important for me to state explicitly that I consider Linus's level of ignorance around appropriate standards of community conduct to be unacceptable in an open source community leader in 2015.

Linus's defence of his abusive behaviour is that he's "not nice", and "doesn't care about you". He does care deeply about his project, though, and claims to be motivated primarily by wanting that to continue to be successful.

To be completely honest, the momentum behind the Linux juggernaut is now large enough that Linus could likely decide to chuck it all in and spend the rest of his life on a beach sipping cocktails without worrying about open source politics, and people would figure out a way to ensure that Linux continued to thrive and grow without him. Many a successful start-up has made that transition when the founders leave, and there's no good reason to believe an open source community would be fundamentally different in that regard. The transition to a new leadership structure might be a little messy, but the community would almost certainly figure it out.

However, there's still a lot of scope for Linus to influence how fast Linux grows, and on that front his words and actions suggest that he considers being careless in his speech, without regard for the collateral damage his verbal broadsides may be doing to his cause, to be more important than having the most positive impact he is capable of having on the future growth of the Linux kernel development project and the open source community at large.

It's not (necessarily) about being nice

It may surprise some folks to learn that I don't consider myself a nice human either. My temper is formidable (I just keep it under control most of the time, a task online communication makes easier by providing the ability to walk away from the computer for a while), and any feelings of compassion I have for others are more a product of years of deliberate practice and hanging around with compassionate people than they are any particularly strong innate knack for empathy.

I'm pretty sure that genuinely nice people do exist, and I assume that one of their key motives for creating open, welcoming, inclusive communities is because it's fundamentally the right thing to do. The main reason I exclude myself from my assumed category of "nice people" is that, while I acknowledge that motivation intellectually, it's not really one that I feel viscerally.

Instead, what I do care about, passionately, is helping the best ideas win (where I include "feasible" as part of my definition of "best"). Not the "best ideas from people willing to tolerate extensive personal abuse". The best ideas anyone is willing to share with me, period. And I won't hear those ideas unless I help create environments where all participants are willing to speak up, not just those that are prepared to accept a blistering verbal barrage from a powerful authority figure as a possible consequence of attempting to participate. Those are upsetting enough when they come from random strangers on the internet, when they come from someone with enormous influence not only over you and your future career, but also your entire industry, they can be devastating.

The second order consequences

So there you have it, my not-nice reason for advocating for more welcoming and inclusive open source communities: because, from an engineering standpoint, I consider "has a high level of tolerance for receiving personal abuse from community leaders" to be an extraordinarily stupid filter to apply to your pool of potential contributors.

Exhibiting abusive behaviour as a leader has additional consequences though, and they can be even more problematic: by regularly demonstrating abusive behaviour yourself, you end up normalising harassment within your community in general, both in public and in private.

I believe Linus when he says he doesn't care about who people are or where they're from, only their contributions. I'm the same way - until I've known them for a while, I tend to care about contributors and potential contributors wholesale (i.e. happy people that enjoy the environment they're participating in tend to spend more time engaged, learn faster, produce superior contributions, and more quickly reach the point of being able to contribute independently), rather than retail (i.e. I care about my friends because they're my friends, regardless of context).

But when you're personally abusive as a leader, you also have to take a high level of responsibility for all the folks that look up to you as a role model, and act out the same behaviours you exhibit in public. When you reach this point, the preconditions for participation in your community now include:

  • Willing to tolerate public personal abuse from project leaders
  • Willing to tolerate public personal abuse from the community at large
  • Willing to tolerate personal abuse in private

With clauses like that as part of the definition, the claim of "meritocracy" starts to feel a little shaky, doesn't it? Meritocracy is a fine ideal to strive for, but claiming to have achieved it when you're imposing irrelevant requirements like this is arrogant nonsense.

We're not done yet, though, as this culture of abuse then combines with elitism based on previously acquired knowledge to make it normal to abuse newcomers for still being in the process of learning. I find it hard to conceive of a more effective approach to keeping people from adopting something you're giving away for free than tolerating a community that publicly abuses people for not magically already knowing how to use technology that they may have never even heard of before.

As a result of this perspective, the only time I'll endeavour to eject anyone from a community where I have significant influence is when they're actively creating an unpleasant environment for other participants, and demonstrate no remorse whatsoever regarding the negative impact their actions are having on the overall collaborative effort. I count myself incredibly fortunate to have only had to do this a few times in my life so far, but it's something I believe in strongly enough for it to have been the basis for once deciding to resign from a position paying a six-figure salary at a company I otherwise loved working for. To that company's credit, the abusive leader was let go not long afterwards, but the whole secretive corporate system is rigged such that these toxic "leaders" can usually quickly find new positions elsewhere, and hence new subordinates to make miserable. The fact that I'm not willing to name names here for fear of the professional consequences is just one more example of how the system is largely set up to protect abusive leaders rather than holding them to account for the impact their actions have on others.

Ideas and code are still fair game

One of the spurious fears raised against the apparently radical notion of refusing to tolerate personal abuse in a collaborative environment is that adopting civil communication practices somehow means that bad code must then be accepted into the project.

Eliminating personal abuse doesn't mean eliminating rigorous critique of code and ideas. It just means making sure that you are critiquing the code and the ideas, rather than tearing down the person contributing them. It's the difference between "This code isn't any good, here are the problems with it, I'm confident you can do better on your next attempt" (last part optional but usually beneficial when it comes to growing your contributor community) and "This code is terrible, how dare you befoul my presence with it, begone from my sight, worm!".

The latter response may be a funny joke if done in private between close friends, but when it's done in public, in front of a large number of onlookers who don't know either the sender or the recipient personally, it sets an astoundingly bad example as to what a mutually beneficial collaborative peer relationship should look like.

And if you don't have the self-discipline needed to cope with the changing context of our online interactions in the open source community? Well, perhaps you don't yet have the temperament needed to be an open source leader on an internet that is no longer the sole preserve of those of us that are more interested in computers than we are in people. Most of the technical and business press have yet to figure out that they can actually do a bit of investigative journalism to see how well vendor rhetoric aligns with upstream open source engineering activity (frequency of publication is still a far more important performance metric for most journalists than questioning the spin served up in corporate press releases), so the number of folks peering into the open source development fishbowl is only going to grow over time.

It isn't that hard to learn the necessary self-control, though. It's mostly just a matter of taking the time to read each email or code review comment, look for the parts that are about the contributor rather than the code or the design, and remove them before hitting send. And if that means there's nothing left? Then what you were about to send was pure noise, adding nothing useful to the conversation, and hence best left unsent. Doing anything less than this as a community leader is pure self-indulgence, putting your own unwillingness to consider the consequences of your communications ahead of the long term interests of your project. We're only talking about software here, after all - lives aren't on the line when we're deciding how to respond to a particular contribution, so we can afford to take a few moments to review the messages we're about to send and consider how they're likely to be perceived, both by the recipient, and by everyone else observing the exchange.

With any personal abuse removed, you can be as ruthless about critiquing the code and design as you like. Learning not to take critiques of your work personally is a necessary skill to acquire if your ambition is to become a high profile open source developer - the compromises often necessary in the real world of software design and development mean that you will end up shipping things that can legitimately be described as terrible, and you're going to have to learn to be able to say "Yes, I know it's terrible, for reasons X, Y, and Z, and I decided to publish it anyway. If you don't like it, don't use it.". (I highly recommend giving talks about these areas you know are terrible - they're fun to prepare, fun to give, and it's quite entertaining seeing the horrified reactions when people realise I'm not kidding when I say all software is terrible and computers don't actually work, they just fake it fairly well through an ongoing series of horrible hacks built atop other horrible hacks. I'm not surprised the Internet breaks sometimes - given the decades of accumulated legacy hardware and software we're building on and working around, it's thoroughly astonishing that anything technology related ever works at all)

But no matter how harsh your technical critiques get, never forget that there's at least one other human on the far end of that code review or email thread. Even if you don't personally care about them, do you really think it's a good idea to go through life providing large numbers of people with public evidence of why you are a thoroughly unpleasant person to be forced to deal with? As a project leader, do you really think you're going to attract the best and brightest people, who are often free to spend their time and energy however they like, if you set up a sign saying "You must be willing to tolerate extensive personal abuse in order to participate here"?

What can we do about it?

First, and foremost, for those of us that are paid open source community leaders, we can recognise that understanding and motivating our contributors and potential contributors in order to grow our communities is part of our job. If we don't like that, if we'd prefer to be able to "just focus on the code", to the degree where we're not willing to learn how to moderate our own behaviour in accordance with our level of responsibility, then we need to figure out how to reorganise things such that there's someone with better people management and communication skills in a position to act as a buffer between us and our respective communities.

If we instead decide we need to better educate ourselves, then there are plenty of resources available for us to do so. For folks just beginning to explore questions of systemic bias and defaulting to exclusivity, gender-based bias is a good one to start with, by perusing resources like the Feminism 101 section on the Geek Feminism wiki, or (if we have the opportunity) attending an Ada Initiative Ally Skills workshop.

And if we do acknowledge the importance of this work, then we can use our influence to help it continue, whether that's by sponsoring educational workshops, supporting financial assistance programs, ensuring suitable codes of conduct are in place for our events and online communities, supporting programs like the GNOME Outreach Program for Women, or organisations like the Ada Initiative, and so on, and so forth.

For those of us that aren't community leaders, one of the most effective things we can do is vote with our feet: at last count, there are over a million open source projects in existence, many of which are run in such a way that participating in them is almost always a sheer pleasure, and if no existing community grabs your interest, you always have the option of starting your own.

Personal enjoyment is only one reason for participating in open source though, and professional obligations or personal needs may bring us into contact with project leaders and contributors that currently consider personal abuse to be an acceptable way of interacting with their peers in a collaborative context. If leaving isn't a viable option, then what can we do?

Firstly, the options I suggest above for community leaders are actually good options for any participants in the open source community that view the overall growth and success of the free and open source software ethos as being more important than any one individual's personal pride or reluctance to educate themselves about issues that don't affect them personally.

Secondly, we can hold our leaders to account. When community leaders give presentations at community events, especially when presenting on community management topics, feel free to ask the following questions (or variations on these themes):

  • Are we as community leaders aware of the impact current and historical structural inequalities have on the demographics of our community?
  • What have we done recently as individuals to improve our understanding of these issues and their consequences?
  • What have we done recently as community leaders to attempt to counter the systemic biases adversely affecting the demographics of our communities?

These are questions that open source community leaders should be able to answer. When we can't, I guess the silver lining is that it means we have plenty of scope to get better at what we do. For members of vulnerable groups, an inability for leaders to answer these questions is also a strong sign as to which communities may not yet be able to provide safe spaces for you to participate without experiencing harassment over your identity rather than being critiqued solely based on the quality of your work.

If you ask these questions, you will get people complaining about bringing politics into a technical environment. The folks complaining are simply wrong, as the single most important factor driving the quality of our technology is the quality of our thinking. Assuming we have attained meritocracy (aka "the most effective collaborative environment possible") is sheer foolishness, when a wide array of systemic biases remain in place that work to reduce the depth, breadth, and hence quality, of our collective thinking.

Update 22 Jan, 2015: Minor typo and grammar fixes

Update 19 Apr, 2016: This article is also available in Russian, thanks to Everycloud technologies

Update 1 Dec, 2016: This article is also available in Estonian, thanks to Arija Liepkalnietis