The origin of venvstacks

Alyssa Coghlan

2024-11-01 04:21

There has been a longstanding gap in the Python packaging ecosystem that has somewhat annoyed me, but not enough to do anything about it: we haven't really had a good way to compose multiple layers of Python virtual environments together, allowing large dependencies (like AI and machine learning libraries) to be shared across multiple different application environments without having to install them directly into the base runtime environment.

Utilities for collecting up an entire Python runtime, an application, and all its dependencies into a single deployable artifact have existed since before the turn of the century.

We've had standardised virtual environments (allowing multiple applications to share a base Python runtime and its directly installed third party packages) for almost as long.

We've had zip applications for a long time as well (and other utilities which build on that feature).

We've had tools like wagon which allow us to ship a bundle of prebuilt Python wheel archives and install them on a destination system without needing to download anything else from the internet at installation time.

We've had tools like conda (and more recently uv), which make intelligent use of hard links on local systems to avoid making duplicate copies of completely identical versions of packages.

We've technically had platform specific mechanisms like Linux container images, where the contents of an environment can be built up across multiple container image layers, with the lower layers being shared across multiple image definitions, but have lacked a convenient way to handle the dependency management complications involved in using these tools to share large Python libraries.

But we've never had something which specifically took full advantage of the way Python's import system works to enable robust structural decomposition of Python applications into independently updatable subcomponents (with a granularity larger than single packages).

All of this history meant that I was thoroughly intrigued when a mutual acquaintance introduced me to the creators of the LM Studio personal AI desktop application to discuss a Python packaging problem they had looming on their technical road map: it was clear from user demand and the rate of evolution in the Python AI/ML ecosystem that they needed a way to ship Python AI/ML components directly to their users without having to wait for those capabilities to be made available through native interfaces in other languages (such as Swift, C++, or JavaScript), but it didn't seem obvious to them how they could readily integrate that capability into LM Studio without making the application installation process substantially more complicated for their users.

What started as a consulting contract for a technical proof of concept, and has since turned into a permanent position with the organisation, proved fruitful, and the result is the recently published open source venvstacks utility, which is specifically designed to enable the kind of portable deterministic artifact publishing setup that LM Studio needed, including:

Base runtime layers (based on python-build-standalone)
Framework layers (for shipping large dependencies, such as Apple MLX or PyTorch)
Application layers (including additional unpackaged "launch modules" for app execution)

There are certainly still some technical limitations to be addressed (the dynamic linking problem with layering virtual environments like this is notorious amongst Python packaging experts for a reason), but even in its current form, venvstacks is already capable enough to power the recent inclusion of Apple MLX support in LM Studio.

What does "x = a + b" mean?

Alyssa Coghlan

2019-03-16 06:06

Comments

Shorthand notations and shared context

Guido van Rossum recently put together an excellent post talking about the value of infix binary operators in making certain kinds of operations easier to reason about correctly.

The context inspiring that post is a python-ideas discussion regarding the possibility of adding a shorthand spelling (x = a + b) to Python for the operation:

x = a.copy()
x.update(b)

The PEP for that proposal is still in development, so I'm not going to link to it directly [1], but the paragraph above gives the gist of the idea. Guido's article came in response to the assertion that infix operators don't improve readability, when we have plenty of empirical evidence to show that they do.

Where this article comes from is a key point that Guido's article mentions, but doesn't emphasise: that those readability benefits rely heavily on implicitly shared context between the author of an expression and the readers of that expression.

Without a previous agreement on the semantics, the only possible general answer to the question "What does x = a + b mean?" is "I need more information to answer that".

The original shared context: Algebra

If the additional information supplied is "This is an algebraic expression", then x = a + b is expressing a constraint on the permitted values of x, a, and b.

Specifying x = a - b as an additional constraint would then further allow the reader to infer that x = a and b = 0.

The corresponding Python context: Numbers

The use case for + in Python that most closely corresponds with algebra is using it with numbers - the key differences lie in the meaning of =, rather than the meaning of +.

So if the additional information supplied is "This is a Python assignment statement; a and b are both well-behaved finite numbers", then the reader will be able to infer that x will be the sum of the two numbers.

Inferring the exact numeric type of x would require yet more information about the types of a and b, as types implementing the numeric + operator are expected to participate in a type coercion protocol that gives both operands a chance to carry out the operation, and only raises TypeError if neither type understands the other.

The original algebraic meaning then gets expressed in Python as assert x == a + b, and successful execution of the assignment statement ensures that assertion will pass.

In this context, types implementing the + operator are expected to provide all the properties that would be expected of the corresponding mathematical concepts (a + b == b + a, a + (b + c) == (a + b) + c, etc), subject to the limitations of performing calculations on computers that actually exist.

Another mathematical context: Matrix algebra

If the given expression used uppercase letters, as in X = A + B, then the additional information supplied may instead be "This is a matrix algebra expression". (It's a notational convention in mathematics that matrices be assigned uppercase letters, while lowercase letters indicate scalar values)

For matrices, addition and subtraction are defined as only being valid between matrices of the same size and shape, so if X = A - B were to be supplied as an additional constraint, then the implications would be:

X, A and B are all the same size and shape
B consists entirely of zeroes
X = A

The corresponding Python context: NumPy Arrays

The numpy.ndarray type, and other types implementing the same API, bring the semantics of matrix algebra to Python programming, similar to the way that the builtin numeric types bring the semantics of scalar algebra.

This means that if the additional information supplied is "This is a Python assignment statement; A and B are both matrices of the same size and shape containing well-behaved finite numbers", then the reader will be able to infer that X will be a new matrix of the same shape and size as matrices A and B, with each element in X being the sum of the corresponding elements in A and B.

As with scalar algebra, inferring the exact numeric type of the elements of X would require more information about the types of the elements in A and B, the original algebraic meaning gets expressed in Python as assert X == A + B, successful execution of the assignment statement ensures that assertion will pass, and types implementing + in this context are expected to provide the properties that would be expected of a matrix in mathematics.

Python's string concatenation context

Mathematics doesn't provide a convenient infix notation for concatenating two strings together (aside from writing their names directly next to each other), so programming language designers are forced to choose one.

While this does vary across languages, the most common choice is the one that Python uses: the + operator.

This is formally a distinct operation from numeric addition, with different semantic expectations, and CPython's C API somewhat coincidentally ended up reflecting that distinction by offering two different ways of implementing + on a type: the tp_number->nb_add and tp_sequence->sq_concat slots. (This distinction is absent at the Python level: only __add__, __radd__ and __iadd__ are exposed, and they always populate the relevant tp_number slots in CPython)

The key semantic difference between algebraic addition and string concatenation is that in algebraic addition, the order of the operands doesn't matter (a + b == b + a), while in the string concatenation case, the order of the operands determines which items appear first in the result (e.g. "Hello" + "World" == "HelloWorld" vs "World" + "Hello" == "WorldHello"). This means that a + b == b + a being true when concatenating strings indicates that either one or both strings are empty, or else the two strings are identical.

Another less obvious semantic difference is that strings don't participate in the type coercion protocol that is defined for numbers: if the right hand operand isn't a string (or string subclass) instance, they'll raise TypeError immediately, rather than letting the other operand attempt the operation.

Python's immutable sequence concatenation context

Python goes further than merely allowing + to be used for string concatenation: it allows it to be used for arbitrary sequence concatenation.

For immutable container types like tuple, this closely parallels the way that string concatenation works: a new immutable instance of the same type is created containing references to the same items referenced by the original operands:

>>> a = 1, 2, 3
>>> b = 4, 5, 6
>>> x = a + b
>>> a
(1, 2, 3)
>>> b
(4, 5, 6)
>>> x
(1, 2, 3, 4, 5, 6)

As for strings, immutable sequences will usually only interact with other instances of the same type (or subclasses), even when the x += b notation is used as an alternative to x = x + b. For example:

>>> x = 1, 2, 3
>>> x += [4, 5, 6]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate tuple (not "list") to tuple
>>> x += 4, 5, 6
>>> x
(1, 2, 3, 4, 5, 6)

In addition to str, the tuple, and bytes types implement these concatenation semantics. range and memoryview, while otherwise implementing the Sequence API, don't support concatenation operations.

Python's mutable sequence concatenation context

Mutable sequence types add yet another variation to the possible meanings of + in Python. For the specific example of x = a + b, they're very similar to immutable sequences, creating a fresh instance that references the same items as the original operands:

>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> x = a + b
>>> a
[1, 2, 3]
>>> b
[4, 5, 6]
>>> x
[1, 2, 3, 4, 5, 6]

Where they diverge is that the x += b operation will modify the target sequence directly, rather than creating a new container:

>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> x = a; x = x + b
>>> a
[1, 2, 3]
>>> x = a; x += b
>>> a
[1, 2, 3, 4, 5, 6]

The other difference is that where + remains restrictive as to the container types it will work with, += is typically generalised to work with arbitrary iterables on the right hand side, just like the MutableMapping.extend() method:

>>> x = [1, 2, 3]
>>> x = x + (4, 5, 6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "tuple") to list
>>> x += (4, 5, 6)
>>> x
[1, 2, 3, 4, 5, 6]

Amongst the builtins, list and bytearray implement these semantics (although bytearray limits even in-place concatenation to bytes-like types that support memoryview style access). Elsewhere in the standard library, collections.deque and array.array are other mutable sequence types that behave this way.

A brief digression back to mathematics: Multisets

Multisets are a concept in mathematics that allow for values to occur in a set more than once, with the multiset then being the mapping from the values themselves to the count of how many times that value occurs in the multiset (with a count of zero or less being the same as the value being omitted from the set entirely).

While they don't natively use the x = a + b notation the way that scalar algebra and matrix algebra do, the key point regarding multisets that's relevant to this article is the fact that they do have a "Sum" operation defined, and the semantics of that operation are very similar to those used for matrix addition: element wise summation for each item in the multiset. If a particular value is only present in one of the multisets, that's handled the same way as if it were present with a count of zero.

And back to Python once more: `collections.Counter`

Since Python 2.7 and 3.1, Python has included an implementation of the mathematical multiset concept in the form of the collections.Counter class. It uses x = a + b to denote multiset summation:

>>> a = collections.Counter(maths=2, python=2)
>>> b = collections.Counter(python=4, maths=1)
>>> x = a + b
>>> x
Counter({'python': 6, 'maths': 3})

As with sequences, counter instances define their own interoperability domain, so they won't accept arbitrary mappings for a binary + operation:

>>> x = a + dict(python=4, maths=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'Counter' and 'dict'

But they're more permissive for in-place operations, accepting arbitrary mapping objects:

>>> x += dict(python=4, maths=1)
>>> x
Counter({'python': 10, 'maths': 4})

What does all this have to do with the idea of dictionary addition?

Python's dictionaries are quite interesting mathematically, as in mathematical terms, they're not actually a container. Instead, they're a function mapping between a domain defined by the set of keys, and a range defined by a multiset of values [2].

This means that the mathematical context that would most closely correspond to defining addition on dictionaries is the algebraic combination of functions. That's defined such that (f + g)(x) is equivalent to f(x) + g(x), so the only binary in-fix operator support for dictionaries that could be grounded in an existing mathematical shared context is one where d1 + d2 was shorthand for:

x = d1.copy()
for k, rhs in d2.items():
    try:
        lhs = x[k]
    except KeyError:
        x[k] = rhs
    else:
        x[k] = lhs + rhs

That has the unfortunate implication that introducing a Python-specific binary operator shorthand for dictionary copy-and-update semantics would represent a hard conceptual break with mathematics, rather than a transfer of existing mathematical concepts into the language. Contrast that with the introduction of collections.Counter (which was grounded in the semantics of mathematical multisets and borrowed its Python notation from element-wise addition on matrices), or the matrix multiplication operator (which was grounded in the semantics of matrix algebra, and only needed a text-editor-friendly symbol assigned, similar to using * instead of × for scalar multiplication and / instead of ÷ for division),

At least to me, that seems like a big leap to take for something where the in-place form already has a perfectly acceptable spelling (d1.update(d2)), and a more expression-friendly variant could be provided as a new dictionary class method:

@classmethod
def from_merge(cls, *inputs):
    self = cls()
    for input in inputs:
        self.update(input)
    return self

With that defined, then the exact equivalent of the proposed d1 + d2 would be type(d1).from_merge(d1, d2), and in practice, you would often give the desired result type explicitly rather than inferring it from the inputs (e.g. dict.from_merge(d1, d2)).

However, the PEP is still in the very first stage of the discussion and review process, so it's entirely possible that by the time it reaches python-dev it will be making a more modest proposal like a new dict class method, rather than the current proposal of operator syntax support.

Considering Python's Target Audience

Alyssa Coghlan

2017-10-09 01:33

Comments

Several years ago, I highlighted "CPython moves both too fast and too slowly" as one of the more common causes of conflict both within the python-dev mailing list, as well as between the active CPython core developers and folks that decide that participating in that process wouldn't be an effective use of their personal time and energy.

I still consider that to be the case, but it's also a point I've spent a lot of time reflecting on in the intervening years, as I wrote that original article while I was still working for Boeing Defence Australia. The following month, I left Boeing for Red Hat Asia-Pacific, and started gaining a redistributor level perspective on open source supply chain management in large enterprises.

Use cases for Python's reference interpreter

While it's a gross oversimplification, I tend to break down CPython's use cases as follows (note that these categories aren't fully distinct, they're just aimed at focusing my thinking on different factors influencing the rollout of new software features and versions):

Education: educator's main interest is in teaching ways of modelling and manipulating the world computationally, not writing or maintaining production software). Examples:
- Australia's Digital Curriculum
- Lorena A. Barba's AeroPython
Personal automation & hobby projects: software where the main, and often only, user is the individual that wrote it. Examples:
- my Digital Blasphemy image download notebook
- Paul Fenwick's (Inter)National Rick Astley Hotline
Organisational process automation: software where the main, and often only, user is the organisation it was originally written to benefit. Examples:
- CPython's core workflow tools
- Development, build & release management tooling for Linux distros
Set-and-forget infrastructure: software where, for sometimes debatable reasons, in-life upgrades to the software itself are nigh impossible, but upgrades to the underlying platform may be feasible. Examples:
- most self-managed corporate and institutional infrastructure (where properly funded sustaining engineering plans are disturbingly rare)
- grant funded software (where maintenance typically ends when the initial grant runs out)
- software with strict certification requirements (where recertification is too expensive for routine updates to be economically viable unless absolutely essential)
- Embedded software systems without auto-upgrade capabilities
Continuously upgraded infrastructure: software with a robust sustaining engineering model, where dependency and platform upgrades are considered routine, and no more concerning than any other code change. Examples:
- Facebook's Python service infrastructure
- Rolling release Linux distributions
- most public PaaS and serverless environments (Heroku, OpenShift, AWS Lambda, Google Cloud Functions, Azure Cloud Functions, etc)
Intermittently upgraded standard operating environments: environments that do carry out routine upgrades to their core components, but those upgrades occur on a cycle measured in years, rather than weeks or months. Examples:
- VFX Platform
- LTS Linux distributions
- CPython and the Python standard library
- Infrastructure management & orchestration tools (e.g. OpenStack, Ansible)
- Hardware control systems
Ephemeral software: software that tends to be used once and then discarded or ignored, rather than being subsequently upgraded in place. Examples:
- Ad hoc automation scripts
- Single-player games with a defined "end" (once you've finished them, even if you forget to uninstall them, you probably won't reinstall them on a new device)
- Single-player games with little or no persistent state (if you uninstall and reinstall them, it doesn't change much about your play experience)
- Event-specific applications (the application was tied to a specific physical event, and once the event is over, that app doesn't matter any more)
Regular use applications: software that tends to be regularly upgraded after deployment. Examples:
- Business management software
- Personal & professional productivity applications (e.g. Blender)
- Developer tools & services (e.g. Mercurial, Buildbot, Roundup)
- Multi-player games, and other games with significant persistent state, but no real defined "end"
- Embedded software systems with auto-upgrade capabilities
Shared abstraction layers: software components that are designed to make it possible to work effectively in a particular problem domain even if you don't personally grasp all the intricacies of that domain yet. Examples:
- most runtime libraries and frameworks fall into this category (e.g. Django, Flask, Pyramid, SQL Alchemy, NumPy, SciPy, requests)
- many testing and type inference tools also fit here (e.g. pytest, Hypothesis, vcrpy, behave, mypy)
- plugins for other applications (e.g. Blender plugins, OpenStack hardware adapters)
- the standard library itself represents the baseline "world according to Python" (and that's an incredibly complex world view)

Which audience does CPython primarily serve?

Ultimately, the main audiences that CPython and the standard library specifically serve are those that, for whatever reason, aren't adequately served by the combination of a more limited standard library and the installation of explicitly declared third party dependencies from PyPI.

To oversimplify the above review of different usage and deployment models even further, it's possible to summarise the single largest split in Python's user base as the one between those that are using Python as a scripting language for some environment of interest, and those that are using it as an application development language, where the eventual artifact that will be distributed is something other than the script that they're working on.

Typical developer behaviours when using Python as a scripting language include:

the main working unit consists of a single Python file (or Jupyter notebook!), rather than a directory of Python and metadata files
there's no separate build step of any kind - the script is distributed as a script, similar to the way standalone shell scripts are distributed
there's no separate install step (other than downloading the file to an appropriate location), as it is expected that the required runtime environment will be preconfigured on the destination system
no explicit dependencies stated, except perhaps a minimum Python version, or else a statement of the expected execution environment. If dependencies outside the standard library are needed, they're expected to be provided by the environment being scripted (whether that's an operating system, a data analysis platform, or an application that embeds a Python runtime)
no separate test suite, with the main test of correctness being "Did the script do what you wanted it to do with the input that you gave it?"
if testing prior to live execution is needed, it will be in the form of a "dry run" or "preview" mode that conveys to the user what the software would do if run that way
if static code analysis tools are used at all, it's via integration into the user's software development environment, rather than being set up separately for each individual script

By contrast, typical developer behaviours when using Python as an application development language include:

the main working unit consists of a directory of Python and metadata files, rather than a single Python file
these is a separate build step to prepare the application for publication, even if it's just bundling the files together into a Python sdist, wheel or zipapp archive
whether there's a separate install step to prepare the application for use will depend on how the application is packaged, and what the supported target environments are
external dependencies are expressed in a metadata file, either directly in the project directory (e.g. pyproject.toml, requirements.txt, Pipfile), or as part of the generated publication archive (e.g. setup.py, flit.ini)
a separate test suite exists, either as unit tests for the Python API, integration tests for the functional interfaces, or a combination of the two
usage of static analysis tools is configured at the project level as part of its testing regime, rather than being dependent on

As a result of that split, the main purpose that CPython and the standard library end up serving is to define the redistributor independent baseline of assumed functionality for educational and ad hoc Python scripting environments 3-5 years after the corresponding CPython feature release.

For ad hoc scripting use cases, that 3-5 year latency stems from a combination of delays in redistributors making new releases available to their users, and users of those redistributed versions taking time to revise their standard operating environments.

In the case of educational environments, educators need that kind of time to review the new features and decide whether or not to incorporate them into the courses they offer their students.

Why is this relevant to anything?

This post was largely inspired by the Twitter discussion following on from this comment of mine citing the Provisional API status defined in PEP 411 as an example of an open source project issuing a de facto invitation to users to participate more actively in the design & development process as co-creators, rather than only passively consuming already final designs.

The responses included several expressions of frustration regarding the difficulty of supporting provisional APIs in higher level libraries, without those libraries making the provisional status transitive, and hence limiting support for any related features to only the latest version of the provisional API, and not any of the earlier iterations.

My main reaction was to suggest that open source publishers should impose whatever support limitations they need to impose to make their ongoing maintenance efforts personally sustainable. That means that if supporting older iterations of provisional APIs is a pain, then they should only be supported if the project developers themselves need that, or if somebody is paying them for the inconvenience. This is similar to my view on whether or not volunteer-driven projects should support older commercial LTS Python releases for free when it's a hassle for them to do: I don't think they should, as I expect most such demands to be stemming from poorly managed institutional inertia, rather than from genuine need (and if the need is genuine, then it should instead be possible to find some means of paying to have it addressed).

However, my second reaction, was to realise that even though I've touched on this topic over the years (e.g. in the original 2011 article linked above, as well as in Python 3 Q & A answers here, here, and here, and to a lesser degree in last year's article on the Python Packaging Ecosystem), I've never really attempted to directly explain the impact it has on the standard library design process.

And without that background, some aspects of the design process, such as the introduction of provisional APIs, or the introduction of inspired-by-but-not-the-same-as, seem completely nonsensical, as they appear to be an attempt to standardise APIs without actually standardising them.

Where does PyPI fit into the picture?

The first hurdle that any proposal sent to python-ideas or python-dev has to clear is answering the question "Why isn't a module on PyPI good enough?". The vast majority of proposals fail at this step, but there are several common themes for getting past it:

rather than downloading a suitable third party library, novices may be prone to copying & pasting bad advice from the internet at large (e.g. this is why the secrets library now exists: to make it less likely people will use the random module, which is intended for games and statistical simulations, for security-sensitive purposes)
the module is intended to provide a reference implementation and to enable interoperability between otherwise competing implementations, rather than necessarily being all things to all people (e.g. asyncio, wsgiref, unittest`, and logging all fall into this category)
the module is intended for use in other parts of the standard library (e.g. enum falls into this category, as does unittest)
the module is designed to support a syntactic addition to the language (e.g. the contextlib, asyncio and typing modules fall into this category)
the module is just plain useful for ad hoc scripting purposes (e.g. pathlib, and ipaddress fall into this category)
the module is useful in an educational context (e.g. the statistics module allows for interactive exploration of statistic concepts, even if you wouldn't necessarily want to use it for full-fledged statistical analysis)

Passing this initial "Is PyPI obviously good enough?" check isn't enough to ensure that a module will be accepted for inclusion into the standard library, but it's enough to shift the question to become "Would including the proposed library result in a net improvement to the typical introductory Python software developer experience over the next few years?"

The introduction of ensurepip and venv modules into the standard library also makes it clear to redistributors that we expect Python level packaging and installation tools to be supported in addition to any platform specific distribution mechanisms.

Why are some APIs changed when adding them to the standard library?

While existing third party modules are sometimes adopted wholesale into the standard library, in other cases, what actually gets added is a redesigned and reimplemented API that draws on the user experience of the existing API, but drops or revises some details based on the additional design considerations and privileges that go with being part of the language's reference implementation.

For example, unlike its popular third party predecessor, path.py`, ``pathlib does not define string subclasses, but instead independent types. Solving the resulting interoperability challenges led to the definition of the filesystem path protocol, allowing a wider range of objects to be used with interfaces that work with filesystem paths.

The API design for the ipaddress module was adjusted to explicitly separate host interface definitions (IP addresses associated with particular IP networks) from the definitions of addresses and networks in order to serve as a better tool for teaching IP addressing concepts, whereas the original ipaddr module is less strict in the way it uses networking terminology.

In other cases, standard library modules are constructed as a synthesis of multiple existing approaches, and may also rely on syntactic features that didn't exist when the APIs for pre-existing libraries were defined. Both of these considerations apply for the asyncio and typing modules, while the latter consideration applies for the dataclasses API being considered in PEP 557 (which can be summarised as "like attrs, but using variable annotations for field declarations").

The working theory for these kinds of changes is that the existing libraries aren't going away, and their maintainers often aren't all that interested in putitng up with the constraints associated with standard library maintenance (in particular, the relatively slow release cadence). In such cases, it's fairly common for the documentation of the standard library version to feature a "See Also" link pointing to the original module, especially if the third party version offers additional features and flexibility that were omitted from the standard library module.

Why are some APIs added in provisional form?

While CPython does maintain an API deprecation policy, we generally prefer not to use it without a compelling justification (this is especially the case while other projects are attempting to maintain compatibility with Python 2.7).

However, when adding new APIs that are inspired by existing third party ones without being exact copies of them, there's a higher than usual risk that some of the design decisions may turn out to be problematic in practice.

When we consider the risk of such changes to be higher than usual, we'll mark the related APIs as provisional, indicating that conservative end users may want to avoid relying on them at all, and that developers of shared abstraction layers may want to consider imposing stricter than usual constraints on which versions of the provisional API they're prepared to support.

Why are only some standard library APIs upgraded?

The short answer here is that the main APIs that get upgraded are those where:

there isn't likely to be a lot of external churn driving additional updates
there are clear benefits for either ad hoc scripting use cases or else in encouraging future interoperability between multiple third party solutions
a credible proposal is submitted by folks interested in doing the work

If the limitations of an existing module are mainly noticeable when using the module for application development purposes (e.g. datetime), if redistributors already tend to make an improved alternative third party option readily available (e.g. requests), or if there's a genuine conflict between the release cadence of the standard library and the needs of the package in question (e.g. certifi), then the incentives to propose a change to the standard library version tend to be significantly reduced.

This is essentially the inverse to the question about PyPI above: since PyPI usually is a sufficiently good distribution mechanism for application developer experience enhancements, it makes sense for such enhancements to be distributed that way, allowing redistributors and platform providers to make their own decisions about what they want to include as part of their default offering.

Changing CPython and the standard library only comes into play when there is perceived value in changing the capabilities that can be assumed to be present by default in 3-5 years time.

Will any parts of the standard library ever be independently versioned?

Yes, it's likely the bundling model used for ensurepip (where CPython releases bundle a recent version of pip without actually making it part of the standard library) may be applied to other modules in the future.

The most probable first candidate for that treatment would be the distutils build system, as switching to such a model would allow the build system to be more readily kept consistent across multiple releases.

Other potential candidates for this kind of treatment would be the Tcl/Tk graphics bindings, and the IDLE editor, which are already unbundled and turned into an optional addon installations by a number of redistributors.

Why do these considerations matter?

By the very nature of things, the folks that tend to be most actively involved in open source development are those folks working on open source applications and shared abstraction layers.

The folks writing ad hoc scripts or designing educational exercises for their students often won't even think of themselves as software developers - they're teachers, system administrators, data analysts, quants, epidemiologists, physicists, biologists, business analysts, market researchers, animators, graphical designers, etc.

When all we have to worry about for a language is the application developer experience, then we can make a lot of simplifying assumptions around what people know, the kinds of tools they're using, the kinds of development processes they're following, and the ways they're going to be building and deploying their software.

Things get significantly more complicated when an application runtime also enjoys broad popularity as a scripting engine. Doing either job well is already difficult, and balancing the needs of both audiences as part of a single project leads to frequent incomprehension and disbelief on both sides.

This post isn't intended to claim that we never make incorrect decisions as part of the CPython development process - it's merely pointing out that the most reasonable reaction to seemingly nonsensical feature additions to the Python standard library is going to be "I'm not part of the intended target audience for that addition" rather than "I have no interest in that, so it must be a useless and pointless addition of no value to anyone, added purely to annoy me".

The Python Packaging Ecosystem

Alyssa Coghlan

2016-09-17 03:46

Comments

There have been a few recent articles reflecting on the current status of the Python packaging ecosystem from an end user perspective, so it seems worthwhile for me to write-up my perspective as one of the lead architects for that ecosystem on how I characterise the overall problem space of software publication and distribution, where I think we are at the moment, and where I'd like to see us go in the future.

For context, the specific articles I'm replying to are:

Python Packaging is Good Now (Glyph Lefkowitz)
Conda: Myths and Misconceptions (Jake VanderPlas)
Python Packaging at PayPal (Mahmoud Hashemi)

These are all excellent pieces considering the problem space from different perspectives, so if you'd like to learn more about the topics I cover here, I highly recommend reading them.

My core software ecosystem design philosophy

Since it heavily influences the way I think about packaging system design in general, it's worth stating my core design philosophy explicitly:

As a software consumer, I should be able to consume libraries, frameworks, and applications in the binary format of my choice, regardless of whether or not the relevant software publishers directly publish in that format
As a software publisher working in the Python ecosystem, I should be able to publish my software once, in a single source-based format, and have it be automatically consumable in any binary format my users care to use

This is emphatically not the way many software packaging systems work - for a great many systems, the publication format and the consumption format are tightly coupled, and the folks managing the publication format or the consumption format actively seek to use it as a lever of control over a commercial market (think operating system vendor controlled application stores, especially for mobile devices).

While we're unlikely to ever pursue the specific design documented in the rest of the PEP (hence the "Deferred" status), the "Development, Distribution, and Deployment of Python Software" section of PEP 426 provides additional details on how this philosophy applies in practice.

I'll also note that while I now work on software supply chain management tooling at Red Hat, that wasn't the case when I first started actively participating in the upstream Python packaging ecosystem design process. Back then I was working on Red Hat's main hardware integration testing system, and growing increasingly frustrated with the level of effort involved in integrating new Python level dependencies into Beaker's RPM based development and deployment model. Getting actively involved in tackling these problems on the Python upstream side of things then led to also getting more actively involved in addressing them on the Red Hat downstream side.

The key conundrum

When talking about the design of software packaging ecosystems, it's very easy to fall into the trap of only considering the "direct to peer developers" use case, where the software consumer we're attempting to reach is another developer working in the same problem domain that we are, using a similar set of development tools. Common examples of this include:

Linux distro developers publishing software for use by other contributors to the same Linux distro ecosystem
Web service developers publishing software for use by other web service developers
Data scientists publishing software for use by other data scientists

In these more constrained contexts, you can frequently get away with using a single toolchain for both publication and consumption:

Linux: just use the system package manager for the relevant distro
Web services: just use the Python Packaging Authority's twine for publication and pip for consumption
Data science: just use conda for everything

For newer languages that start in one particular domain with a preferred package manager and expand outwards from there, the apparent simplicity arising from this homogeneity of use cases may frequently be attributed as an essential property of the design of the package manager, but that perception of inherent simplicity will typically fade if the language is able to successfully expand beyond the original niche its default package manager was designed to handle.

In the case of Python, for example, distutils was designed as a consistent build interface for Linux distro package management, setuptools for plugin management in the Open Source Application Foundation's Chandler project, pip for dependency management in web service development, and conda for local language-independent environment management in data science. distutils and setuptools haven't fared especially well from a usability perspective when pushed beyond their original design parameters (hence the current efforts to make it easier to use full-fledged build systems like Scons and Meson as an alternative when publishing Python packages), while pip and conda both seem to be doing a better job of accommodating increases in their scope of application.

This history helps illustrate that where things really have the potential to get complicated (even beyond the inherent challenges of domain-specific software distribution) is when you start needing to cross domain boundaries. For example, as the lead maintainer of contextlib in the Python standard library, I'm also the maintainer of the contextlib2 backport project on PyPI. That's not a domain specific utility - folks may need it regardless of whether they're using a self-built Python runtime, a pre-built Windows or Mac OS X binary they downloaded from python.org, a pre-built binary from a Linux distribution, a CPython runtime from some other redistributor (homebrew, pyenv, Enthought Canopy, ActiveState, Continuum Analytics, AWS Lambda, Azure Machine Learning, etc), or perhaps even a different Python runtime entirely (PyPy, PyPy.js, Jython, IronPython, MicroPython, VOC, Batavia, etc).

Fortunately for me, I don't need to worry about all that complexity in the wider ecosystem when I'm specifically wearing my contextlib2 maintainer hat - I just publish an sdist and a universal wheel file to PyPI, and the rest of the ecosystem has everything it needs to take care of redistribution and end user consumption without any further input from me.

However, contextlib2 is a pure Python project that only depends on the standard library, so it's pretty much the simplest possible case from a tooling perspective (the only reason I needed to upgrade from distutils to setuptools was so I could publish my own wheel files, and the only reason I haven't switched to using the much simpler pure-Python-only flit instead of either of them is that that doesn't yet easily support publishing backwards compatible setup.py based sdists).

This means that things get significantly more complex once we start wanting to use and depend on components written in languages other than Python, so that's the broader context I'll consider next.

Platform management or plugin management?

When it comes to handling the software distribution problem in general, there are two main ways of approaching it:

design a plugin management system that doesn't concern itself with the management of the application framework that runs the plugins
design a platform component manager that not only manages the plugins themselves, but also the application frameworks that run them

This "plugin manager or platform component manager?" question shows up over and over again in software distribution architecture designs, but the case of most relevance to Python developers is in the contrasting approaches that pip and conda have adopted to handling the problem of external dependencies for Python projects:

pip is a plugin manager for Python runtimes. Once you have a Python runtime (any Python runtime), pip can help you add pieces to it. However, by design, it won't help you manage the underlying Python runtime (just as it wouldn't make any sense to try to install Mozilla Firefox as a Firefox Add-On, or Google Chrome as a Chrome Extension)
conda, by contrast, is a component manager for a cross-platform platform that provides its own Python runtimes (as well as runtimes for other languages). This means that you can get pre-integrated components, rather than having to do your own integration between plugins obtained via pip and language runtimes obtained via other means

What this means is that pip, on its own, is not in any way a direct alternative to conda. To get comparable capabilities to those offered by conda, you have to add in a mechanism for obtaining the underlying language runtimes, which means the alternatives are combinations like:

apt-get + pip
dnf + pip
yum + pip
pyenv + pip
homebrew (Mac OS X) + pip
python.org Windows installer + pip
Enthought Canopy
ActiveState's Python runtime + PyPM

This is the main reason why "just use conda" is excellent advice to any prospective Pythonista that isn't already using one of the platform component managers mentioned above: giving that answer replaces an otherwise operating system dependent or Python specific answer to the runtime management problem with a cross-platform and (at least somewhat) language neutral one.

It's an especially good answer for Windows users, as chocalatey/OneGet/Windows Package Management isn't remotely comparable to pyenv or homebrew at this point in time, other runtime managers don't work on Windows, and getting folks bootstrapped with MinGW, Cygwin or the new (still experimental) Windows Subsystem for Linux is just another hurdle to place between them and whatever goal they're learning Python for in the first place.

However, conda's pre-integration based approach to tackling the external dependency problem is also why "just use conda for everything" isn't a sufficient answer for the Python software ecosystem as a whole.

If you're working on an operating system component for Fedora, Debian, or any other distro, you actually want to be using the system provided Python runtime, and hence need to be able to readily convert your upstream Python dependencies into policy compliant system dependencies.

Similarly, if you're wanting to support folks that deploy to a preconfigured Python environment in services like AWS Lambda, Azure Cloud Functions, Heroku, OpenShift or Cloud Foundry, or that use alternative Python runtimes like PyPy or MicroPython, then you need a publication technology that doesn't tightly couple your releases to a specific version of the underlying language runtime.

As a result, pip and conda end up existing at slightly different points in the system integration pipeline:

Publishing and consuming Python software with pip is a matter of "bring your own Python runtime". This has the benefit that you can readily bring your own runtime (and manage it using whichever tools make sense for your use case), but also has the downside that you must supply your own runtime (which can sometimes prove to be a significant barrier to entry for new Python users, as well as being a pain for cross-platform environment management).
Like Linux system package managers before it, conda takes away the requirement to supply your own Python runtime by providing one for you. This is great if you don't have any particular preference as to which runtime you want to use, but if you do need to use a different runtime for some reason, you're likely to end up fighting against the tooling, rather than having it help you. (If you're tempted to answer "Just add another interpreter to the pre-integrated set!" here, keep in mind that doing so without the aid of a runtime independent plugin manager like pip acts as a multiplier on the platform level integration testing needed, which can be a significant cost even when it's automated)

Where do we go next?

In case it isn't already clear from the above, I'm largely happy with the respective niches that pip and conda are carving out for themselves as a plugin manager for Python runtimes and as a cross-platform platform focused on (but not limited to) data analysis use cases.

However, there's still plenty of scope to improve the effectiveness of the collaboration between the upstream Python Packaging Authority and downstream Python redistributors, as well as to reduce barriers to entry for participation in the ecosystem in general, so I'll go over some of the key areas I see for potential improvement.

Sustainability and the bystander effect

It's not a secret that the core PyPA infrastructure (PyPI, pip, twine, setuptools) is nowhere near as well-funded as you might expect given its criticality to the operations of some truly enormous organisations.

The biggest impact of this is that even when volunteers show up ready and willing to work, there may not be anybody in a position to effectively wrangle those volunteers, and help keep them collaborating effectively and moving in a productive direction.

To secure long term sustainability for the core Python packaging infrastructure, we're only talking amounts on the order of a few hundred thousand dollars a year - enough to cover some dedicated operations and publisher support staff for PyPI (freeing up the volunteers currently handling those tasks to help work on ecosystem improvements), as well as to fund targeted development directed at some of the other problems described below.

However, rather than being a true "tragedy of the commons", I personally chalk this situation up to a different human cognitive bias: the bystander effect.

The reason I think that is that we have so many potential sources of the necessary funding that even folks that agree there's a problem that needs to be solved are assuming that someone else will take care of it, without actually checking whether or not that assumption is entirely valid.

The primary responsibility for correcting that oversight falls squarely on the Python Software Foundation, which is why the Packaging Working Group was formed in order to investigate possible sources of additional funding, as well as to determine how any such funding can be spent most effectively.

However, a secondary responsibility also falls on customers and staff of commercial Python redistributors, as this is exactly the kind of ecosystem level risk that commercial redistributors are being paid to manage on behalf of their customers, and they're currently not handling this particular situation very well. Accordingly, anyone that's actually paying for CPython, pip, and related tools (either directly or as a component of a larger offering), and expecting them to be supported properly as a result, really needs to be asking some very pointed question of their suppliers right about now. (Here's a sample question: "We pay you X dollars a year, and the upstream Python ecosystem is one of the things we expect you to support with that revenue. How much of what we pay you goes towards maintenance of the upstream Python packaging infrastructure that we rely on every day?").

One key point to note about the current situation is that as a 501(c)(3) public interest charity, any work the PSF funds will be directed towards better fulfilling that public interest mission, and that means focusing primarily on the needs of educators and non-profit organisations, rather than those of private for-profit entities.

Commercial redistributors are thus far better positioned to properly represent their customers interests in areas where their priorities may diverge from those of the wider community (closing the "insider threat" loophole in PyPI's current security model is a particular case that comes to mind - see Making PyPI security independent of SSL/TLS).

Migrating PyPI to pypi.org

An instance of the new PyPI implementation (Warehouse) is up and running at https://pypi.org/ and connected directly to the production PyPI database, so folks can already explicitly opt-in to using it over the legacy implementation if they prefer to do so.

However, there's still a non-trivial amount of design, development and QA work needed on the new version before all existing traffic can be transparently switched over to using it.

Getting at least this step appropriately funded and a clear project management plan in place is the main current focus of the PSF's Packaging Working Group.

Making the presence of a compiler on end user systems optional

Between the wheel format and the manylinux1 usefully-distro-independent ABI definition, this is largely handled now, with conda available as an option to handle the relatively small number of cases that are still a problem for pip.

The main unsolved problem is to allow projects to properly express the constraints they place on target environments so that issues can be detected at install time or repackaging time, rather than only being detected as runtime failures. Such a feature will also greatly expand the ability to correctly generate platform level dependencies when converting Python projects to downstream package formats like those used by conda and Linux system package managers.

Bootstrapping dependency management tools on end user systems

With pip being bundled with recent versions of CPython (including CPython 2.7 maintenance releases), and pip (or a variant like upip) also being bundled with most other Python runtimes, the ecosystem bootstrapping problem has largely been addressed for new Python users.

There are still a few usability challenges to be addressed (like defaulting to per-user installations when outside a virtual environment, interoperating more effectively with platform component managers like conda, and providing an officially supported installation interface that works at the Python prompt rather than via the operating system command line), but those don't require the same level of political coordination across multiple groups that was needed to establish pip as the lowest common denominator approach to dependency management for Python applications.

Making the use of distutils and setuptools optional

As mentioned above, distutils was designed ~18 years ago as a common interface for Linux distributions to build Python projects, while setuptools was designed ~12 years ago as a plugin management system for an open source Microsoft Exchange replacement. While both projects have given admirable service in their original target niches, and quite a few more besides, their age and original purpose means they're significantly more complex than what a user needs if all they want to do is to publish their pure Python library or framework to the Python Package index.

Their underlying complexity also makes it incredibly difficult to improve the problematic state of their documentation, which is split between the legacy distutils documentation in the CPython standard library and the additional setuptools specific documentation in the setuptools project.

Accordingly, what we want to do is to change the way build toolchains for Python projects are organised to have 3 clearly distinct tiers:

toolchains for pure Python projects
toolchains for Python projects with simple C extensions
toolchains for C/C++/other projects with Python bindings

This allows folks to be introduced to simpler tools like flit first, better enables the development of potential alternatives to setuptools at the second tier, and supports the use of full-fledged pip-installable build systems like Scons and Meson at the third tier.

The first step in this project, defining the pyproject.toml format to allow declarative specification of the dependencies needed to launch setup.py, has been implemented, and Daniel Holth's enscons project demonstrates that that is already sufficient to bootstrap an external build system even without the later stages of the project.

Future steps include providing native support for pyproject.toml in pip and easy_install, as well as defining a declarative approach to invoking the build system rather than having to run setup.py with the relevant distutils & setuptools flags.

Making PyPI security independent of SSL/TLS

PyPI currently relies entirely on SSL/TLS to protect the integrity of the link between software publishers and PyPI, and between PyPI and software consumers. The only protections against insider threats from within the PyPI administration team are ad hoc usage of GPG artifact signing by some projects, personal vetting of new team members by existing team members and 3rd party checks against previously published artifact hashes unexpectedly changing.

A credible design for end-to-end package signing that adequately accounts for the significant usability issues that can arise around publisher and consumer key management has been available for almost 3 years at this point (see Surviving a Compromise of PyPI and Surviving a Compromise of PyPI: the Maximum Security Edition).

However, implementing that solution has been gated not only on being able to first retire the legacy infrastructure, but also the PyPI administators being able to credibly commit to the key management obligations of operating the signing system, as well as to ensuring that the system-as-implemented actually provides the security guarantees of the system-as-designed.

Accordingly, this isn't a project that can realistically be pursued until the underlying sustainability problems have been suitably addressed.

Automating wheel creation

While redistributors will generally take care of converting upstream Python packages into their own preferred formats, the Python-specific wheel format is currently a case where it is left up to publishers to decide whether or not to create them, and if they do decide to create them, how to automate that process.

Having PyPI take care of this process automatically is an obviously desirable feature, but it's also an incredibly expensive one to build and operate.

Thus, it currently makes sense to defer this cost to individual projects, as there are quite a few commercial continuous integration and continuous deployment service providers willing to offer free accounts to open source projects, and these can also be used for the task of producing release artifacts. Projects also remain free to only publish source artifacts, relying on pip's implicit wheel creation and caching and the appropriate use of private PyPI mirrors and caches to meet the needs of end users.

For downstream platform communities already offering shared build infrastructure to their members (such as Linux distributions and conda-forge), it may make sense to offer Python wheel generation as a supported output option for cross-platform development use cases, in addition to the platform's native binary packaging format.

What problem does it solve?

Alyssa Coghlan

2016-08-06 11:05

Comments

One of the more puzzling aspects of Python for newcomers to the language is the stark usability differences between the standard library's urllib module and the popular (and well-recommended) third party module, requests, when it comes to writing HTTP(S) protocol clients. When your problem is "talk to a HTTP server", the difference in usability isn't immediately obvious, but it becomes clear as soon as additional requirements like SSL/TLS, authentication, redirect handling, session management, and JSON request and response bodies enter the picture.

It's tempting, and entirely understandable, to want to chalk this difference in ease of use up to requests being "Pythonic" (in 2016 terms), while urllib has now become un-Pythonic (despite being included in the standard library).

While there are certainly a few elements of that (e.g. the property builtin was only added in Python 2.2, while urllib2 was included in the original Python 2.0 release and hence couldn't take that into account in its API design), the vast majority of the usability difference relates to an entirely different question we often forget to ask about the software we use: What problem does it solve?

That is, many otherwise surprising discrepancies between urllib/urllib2 and requests are best explained by the fact that they solve different problems, and the problems most HTTP client developers have today are closer to those Kenneth Reitz designed requests to solve in 2010/2011, than they are to the problems that Jeremy Hylton was aiming to solve more than a decade earlier.

It's all in the name

To quote the current Python 3 urllib package documentation: "urllib is a package that collects several modules for working with URLs".

And the docstring from Jeremy's original commit message adding urllib2 to CPython: "An extensible library for opening URLs using a variety [of] protocols".

Wait, what? We're just trying to write a HTTP client, so why is the documentation talking about working with URLs in general?

While it may seem strange to developers accustomed to the modern HTTPS+JSON powered interactive web, it wasn't always clear that that was how things were going to turn out.

At the turn of the century, the expectation was instead that we'd retain a rich variety of data transfer protocols with different characteristics optimised for different purposes, and that the most useful client to have in the standard library would be one that could be used to talk to multiple different kinds of servers (like HTTP, FTP, NFS, etc), without client developers needing to worry too much about the specific protocol used (as indicated by the URL schema).

In practice, things didn't work out that way (mostly due to restrictive institutional firewalls meaning HTTP servers were the only remote services that could be accessed reliably), so folks in 2016 are now regularly comparing the usability of a dedicated HTTP(S)-only client library with a general purpose URL handling library that needs to be configured to specifically be using HTTP(S) before you gain access to most HTTP(S) features.

When it was written, urllib2 was a square peg that was designed to fit into the square hole of "generic URL processing". By contrast, most modern client developers are looking for a round peg to fit into the round hole that is HTTPS+JSON processing - urllib/urllib2 will fit if you shave the corners off first, but requests comes pre-rounded.

So why not add requests to the standard library?

Answering the not-so-obvious question of "What problem does it solve?" then leads to a more obvious follow-up question: if the problems that urllib/ urllib2 were designed to solve are no longer common, while the problems that requests solves are common, why not add requests to the standard library?

If I recall correctly, Guido gave in-principle approval to this idea at a language summit back in 2013 or so (after the requests 1.0 release), and it's a fairly common assumption amongst the core development team that either requests itself (perhaps as a bundled snapshot of an independently upgradable component) or a compatible subset of the API with a different implementation will eventually end up in the standard library.

However, even putting aside the misgivings of the requests developers about the idea, there are still some non-trivial system integration problems to solve in getting requests to a point where it would be acceptable as a standard library component.

In particular, one of the things that requests does to more reliably handle SSL/TLS certificates in a cross-platform way is to bundle the Mozilla Certificate Bundle included in the certifi project. This is a sensible thing to do by default (due to the difficulties of obtaining reliable access to system security certificates in a cross-platform way), but it conflicts with the security policy of the standard library, which specifically aims to delegate certificate management to the underlying operating system. That policy aims to address two needs: allowing Python applications access to custom institutional certificates added to the system certificate store (most notably, private CA certificates for large organisations), and avoiding adding an additional certificate store to end user systems that needs to be updated when the root certificate bundle changes for any other reason.

These kinds of problems are technically solvable, but they're not fun to solve, and the folks in a position to help solve them already have a great many other demands on their time.This means we're not likely to see much in the way of progress in this area as long as most of the CPython and requests developers are pursuing their upstream contributions as a spare time activity, rather than as something they're specifically employed to do.

Propose a talk for the PyCon Australia Education Seminar!

Alyssa Coghlan

2016-05-02 01:29

Comments

The pitch

Involved in Australian education, whether formally or informally? Making use of Python in your classes, workshops or other activities? Interested in sharing your efforts with other Australian educators, and with the developers that create the tools you use? Able to get to the Melbourne Convention & Exhibition Centre on Friday August 12th, 2016?

Then please consider submitting a proposal to speak at the Python in Australian Education seminar at PyCon Australia 2016! More information about the seminar can be found here, while details of the submission process are on the main Call for Proposals page.

Submissions close on Sunday May 8th, but may be edited further after submission (including during the proposal review process based on feedback from reviewers).

PyCon Australia is a community-run conference, so everyone involved is a volunteer (organisers, reviewers, and speakers alike), but accepted speakers are eligible for discounted (or even free) registration, and assistance with other costs is also available to help ensure the conference doesn't miss out on excellent presentations due to financial need (for teachers needing to persuade skeptical school administrators, this assistance may extend to contributing towards the costs of engaging a substitute teacher for the day).

The background

At PyCon Australia 2014, James Curran presented an excellent keynote on "Python for Every Child in Australia", covering some of the history of the National Computer Science School, the development of Australia's National Digital Curriculum (finally approved in September 2015), and the opportunity this represented to introduce the next generation of students to computational thinking in general, and Python in particular.

Encouraged by both Dr Curran's keynote at PyCon Australia, and Professor Lorena Barba's "If There's Computational Thinking, There's Computational Learning" keynote at SciPy 2014, it was my honour and privilege in 2015 not only to invite Carrie Anne Philbin, Education Pioneer at the UK's Raspberry Pi Foundation, to speak at the main conference (on "Designed for Education: a Python Solution"), but also to invite her to keynote the inaugural Python in Australian Education seminar. With the support of the Python Software Foundation and Code Club Australia, Carrie Anne joined QSITE's Peter Whitehouse, Code Club Australia's Kelly Tagalan, and several other local educators, authors and community workshop organisers to present an informative, inspirational and sometimes challenging series of talks.

For 2016, we have a new location in Melbourne (PyCon Australia has a two year rotation in each city, and the Education seminar was launched during the second year in Brisbane), a new co-organiser (Katie Bell of Grok Learning and the National Computer Science School), and a Call for Proposals and financial assistance program that are fully integrated with those for the main conference.

As with the main conference, however, the Python in Australian Education seminar is designed around the idea of real world practitioners sharing information with each other about their day to day experiences, what has worked well for them, and what hasn't, and creating personal connections that can help facilitate additional collaboration throughout the year.

So, in addition to encouraging people to submit their own proposals, I'd also encourage folks to talk to their friends and peers that they'd like to see presenting, and see if they're interested in participating.

27 languages to improve your Python

Alyssa Coghlan

2015-10-11 02:54

Comments

As a co-designer of one of the world's most popular programming languages, one of the more frustrating behaviours I regularly see (both in the Python community and in others) is influential people trying to tap into fears of "losing" to other open source communities as a motivating force for community contributions. (I'm occasionally guilty of this misbehaviour myself, which makes it even easier to spot when others are falling into the same trap).

While learning from the experiences of other programming language communities is a good thing, fear based approaches to motivating action are seriously problematic, as they encourage community members to see members of those other communities as enemies in a competition for contributor attention, rather than as potential allies in the larger challenge of advancing the state of the art in software development. It also has the effect of telling folks that enjoy those other languages that they're not welcome in a community that views them and their peers as "hostile competitors".

In truth, we want there to be a rich smorgasboard of cross platform open source programming languages to choose from, as programming languages are first and foremost tools for thinking - they make it possible for us to convey our ideas in terms so explicit that even a computer can understand them. If someone has found a language to use that fits their brain and solves their immediate problems, that's great, regardless of the specific language (or languages) they choose.

So I have three specific requests for the Python community, and one broader suggestion. First, the specific requests:

If we find it necessary to appeal to tribal instincts to motivate action, we should avoid using tribal fear, and instead aim to use tribal pride. When we use fear as a motivator, as in phrasings like "If we don't do X, we're going to lose developer mindshare to language Y", we're deliberately creating negative emotions in folks freely contributing the results of their work to the world at large. Relying on tribal pride instead leads to phrasings like "It's currently really unclear how to solve problem X in Python. If we look to ecosystem Y, we can see they have a really nice approach to solving problem X that we can potentially adapt to provide a similarly nice user experience in Python". Actively emphasising taking pride in our own efforts, rather than denigrating the efforts of others, helps promote a culture of continuous learning within the Python community and also encourages the development of ever improving collaborative relationships with other communities.
Refrain from adopting attitudes of contempt towards other open source programming language communities, especially if those communities have empowered people to solve their own problems rather than having to wait for commercial software vendors to deign to address them. Most of the important problems in the world aren't profitable to solve (as the folks afflicted by them aren't personally wealthy and don't control institutional funding decisions), so we should be encouraging and applauding the folks stepping up to try to solve them, regardless of what we may think of their technology choices.
If someone we know is learning to program for the first time, and they choose to learn a language we don't personally like, we should support them in their choice anyway. They know what fits their brain better than we do, so the right language for us may not be the right language for them. If they start getting frustrated with their original choice, to the point where it's demotivating them from learning to program at all, then it makes sense to start recommending alternatives. This advice applies even for those of us involved in improving the tragically bad state of network security: the way we solve the problem with inherently insecure languages is by improving operating system sandboxing capabilities, progressively knocking down barriers to adoption for languages with better native security properties, and improving the default behaviours of existing languages, not by confusing beginners with arguments about why their chosen language is a poor choice from an application security perspective. (If folks are deploying unaudited software written by beginners to handle security sensitive tasks, it isn't the folks writing the software that are the problem, it's the folks deploying it without performing appropriate due diligence on the provenance and security properties of that software)

My broader suggestion is aimed at folks that are starting to encounter the limits of the core procedural subset of Python and would hence like to start exploring more of Python's own available "tools for thinking".

Broadening our horizons

One of the things we do as part of the Python core development process is to look at features we appreciate having available in other languages we have experience with, and see whether or not there is a way to adapt them to be useful in making Python code easier to both read and write. This means that learning another programming language that focuses more specifically on a given style of software development can help improve anyone's understanding of that style of programming in the context of Python.

To aid in such efforts, I've provided a list below of some possible areas for exploration, and other languages which may provide additional insight into those areas. Where possible, I've linked to Wikipedia pages rather than directly to the relevant home pages, as Wikipedia often provides interesting historical context that's worth exploring when picking up a new programming language as an educational exercise rather than for immediate practical use.

While I do know many of these languages personally (and have used several of them in developing production systems), the full list of recommendations includes additional languages that I only know indirectly (usually by either reading tutorials and design documentation, or by talking to folks that I trust to provide good insight into a language's strengths and weaknesses).

There are a lot of other languages that could have gone on this list, so the specific ones listed are a somewhat arbitrary subset based on my own interests (for example, I'm mainly interested in the dominant Linux, Android and Windows ecosystems, so I left out the niche-but-profitable Apple-centric Objective-C and Swift programming languages, and I'm not familiar enough with art-focused environments like Processing to even guess at what learning them might teach a Python developer). For a more complete list that takes into account factors beyond what a language might teach you as a developer, IEEE Spectrum's annual ranking of programming language popularity and growth is well worth a look.

Procedural programming: C, Rust, Cython

Python's default execution model is procedural: we start at the top of the main module and execute it statement by statement. All of Python's support for the other approaches to data and computational modelling covered below is built on this procedural foundation.

The C programming language is still the unchallenged ruler of low level procedural programming. It's the core implementation language for the reference Python interpreter, and also for the Linux operating system kernel. As a software developer, learning C is one of the best ways to start learning more about the underlying hardware that executes software applications - C is often described as "portable assembly language", and one of the first applications cross-compiled for any new CPU architecture will be a C compiler

Rust, by contrast, is a relatively new programming language created by Mozilla. The reason it makes this list is because Rust aims to take all of the lessons we've learned as an industry regarding what not to do in C, and design a new language that is interoperable with C libraries, offers the same precise control over hardware usage that is needed in a low level systems programming language, but uses a different compile time approach to data modelling and memory management to structurally eliminate many of the common flaws afflicting C programs (such as buffer overflows, double free errors, null pointer access, and thread synchronisation problems). I'm an embedded systems engineer by training and initial professional experience, and Rust is the first new language I've seen that looks like it may have the potential to scale down to all of the niches currently dominated by C and custom assembly code.

Cython is also a lower level procedural-by-default language, but unlike general purpose languages like C and Rust, Cython is aimed specifically at writing CPython extension modules. To support that goal, Cython is designed as a Python superset, allowing the programmer to choose when to favour the pure Python syntax for flexibility, and when to favour Cython's syntax extensions that make it possible to generate code that is equivalent to native C code in terms of speed and memory efficiency.

Learning one of these languages is likely to provide insight into memory management, algorithmic efficiency, binary interface compatibility, software portability, and other practical aspects of turning source code into running systems.

Object-oriented data modelling: Java, C#, Eiffel

One of the main things we need to do in programming is to model the state of the real world, and offering native syntactic support for object-oriented programming is one of the most popular approaches for doing that: structurally grouping data structures, and methods for operating on those data structures into classes.

Python itself is deliberately designed so that it is possible to use the object-oriented features without first needing to learn to write your own classes. Not every language adopts that approach - those listed in this section are ones that consider learning object-oriented design to be a requirement for using the language at all.

After a major marketing push by Sun Microsystems in the mid-to-late 1990's, Java became the default language for teaching introductory computer science in many tertiary institutions. While it is now being displaced by Python for many educational use cases, it remains one of the most popular languages for the development of business applications. There are a range of other languages that target the common JVM (Java Virtual Machine) runtime, including the Jython implementation of Python. The Dalvik and ART environments for Android systems are based on a reimplementation of the Java programming APIs.

C# is similar in many ways to Java, and emerged as an alternative after Sun and Microsoft failed to work out their business differences around Microsoft's Java implementation, J++. Like Java, it's a popular language for the development of business applications, and there are a range of other languages that target the shared .NET CLR (Common Language Runtime), including the IronPython implementation of Python (the core components of the original IronPython 1.0 implementation were extracted to create the language neutral .NET Dynamic Language Runtime). For a long time, .NET was a proprietary Windows specific technology, with mono as a cross-platform open source reimplementation, but Microsoft shifted to an open source ecosystem strategy in early 2015.

Unlike most of the languages in this list, Eiffel isn't one I'd recommend for practical day-to-day use. Rather, it's one I recommend because learning it taught me an incredible amount about good object-oriented design where "verifiably correct" is a design goal for the application. (Learning Eiffel also taught me a lot about why "verifiably correct" isn't actually a design goal in most software development, as verifiably correct software really doesn't cope well with ambiguity and is entirely unsuitable for cases where you genuinely don't know the relevant constraints yet and need to leave yourself enough wiggle room to be able to figure out the finer details through iterative development).

Learning one of these languages is likely to provide insight into inheritance models, design-by-contract, class invariants, pre-conditions, post-conditions, covariance, contravariance, method resolution order, generic programming, and various other notions that also apply to Python's type system. There are also a number of standard library modules and third party frameworks that use this "visibly object-oriented" design style, such as the unittest and logging modules, and class-based views in the Django web framework.

Object-oriented C derivatives: C++, D

One way of using the CPython runtime is as a "C with objects" programming environment - at its core, CPython is implemented using C's approach to object-oriented programming, which is to define C structs to hold the data of interest, and to pass in instances of the struct as the first argument to functions that then manipulate that data (these are the omnipresent PyObject* pointers in the CPython C API). This design pattern is deliberately mirrored at the Python level in the form of the explicit self and cls arguments to instance methods and class methods.

C++ is a programming language that aimed to retain full source compatibility with C, while adding higher level features like native object-oriented programming support and template based metaprogramming. It's notoriously verbose and hard to program in (although the 2011 update to the language standard addressed many of the worst problems), but it's also the language of choice in many contexts, including 3D modelling graphics engines and cross-platform application development frameworks like Qt.

The D programming language is also interesting, as it has a similar relationship to C++ as Rust has to C: it aims to keep most of the desirable characteristics of C++, while also avoiding many of its problems (like the lack of memory safety). Unlike Rust, D was not a ground up design of a new programming language from scratch - instead, D is a close derivative of C++, and while it isn't a strict C superset as C++ is, it does follow the design principle that any code that falls into the common subset of C and D must behave the same way in both languages.

Learning one of these languages is likely to provide insight into the complexities of combining higher level language features with the underlying C runtime model. Learning C++ is also likely to be useful when using Python to manipulate existing libraries and toolkits written in C++.

Array-oriented data processing: MATLAB/Octave, Julia

Array oriented programming is designed to support numerical programming models: those based on matrix algebra and related numerical methods.

While Python's standard library doesn't support this directly, array oriented programming is taken into account in the language design, with a range of syntactic and semantic features being added specifically for the benefit of the third party NumPy library and similarly array-oriented tools.

In many cases, the Scientific Python stack is adopted as an alternative to the proprietary MATLAB programming environment, which is used extensively for modelling, simulation and numerical data analysis in science and engineering. GNU Octave is an open source alternative that aims to be syntactically compatible with MATLAB code, allowing folks to compare and contrast the two approaches to array-oriented programming.

Julia is another relatively new language, which focuses heavily on array oriented programming and type-based function overloading.

Learning one of these languages is likely to provide insight into the capabilities of the Scientific Python stack, as well as providing opportunities to explore hardware level parallel execution through technologies like OpenCL and Nvidia's CUDA, and distributed data processing through ecosystems like Apache Spark and the Python-specific Blaze.

Statistical data analysis: R

As access to large data sets has grown, so has demand for capable freely available analytical tools for processing those data sets. One such tool is the R programming language, which focuses specifically on statistical data analysis and visualisation.

Learning R is likely to provide insight into the statistical analysis capabilities of the Scientific Python stack, especially the pandas data manipulation library and the seaborn statistical visualisation library.

Computational pipeline modelling: Haskell, Scala, Clojure, F#

Object-oriented data modelling and array-oriented data processing focus a lot of attention on modelling data at rest, either in the form of collections of named attributes or as arrays of structured data.

By contrast, functional programming languages emphasise the modelling of data in motion, in the form of computational flows. Learning at least the basics of functional programming can help greatly improve the structure of data transformation operations even in otherwise procedural, object-oriented or array-oriented applications.

Haskell is a functional programming language that has had a significant influence on the design of Python, most notably through the introduction of list comprehensions in Python 2.0.

Scala is an (arguably) functional programming language for the JVM that, together with Java, Python and R, is one of the four primary programming languages for the Apache Spark data analysis platform. While being designed to encourage functional programming approaches, Scala's syntax, data model, and execution model are also designed to minimise barriers to adoption for current Java programmers (hence the "arguably" - the case can be made that Scala is better categorised as an object-oriented programming language with strong functional programming support).

Clojure is another functional programming language for the JVM that is designed as a dialect of Lisp. It earns its place in this list by being the inspiration for the toolz functional programming toolkit for Python.

F# isn't a language I'm particularly familiar with myself, but seems worth noting as the preferred functional programming language for the .NET CLR.

Learning one of these languages is likely to provide insight into Python's own computational pipeline modelling tools, including container comprehensions, generators, generator expressions, the functools and itertools standard library modules, and third party functional Python toolkits like toolz.

Event driven programming: JavaScript, Go, Erlang, Elixir

Computational pipelines are an excellent way to handle data transformation and analysis problems, but many problems require that an application run as a persistent service that waits for events to occur, and then handles those events. In these kinds of services, it is usually essential to be able to handle multiple events concurrently in order to be able to accommodate multiple users (or at least multiple actions) at the same time.

JavaScript was originally developed as an event handling language for web browsers, permitting website developers to respond locally to client side actions (such as mouse clicks and key presses) and events (such as the page rendering being completed). It is supported in all modern browsers, and together with the HTML5 Domain Object Model, has become a de facto standard for defining the appearance and behaviour of user interfaces.

Go was designed by Google as a purpose built language for creating highly scalable web services, and has also proven to be a very capable language for developing command line applications. The most interesting aspect of Go from a programming language design perspective is its use of Communicating Sequential Processes concepts in its core concurrency model.

Erlang was designed by Ericsson as a purpose built language for creating highly reliable telephony switches and similar devices, and is the language powering the popular RabbitMQ message broker. Erlang uses the Actor model as its core concurrency primitive, passing messages between threads of execution, rather than allowing them to share data directly. While I've never programmed in Erlang myself, my first full-time job involved working with (and on) an Actor-based concurrency framework for C++ developed by an ex-Ericsson engineer, as well as developing such a framework myself based on the TSK (Task) and MBX (Mailbox) primitives in Texas Instrument's lightweight DSP/BIOS runtime (now known as TI-RTOS).

Elixir earns an entry on the list by being a language designed to run on the Erlang VM that exposes the same concurrency semantics as Erlang, while also providing a range of additional language level features to help provide a more well-rounded environment that is more likely to appeal to developers migrating from other languages like Python, Java, or Ruby.

Learning one of these languages is likely to provide insight into Python's own concurrency and parallelism support, including native coroutines, generator based coroutines, the concurrent.futures and asyncio standard library modules, third party network service development frameworks like Twisted and Tornado, the channels concept being introduced to Django, and the event handling loops in GUI frameworks.

Gradual typing: TypeScript

One of the more controversial features that landed in Python 3.5 was the new typing module, which brings a standard lexicon for gradual typing support to the Python ecosystem.

For folks whose primary exposure to static typing is in languages like C, C++ and Java, this seems like an astoundingly terrible idea (hence the controversy).

Microsoft's TypeScript, which provides gradual typing for JavaScript applications provides a better illustration of the concept. TypeScript code compiles to JavaScript code (which then doesn't include any runtime type checking), and TypeScript annotations for popular JavaScript libraries are maintained in the dedicated DefinitelyTyped repository.

As Chris Neugebauer pointed out in his PyCon Australia presentation, this is very similar to the proposed relationship between Python, the typeshed type hint repository, and type inference and analysis tools like mypy.

In essence, both TypeScript and type hinting in Python are ways of writing particular kinds of tests, either as separate files (just like normal tests), or inline with the main body of the code (just like type declarations in statically typed languages). In either case, you run a separate command to actually check that the rest of the code is consistent with the available type assertions (this occurs implicitly as part of the compilation to JavaScript for TypeScript, and as an entirely optional static analysis task for Python's type hinting).

Dynamic metaprogramming: Hy, Ruby

A feature folks coming to Python from languages like C, C++, C# and Java often find disconcerting is the notion that "code is data": the fact that things like functions and classes are runtime objects that can be manipulated like any other object.

Hy is a Lisp dialect that runs on both the CPython VM and the PyPy VM. Lisp dialects take the "code as data" concept to extremes, as Lisp code consists of nested lists describing the operations to be performed (the name of the language itself stands for "LISt Processor"). The great strength of Lisp-style languages is that they make it incredibly easy to write your own domain specific languages. The great weakness of Lisp-style languages is that they make it incredibly easy to write your own domain specific languages, which can sometimes make it difficult to read other people's code.

Ruby is a language that is similar to Python in many respects, but as a community is far more open to making use of dynamic metaprogramming features that are "supported, but not encouraged" in Python. This includes things like reopening class definitions to add additional methods, and using closures to implement core language constructs like iteration.

Learning one of these languages is likely to provide insight into Python's own dynamic metaprogramming support, including function and class decorators, monkeypatching, the unittest.mock standard library module, and third party object proxying modules like wrapt. (I'm not aware of any languages to learn that are likely to provide insight into Python's metaclass system, so if anyone has any suggestions on that front, please mention them in the comments. Metaclasses power features like the core type system, abstract base classes, enumeration types and runtime evaluation of gradual typing expressions)

Pragmatic problem solving: Lua, PHP, Perl

Popular programming languages don't exist in isolation - they exist as part of larger ecosystems of redistributors (both commercial and community focused), end users, framework developers, tool developers, educators and more.

Lua is a popular programming language for embedding in larger applications as a scripting engine. Significant examples include it being the language used to write add-ons for the World of Warcraft game client, and it's also embedded in the RPM utility used by many Linux distributions. Compared to CPython, a Lua runtime will generally be a tenth of the size, and its weaker introspection capabilities generally make it easier to isolate from the rest of the application and the host operating system. A notable contribution from the Lua community to the Python ecosystem is the adoption of the LuaJIT FFI (Foreign Function Interface) as the basis of the JIT-friendly cffi interface library for CPython and PyPy.

PHP is another popular programming language that rose to prominence as the original "P" in the Linux-Apache-MySQL-PHP LAMP stack, due to its focus on producing HTML pages, and its broad availability on early Virtual Private Server hosting providers. For all the handwringing about conceptual flaws in various aspects of its design, it's now the basis of several widely popular open source web services, including the Drupal content management system, the Wordpress blogging engine, and the MediaWiki engine that powers Wikipedia. PHP also powers important services like the Ushahidi platform for crowdsourced community reporting on distributed events.

Like PHP, Perl rose to popularity on the back of Linux. Unlike PHP, which grew specifically as a web development platform, Perl rose to prominence as a system administrator's tool, using regular expressions to string together and manipulate the output of text-based Linux operating system commands. When sh, awk and sed were no longer up to handling a task, Perl was there to take over.

Learning one of these languages isn't likely to provide any great insight into aesthetically beautiful or conceptually elegant programming language design. What it is likely to do is to provide some insight into how programming language distribution and adoption works in practice, and how much that has to do with fortuitous opportunities, accidents of history and lowering barriers to adoption by working with redistributors to be made available by default, rather than the inherent capabilities of the languages themselves.

In particular, it may provide insight into the significance of projects like CKAN, OpenStack NFV, Blender, SciPy, OpenMDAO, PyGMO, PyCUDA, the Raspberry Pi Foundation and Python's adoption by a wide range of commercial organisations, for securing ongoing institutional investment in the Python ecosystem.

Computational thinking: Scratch, Logo

Finally, I fairly regularly get into discussions with functional and object-oriented programming advocates claiming that those kinds of languages are just as easy to learn as procedural ones.

I think the OOP folks have a point if we're talking about teaching through embodied computing (e.g. robotics), where the objects being modelled in software have direct real world counterparts the students can touch, like sensors, motors, and relays.

For everyone else though, I now have a standard challenge: pick up a cookbook, translate one of the recipes into the programming language you're claiming is easy to learn, and then get a student that understands the language the original cookbook was written in to follow the translated recipe. Most of the time folks don't need to actually follow through on this - just running it as a thought experiment is enough to help them realise how much prior knowledge their claim of "it's easy to learn" is assuming. (I'd love to see academic researchers perform this kind of study for real though - I'd be genuinely fascinated to read the results)

Another way to tackle this problem though is to go learn the languages that are actually being used to start teaching computational thinking to children.

One of the most popular of those is Scratch, which uses a drag-and-drop programming interface to let students manipulate a self-contained graphical environment, with sprites moving around and reacting to events in that environment. Graphical environments like Scratch are the programming equivalent of the picture books we use to help introduce children to reading and writing.

This idea of using a special purpose educational language to manipulate a graphical environment isn't new though, with one of the earliest incarnations being the Logo environment created back in the 1960's. In Logo (and similar environments like Python's own turtle module), the main thing you're interacting with is a "turtle", which you can instruct to move around and modify its environment by drawing lines. This way, concepts like command sequences, repetition, and state (e.g. "pen up", "pen down") can be introduced in a way that builds on people's natural intuitions ("imagine you're the turtle, what's going to happen if you turn right 90 degrees?")

Going back and relearning one of these languages as an experienced programmer is most useful as a tool for unlearning: the concepts they introduce help remind us that these are concepts that we take for granted now, but needed to learn at some point as beginners. When we do that, we're better able to work effectively with students and other newcomers, as we're more likely to remember to unpack our chains of logic, including the steps we'd otherwise take for granted.

TCP echo client and server in Python 3.5

Alyssa Coghlan

2015-07-11 06:31

Comments

This is a follow-on from my previous post on Python 3.5's new async/await syntax. Rather than the simple background timers used in the original post, this one will look at the impact native coroutine support has on the TCP echo client and server examples from the asyncio documentation.

First, we'll recreate the run_in_foreground helper defined in the previous post. This helper function makes it easier to work with coroutines from otherwise synchronous code (like the interactive prompt):

def run_in_foreground(task, *, loop=None):
    """Runs event loop in current thread until the given task completes

    Returns the result of the task.
    For more complex conditions, combine with asyncio.wait()
    To include a timeout, combine with asyncio.wait_for()
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    return loop.run_until_complete(asyncio.ensure_future(task, loop=loop))

Next we'll define the coroutine for our TCP echo server implementation, which simply waits to receive up to 100 bytes on each new client connection, and then sends that data back to the client:

async def handle_tcp_echo(reader, writer):
    data = await reader.read(100)
    message = data.decode()
    addr = writer.get_extra_info('peername')
    print("-> Server received %r from %r" % (message, addr))
    print("<- Server sending: %r" % message)
    writer.write(data)
    await writer.drain()
    print("-- Terminating connection on server")
    writer.close()

And then the client coroutine we'll use to send a message and wait for a response:

async def tcp_echo_client(message, port, loop=None):
    reader, writer = await asyncio.open_connection('127.0.0.1', port,
                                                        loop=loop)
    print('-> Client sending: %r' % message)
    writer.write(message.encode())
    data = (await reader.read(100)).decode()
    print('<- Client received: %r' % data)
    print('-- Terminating connection on client')
    writer.close()
    return data

We then use our run_in_foreground helper to interact with these coroutines from the interactive prompt. First, we start the echo server:

>>> make_server = asyncio.start_server(handle_tcp_echo, '127.0.0.1')
>>> server = run_in_foreground(make_server)

Conveniently, since this is a coroutine running in the current thread, rather than in a different thread, we can retrieve the details of the listening socket immediately, including the automatically assigned port number:

>>> server.sockets[0]
<socket.socket fd=6, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 40796)>
>>> port = server.sockets[0].getsockname()[1]

Since we haven't needed to hardcode the port number, if we want to define a second server, we can easily do that as well:

>>> make_server2 = asyncio.start_server(handle_tcp_echo, '127.0.0.1')
>>> server2 = run_in_foreground(make_server2)
>>> server2.sockets[0]
<socket.socket fd=7, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 41200)>
>>> port2 = server2.sockets[0].getsockname()[1]

Now, both of these servers are configured to run directly in the main thread's event loop, so trying to talk to them using a synchronous client wouldn't work. The client would block the main thread, and the servers wouldn't be able to process incoming connections. That's where our asynchronous client coroutine comes in: if we use that to send messages to the server, then it doesn't block the main thread either, and both the client and server coroutines can process incoming events of interest. That gives the following results:

>>> print(run_in_foreground(tcp_echo_client('Hello World!', port)))
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44386)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
Hello World!

Note something important here: you will get exactly that sequence of output messages, as this is all running in the interpreter's main thread, in a deterministic order. If the servers were running in their own threads, we wouldn't have that property (and reliably getting access to the port numbers the server components were assigned by the underlying operating system would also have been far more difficult).

And to demonstrate both servers are up and running:

>>> print(run_in_foreground(tcp_echo_client('Hello World!', port2)))
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44419)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
Hello World!

That then raises an interesting question: how would we send messages to the two servers in parallel, while still only using a single thread to manage the client and server coroutines? For that, we'll need another of our helper functions from the previous post, schedule_coroutine:

def schedule_coroutine(target, *, loop=None):
    """Schedules target coroutine in the given event loop

    If not given, *loop* defaults to the current thread's event loop

    Returns the scheduled task.
    """
    if asyncio.iscoroutine(target):
        return asyncio.ensure_future(target, loop=loop)
    raise TypeError("target must be a coroutine, "
                    "not {!r}".format(type(target)))

Update: As with the previous post, this post originally suggested a combined "run_in_background" helper function that handled both scheduling coroutines and calling arbitrary callables in a background thread or process. On further reflection, I decided that was unhelpfully conflating two different concepts, so I replaced it with separate "schedule_coroutine" and "call_in_background" helpers

First, we set up the two client operations we want to run in parallel:

>>> echo1 = schedule_coroutine(tcp_echo_client('Hello World!', port))
>>> echo2 = schedule_coroutine(tcp_echo_client('Hello World!', port2))

Then we use the asyncio.wait function in combination with run_in_foreground to run the event loop until both operations are complete:

>>> run_in_foreground(asyncio.wait([echo1, echo2]))
-> Client sending: 'Hello World!'
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44461)
<- Server sending: 'Hello World!'
-- Terminating connection on server
-> Server received 'Hello World!' from ('127.0.0.1', 44462)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
<- Client received: 'Hello World!'
-- Terminating connection on client
({<Task finished coro=<tcp_echo_client() done, defined at <stdin>:1> result='Hello World!'>, <Task finished coro=<tcp_echo_client() done, defined at <stdin>:1> result='Hello World!'>}, set())

And finally, we retrieve our results using the result method of the task objects returned by schedule_coroutine:

>>> echo1.result()
'Hello World!'
>>> echo2.result()
'Hello World!'

We can set up as many concurrent background tasks as we like, and then use asyncio.wait as the foreground task to wait for them all to complete.

But what if we had an existing blocking client function that we wanted or needed to use (e.g. we're using an asyncio server to test a synchronous client API). To handle that case, we use our third helper function from the previous post:

def call_in_background(target, *, loop=None, executor=None):
    """Schedules and starts target callable as a background task

    If not given, *loop* defaults to the current thread's event loop
    If not given, *executor* defaults to the loop's default executor

    Returns the scheduled task.
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    if callable(target):
        return loop.run_in_executor(executor, target)
    raise TypeError("target must be a callable, "
                    "not {!r}".format(type(target)))

To explore this, we'll need a blocking client, which we can build based on Python's existing socket programming HOWTO guide:

import socket
def tcp_echo_client_sync(message, port):
    conn = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print('-> Client connecting to port: %r' % port)
    conn.connect(('127.0.0.1', port))
    print('-> Client sending: %r' % message)
    conn.send(message.encode())
    data = conn.recv(100).decode()
    print('<- Client received: %r' % data)
    print('-- Terminating connection on client')
    conn.close()
    return data

We can then use functools.partial in combination with call_in_background to start client requests in multiple operating system level threads:

>>> query_server = partial(tcp_echo_client_sync, "Hello World!", port)
>>> query_server2 = partial(tcp_echo_client_sync, "Hello World!", port2)
>>> bg_call = call_in_background(query_server)
-> Client connecting to port: 35876
-> Client sending: 'Hello World!'
>>> bg_call2 = call_in_background(query_server2)
-> Client connecting to port: 41672
-> Client sending: 'Hello World!'

Here we see that, unlike our coroutine clients, the synchronous clients have started running immediately in a separate thread. However, because the event loop isn't currently running in the main thread, they've blocked waiting for a response from the TCP echo servers. As with the coroutine clients, we address that by running the event loop in the main thread until our clients have both received responses:

>>> run_in_foreground(asyncio.wait([bg_call, bg_call2]))
-> Server received 'Hello World!' from ('127.0.0.1', 52585)
<- Server sending: 'Hello World!'
-- Terminating connection on server
-> Server received 'Hello World!' from ('127.0.0.1', 34399)
<- Server sending: 'Hello World!'
<- Client received: 'Hello World!'
-- Terminating connection on server
-- Terminating connection on client
<- Client received: 'Hello World!'
-- Terminating connection on client
({<Future finished result='Hello World!'>, <Future finished result='Hello World!'>}, set())
>>> bg_call.result()
'Hello World!'
>>> bg_call2.result()
'Hello World!'

Background tasks in Python 3.5

Alyssa Coghlan

2015-07-10 08:17

Comments

One of the recurring questions with asyncio is "How do I execute one or two operations asynchronously in an otherwise synchronous application?"

Say, for example, I have the following code:

>>> import itertools, time
>>> def ticker():
...     for i in itertools.count():
...         print(i)
...         time.sleep(1)
...
>>> ticker()
0
1
2
3
^CTraceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "<stdin>", line 4, in ticker
KeyboardInterrupt

With the native coroutine syntax coming in Python 3.5, I can change that synchronous code into event-driven asynchronous code easily enough:

import asyncio, itertools
async def ticker():
    for i in itertools.count():
        print(i)
        await asyncio.sleep(1)

But how do I arrange for that ticker to start running in the background? What's the coroutine equivalent of appending & to a shell command?

It turns out it looks something like this:

import asyncio
def schedule_coroutine(target, *, loop=None):
    """Schedules target coroutine in the given event loop

    If not given, *loop* defaults to the current thread's event loop

    Returns the scheduled task.
    """
    if asyncio.iscoroutine(target):
        return asyncio.ensure_future(target, loop=loop)
    raise TypeError("target must be a coroutine, "
                    "not {!r}".format(type(target)))

Update: This post originally suggested a combined "run_in_background" helper function that handle both scheduling coroutines and calling arbitrary callables in a background thread or process. On further reflection, I decided that was unhelpfully conflating two different concepts, so I replaced it with separate "schedule_coroutine" and "call_in_background" helpers

So now I can do:

>>> import itertools
>>> async def ticker():
...     for i in itertools.count():
...         print(i)
...         await asyncio.sleep(1)
...
>>> ticker1 = schedule_coroutine(ticker())
>>> ticker1
<Task pending coro=<ticker() running at <stdin>:1>>

But how do I run that for a while? The event loop won't run unless the current thread starts it running and either stops when a particular event occurs, or when explicitly stopped. Another helper function covers that:

def run_in_foreground(task, *, loop=None):
    """Runs event loop in current thread until the given task completes

    Returns the result of the task.
    For more complex conditions, combine with asyncio.wait()
    To include a timeout, combine with asyncio.wait_for()
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    return loop.run_until_complete(asyncio.ensure_future(task, loop=loop))

And then I can do:

>>> run_in_foreground(asyncio.sleep(5))
0
1
2
3
4

Here we can see the background task running while we wait for the foreground task to complete. And if I do it again with a different timeout:

>>> run_in_foreground(asyncio.sleep(3))
5
6
7

We see that the background task picked up again right where it left off the first time.

We can also single step the event loop with a zero second sleep (the ticks reflect the fact there was more than a second delay between running each command):

>>> run_in_foreground(asyncio.sleep(0))
8
>>> run_in_foreground(asyncio.sleep(0))
9

And start a second ticker to run concurrently with the first one:

>>> ticker2 = schedule_coroutine(ticker())
>>> ticker2
<Task pending coro=<ticker() running at <stdin>:1>>
>>> run_in_foreground(asyncio.sleep(0))
0
10

The asynchronous tickers will happily hang around in the background, ready to resume operation whenever I give them the opportunity. If I decide I want to stop one of them, I can cancel the corresponding task:

>>> ticker1.cancel()
True
>>> run_in_foreground(asyncio.sleep(0))
1
>>> ticker2.cancel()
True
>>> run_in_foreground(asyncio.sleep(0))

But what about our original synchronous ticker? Can I run that as a background task? It turns out I can, with the aid of another helper function:

def call_in_background(target, *, loop=None, executor=None):
    """Schedules and starts target callable as a background task

    If not given, *loop* defaults to the current thread's event loop
    If not given, *executor* defaults to the loop's default executor

    Returns the scheduled task.
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    if callable(target):
        return loop.run_in_executor(executor, target)
    raise TypeError("target must be a callable, "
                    "not {!r}".format(type(target)))

However, I haven't figured out how to reliably cancel a task running in a separate thread or process, so for demonstration purposes, we'll define a variant of the synchronous version that stops automatically after 5 ticks rather than ticking indefinitely:

import itertools, time
def tick_5_sync():
    for i in range(5):
        print(i)
        time.sleep(1)
    print("Finishing")

The key difference between scheduling a callable in a background thread and scheduling a coroutine in the current thread, is that the callable will start executing immediately, rather than waiting for the current thread to run the event loop:

>>> threaded_ticker = call_in_background(tick_5_sync); print("Starts immediately!")
0
Starts immediately!
>>> 1
2
3
4
Finishing

That's both a strength (as you can run multiple blocking IO operations in parallel), but also a significant weakness - one of the benefits of explicit coroutines is their predictability, as you know none of them will start doing anything until you start running the event loop.

Inaugural PyCon Australia Education Miniconf

Alyssa Coghlan

2015-04-27 10:43

Comments

PyCon Australia launched its Call for Papers just over a month ago, and it closes in a little over a week on Friday the 8th of May.

A new addition to PyCon Australia this year, and one I'm particularly excited about co-organising following Dr James Curran's "Python for Every Child in Australia" keynote last year, is the inaugural Python in Education miniconf as a 4th specialist track on the Friday of the conference, before we move into the main program over the weekend.

From the CFP announcement: "The Python in Education Miniconf aims to bring together community workshop organisers, professional Python instructors and professional educators across primary, secondary and tertiary levels to share their experiences and requirements, and identify areas of potential collaboration with each other and also with the broader Python community."

If that sounds like you, then I'd love to invite you to head over to the conference website and make your submission to the Call for Papers!

This year, all 4 miniconfs (Education, Science & Data Analysis, OpenStack and DjangoCon AU) are running our calls for proposals as part of the main conference CFP - every proposal submitted will be considered for both the main conference and the miniconfs.

I'm also pleased to announce two pre-arranged sessions at the Education Miniconf:

Carrie Anne Philbin, author of "Adventures in Raspberry Pi" and Education Pioneer at the Raspberry Pi Foundation will be speaking on the Foundation's Picademy professional development program for primary and secondary teachers
Peter Whitehouse, author of IPT - a virtual approach (online since 1992!) and a longstanding board member of the Queensland Society for Information Technology in Education will provide insight into some of the challenges and opportunities of integrating Python and other open source software into Australian IT education

I'm genuinely looking forward to chairing this event, as I see tremendous potential in forging stronger connections between Australian educators (both formal and informal) and the broader Python and open source communities.