
The transition to multilingual programming

A recent thread on python-dev prompted me to summarise the current state of the ongoing industry wide transition from bilingual to multilingual programming as it relates to Python's cross-platform support. It also relates to the reasons why Python 3 turned out to be more disruptive than the core development team initially expected.

A good starting point for anyone interested in exploring this topic further is the "Origin and development" section of the Wikipedia article on Unicode, but I'll hit the key points below.

Monolingual computing

At their core, computers only understand single bits. Everything above that is based on conventions that ascribe higher level meanings to particular sequences of bits. One particularly important set of conventions for communicating between humans and computers is "text encodings": conventions that map particular sequences of bits to text in the actual languages humans read and write.

One of the oldest encodings still in common use is ASCII (which stands for "American Standard Code for Information Interchange"), developed during the 1960's (it just had its 50th birthday in 2013). This encoding maps the letters of the English alphabet (in both upper and lower case), the decimal digits, various punctuation characters and some additional "control codes" to the 128 numbers that can be encoded as a 7-bit sequence.

Many computer systems today still only work correctly with English - when you encounter such a system, it's a fairly good bet that either the system itself, or something it depends on, is limited to working with ASCII text. (If you're really unlucky, you might even get to work with modal 5-bit encodings like ITA-2, as I have. The legacy of the telegraph lives on!)

Working with local languages

The first attempts at dealing with this limitation of ASCII simply assigned meanings to the full range of 8-bit sequences. Known collectively as "Extended ASCII", each of these systems allowed for an additional 128 characters, which was enough to handle many European and Cyrillic scripts. Even 256 characters was nowhere near sufficient to deal with Indic or East Asian languages, however, so this time also saw a proliferation of ASCII incompatible encodings like ShiftJIS, ISO-2022 and Big5. This is why Python ships with support for dozens of codecs from around the world.

This proliferation of encodings required a way to tell software which encoding should be used to read the data. For protocols that were originally designed for communication between computers, agreeing on a common text encoding is usually handled as part of the protocol. In cases where no encoding information is supplied (or where the claimed encoding doesn't match the actual encoding), applications may make use of "encoding detection" algorithms, like those provided by the chardet package for Python. These algorithms aren't perfect, but can give good answers when given a sufficient amount of data to work with.
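
As a rough illustration, such detection typically looks something like the session below. Note that chardet is a third party package rather than part of the standard library, and the exact dictionary contents and confidence values vary with the chardet version and with how much data it is given - short samples like this one can even be misidentified entirely.

>>> import chardet
>>> sample = "これは日本語のテキストです。".encode("shift_jis")
>>> chardet.detect(sample)
{'encoding': 'SHIFT_JIS', 'confidence': 0.99}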

Local operating system interfaces, however, are a different story. Not only do they not inherently convey encoding information, but the nature of the problem is such that trying to use encoding detection isn't practical. Two key systems arose in an attempt to deal with this problem:

  • Windows code pages
  • POSIX locale encodings

With both of these systems, a program would pick a code page or locale, and use the corresponding text encoding to decide how to interpret text for display to the user or combination with other text. This may include deciding how to display information about the contents of the computer itself (like listing the files in a directory).
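
For a concrete sense of what that selection looks like from inside a program, modern Python exposes the chosen encodings directly. The answers below are from a Linux system configured to use UTF-8; a Windows system, or a differently configured locale, will report something else:

>>> import locale, sys
>>> locale.getpreferredencoding()    # derived from the POSIX locale or Windows code page
'UTF-8'
>>> sys.getfilesystemencoding()      # the encoding assumed for file names and other OS data
'utf-8'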

The fundamental premise of these two systems is that the computer only needs to speak the language of its immediate users. So, while the computer is theoretically capable of communicating in any language, it can effectively only communicate with humans in one language at a time. All of the data a given application was working with would need to be in a consistent encoding, or the result would be uninterpretable nonsense, something the Japanese (and eventually everyone else) came to call mojibake.

It isn't a coincidence that the name for this concept came from an Asian country: the encoding problems encountered there make the issues encountered with European and Cyrillic languages look trivial by comparison.

Unfortunately, this "bilingual computing" approach (so called because the computer could generally handle English in addition to the local language) causes some serious problems once you consider communicating between computers. While some of those problems were specific to network protocols, there are some more serious ones that arise when dealing with nominally "local" interfaces:

  • networked computing meant one username might be used across multiple systems, including different operating systems
  • network drives allow a single file server to be accessed from multiple clients, including different operating systems
  • portable media (like DVDs and USB keys) allow the same filesystem to be accessed from multiple devices at different points in time
  • data synchronisation services like Dropbox need to faithfully replicate a filesystem hierarchy not only across different desktop environments, but also to mobile devices

For interfaces like these, which were originally designed only for local interoperability, communicating encoding information is generally difficult, and the encoding of the data you encounter doesn't necessarily match the claimed encoding of the platform you're running on.

Unicode and the rise of multilingual computing

The path to addressing the fundamental limitations of bilingual computing actually started more than 25 years ago, back in the late 1980's. An initial draft proposal for a 16-bit "universal encoding" was released in 1988, the Unicode Consortium was formed in early 1991 and the first volume of the first version of Unicode was published later that same year.

Microsoft added new text handling and operating system APIs to Windows based on the 16-bit C level wchar_t type, and Sun also adopted Unicode as part of the core design of Java's approach to handling text.

However, there was a problem. The original Unicode design had decided that "16 bits ought to be enough for anybody" by restricting its target to only modern scripts, and only frequently used characters within those scripts. When you look at the "rarely used" Kanji and Han characters for Japanese and Chinese, though, you find that they include many characters that are regularly used for the names of people and places - they're just largely restricted to proper nouns, and so won't show up in a normal vocabulary search. So Unicode 2.0 was defined in 1996, expanding the system out to a maximum of 21 bits per code point (using up to 32 bits per code point for storage).

As a result, Windows (including the CLR) and Java now use the little-endian variant of UTF-16 to allow their text APIs to handle arbitrary Unicode code points. The original 16-bit code space is now referred to as the Basic Multilingual Plane.
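
To make the difference concrete, here's a quick check from a modern Python 3 prompt - the two characters below just happen to sit on either side of the 16-bit boundary:

>>> hex(ord("€"))       # fits in the original 16-bit code space (the BMP)
'0x20ac'
>>> hex(ord("😀"))      # doesn't fit - it needs the expanded code point range
'0x1f600'
>>> hex(0x10FFFF)       # the highest code point the expanded design permits
'0x10ffff'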

While all that was going on, the POSIX world ended up adopting a different strategy for migrating to full Unicode support: attempting to standardise on the ASCII compatible UTF-8 text encoding.

The choice between using UTF-8 and UTF-16-LE as the preferred local text encoding involves some complicated trade-offs, and that's reflected in the fact that they have ended up being at the heart of two competing approaches to multilingual computing.

Choosing UTF-8 aims to treat formatting text for communication with the user as "just a display issue". It's a low impact design that will "just work" for a lot of software, but it comes at a price:

  • because encoding consistency checks are mostly avoided, data in different encodings may be freely concatenated and passed on to other applications. Such data is typically not usable by the receiving application.
  • for interfaces without encoding information available, it is often necessary to assume an appropriate encoding in order to display information to the user, or to transform it to a different encoding for communication with another system that may not share the local system's encoding assumptions. These assumptions may not be correct, but won't necessarily cause an error - the data may just be silently misinterpreted as something other than what was originally intended.
  • because data is generally decoded far from where it was introduced, it can be difficult to discover the origin of encoding errors.
  • as a variable width encoding, it is more difficult to develop efficient string manipulation algorithms for UTF-8. Algorithms originally designed for fixed width encodings will no longer work.
  • as a specific instance of the previous point, it isn't possible to split UTF-8 encoded text at arbitrary locations. Care needs to be taken to ensure splits only occur at code point boundaries.

UTF-16-LE shares the last two problems, but to a lesser degree (simply due to the fact that the most commonly used code points are in the 16-bit Basic Multilingual Plane). However, because it isn't generally suitable for use in network protocols and file formats (without significant additional encoding markers), the explicit decoding and encoding required encourages designs with a clear separation between binary data (including encoded text) and decoded text data.
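
The point about splitting UTF-8 at arbitrary locations is easy to demonstrate at a Python 3 prompt: slicing the encoded data in the middle of a multi-byte sequence produces bytes that no longer decode cleanly:

>>> data = "café".encode("utf-8")
>>> data                          # the é takes two bytes
b'caf\xc3\xa9'
>>> data[:4].decode("utf-8")      # the split lands between those two bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpected end of data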

Through the lens of Python

Python and Unicode were born on opposite sides of the Atlantic Ocean at roughly the same time (1991). The growing adoption of Unicode within the computing industry has had a profound impact on the evolution of the language.

Python 1.x was purely a product of the bilingual computing era - it had no support for Unicode based text handling at all, and was hence largely limited to 8-bit ASCII compatible encodings for text processing.

Python 2.x was still primarily a product of the bilingual era, but added multilingual support as an optional addon, in the form of the unicode type and support for a wide variety of text encodings. PEP 100 goes into the many technical details that needed to be covered in order to incorporate that feature. With Python 2, you can make multilingual programming work, but it requires an active decision on the part of the application developer, or at least that they follow the guidelines of a framework that handles the problem on their behalf.
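
As a rough sketch of what that opt-in looks like in practice, Python 2 leaves it to the developer (or their framework) to decode explicitly from bytes to the optional unicode type, and to encode again on the way back out:

>>> data = "caf\xc3\xa9"            # a Python 2 str: really just bytes (UTF-8 here)
>>> text = data.decode("utf-8")     # explicit opt-in to the optional unicode type
>>> text
u'caf\xe9'
>>> type(text)
<type 'unicode'>
>>> text.encode("utf-8") == data    # and explicit encoding on the way back out
True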

By contrast, Python 3.x is designed to be a native denizen of the multilingual computing world. Support for multiple languages extends as far as the variable naming system, such that languages other than English become almost as well supported as English already was in Python 2. While the English inspired keywords and the English naming in the standard library and on the Python Package Index mean that Python's "native" language and the preferred language for global collaboration will always be English, the new design allows a lot more flexibility when working with data in other languages.

Consider processing a data table where the headings are names of Japanese individuals, and we'd like to use collections.namedtuple to process each row. Python 2 simply can't handle this task:

>>> from collections import namedtuple
>>> People = namedtuple("People", u"陽斗 慶子 七海")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/collections.py", line 310, in namedtuple
    field_names = map(str, field_names)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Users need to either restrict themselves to dictionary style lookups rather than attribute access, or else use romanised versions of their names (Haruto, Keiko, Nanami for the example). However, the case of "Haruto" is an interesting one, as there are at least 3 different ways of writing that name in Kanji (陽斗, 陽翔, 大翔), but they are all romanised as the same string (Haruto). If you try to use romaji to handle a data set that contains more than one variant of that name, you're going to get spurious collisions.
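
A small hypothetical illustration of those collisions: if the romanised form is used as a lookup key, the second "Haruto" silently replaces the first, while the Kanji forms remain distinct:

>>> by_romaji = {"Haruto": "陽斗"}
>>> by_romaji["Haruto"] = "陽翔"     # silently overwrites the first Haruto
>>> by_romaji
{'Haruto': '陽翔'}
>>> len({"陽斗": 1, "陽翔": 2})      # the original names stay distinct
2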

Python 3 takes a very different perspective on this problem. It says it should just work, and it makes sure it does:

>>> from collections import namedtuple
>>> People = namedtuple("People", u"陽斗 慶子 七海")
>>> d = People(1, 2, 3)
>>> d.陽斗
1
>>> d.慶子
2
>>> d.七海
3

This change greatly expands the kinds of "data driven" use cases Python can support in areas where the ASCII based assumptions of Python 2 would cause serious problems.

Python 3 still needs to deal with improperly encoded data however, so it provides a mechanism for arbitrary binary data to be "smuggled" through text strings as otherwise-prohibited lone surrogate code points. This feature was added by PEP 383 and is managed through the surrogateescape error handler, which is used by default on most operating system interfaces. This recreates the old Python 2 behaviour of passing improperly encoded data through unchanged when dealing solely with local operating system interfaces, but complaining when such improperly encoded data is injected into another interface. The codec error handling system provides several tools to deal with these files, and we're looking at adding a few more relevant convenience functions for Python 3.5.
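
Here's a minimal sketch of the round trip surrogateescape provides, using a stray latin-1 byte in data that is nominally UTF-8:

>>> raw = b"caf\xe9"                                # latin-1 data, not valid UTF-8
>>> text = raw.decode("utf-8", errors="surrogateescape")
>>> text                                            # the bad byte is smuggled as a lone surrogate
'caf\udce9'
>>> text.encode("utf-8", errors="surrogateescape")  # and restored faithfully on the way back out
b'caf\xe9'
>>> text.encode("utf-8")                            # but injecting it elsewhere complains loudly
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed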

The underlying Unicode changes in Python 3 also made PEP 393 possible, which changed the way the CPython interpreter stores text internally. In Python 2, even pure ASCII Unicode strings would consume four bytes per code point on Linux systems. Using the "narrow build" option (as the Python 2 Windows builds from python.org do) reduced that to only two bytes per code point when operating within the Basic Multilingual Plane, but at the cost of potentially producing wrong answers when asked to operate on code points outside the Basic Multilingual Plane. By contrast, starting with Python 3.3, CPython now stores text internally using the smallest fixed width data unit possible. That is, latin-1 text uses 8 bits per code point, UCS-2 (Basic Multilingual Plane) text uses 16 bits per code point, and only text containing code points outside the Basic Multilingual Plane will expand to needing the full 32 bits per code point. This can not only significantly reduce the amount of memory needed for multilingual applications, but may also increase their speed (as reducing memory usage also reduces the time spent copying data around).
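
The effect is straightforward to observe with sys.getsizeof. Exact byte counts vary with the CPython version and platform, so the sketch below only checks the relative ordering:

>>> import sys
>>> ascii_text = "a" * 1000                # latin-1 range: roughly 1 byte per code point
>>> bmp_text = "\u3042" * 1000             # Basic Multilingual Plane: roughly 2 bytes each
>>> astral_text = "\U0001F600" * 1000      # outside the BMP: 4 bytes each
>>> sys.getsizeof(ascii_text) < sys.getsizeof(bmp_text) < sys.getsizeof(astral_text)
True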

Are we there yet?

In a word, no. Not for Python 3.4, and not for the computing industry at large. We're much closer than we ever have been before, though. Most POSIX systems now use UTF-8 as their default encoding, and many systems offer a C.UTF-8 locale as an alternative to the traditional ASCII based C locale. When dealing solely with properly encoded data and metadata, and properly configured systems, Python 3 should "just work", even when exchanging data between different platforms.
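
On a system that provides both locales (and doesn't have LC_ALL set), the difference is visible directly from the shell. ANSI_X3.4-1968 is just glibc's name for ASCII; the exact names reported will vary between platforms:

$ LANG=C python3 -c "import locale; print(locale.getpreferredencoding())"
ANSI_X3.4-1968
$ LANG=C.UTF-8 python3 -c "import locale; print(locale.getpreferredencoding())"
UTF-8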

For Python 3, the remaining challenges fall into a few areas:

  • helping existing Python 2 users adopt the optional multilingual features that will prepare them for eventual migration to Python 3 (as well as reassuring those users that don't wish to migrate that Python 2 is still fully supported, and will remain so for at least the next several years, and potentially longer for customers of commercial redistributors)
  • adding back some features for working entirely in the binary domain that were removed in the original Python 3 transition due to an initial assessment that they were operations that only made sense on text data (PEP 461 summary: bytes.__mod__ is coming back in Python 3.5 as a valid binary domain operation, bytes.format stays gone as an operation that only makes sense when working with actual text data)
  • better handling of improperly decoded data, including poor encoding recommendations from the operating system (for example, Python 3.5 will be more sceptical when the operating system tells it the preferred encoding is ASCII, and will enable the surrogateescape error handler on sys.stdout when that happens)
  • eliminating most remaining usage of the legacy code page and locale encoding systems in the CPython interpreter (this most notably affects the Windows console interface and argument decoding on POSIX. While these aren't easy problems to solve, it will still hopefully be possible to address them for Python 3.5)

More broadly, each major platform has its own significant challenges to address:

  • for POSIX systems, there are still a lot of systems that don't use UTF-8 as the preferred encoding, and the assumption of ASCII as the preferred encoding in the default C locale is positively archaic. There is also a lot of POSIX software that still believes in the "text is just encoded bytes" assumption, and will happily produce mojibake that makes no sense to other applications or systems.
  • for Windows, keeping the old 8-bit APIs around was deemed necessary for backwards compatibility, but this also means that there is still a lot of Windows software that simply doesn't handle multilingual computing correctly.
  • for both Windows and the JVM, a fair amount of nominally multilingual software actually only works correctly with data in the Basic Multilingual Plane. This is a smaller problem than not supporting multilingual computing at all, but was quite a noticeable problem in Python 2's own Windows support.

Mac OS X is the platform most tightly controlled by any one entity (Apple), and they're actually in the best position out of all of the current major platforms when it comes to handling multilingual computing correctly. They've been one of the major drivers of Unicode since the beginning (two of the authors of the initial Unicode proposal were Apple engineers), and were able to force the necessary configuration changes on all their systems, rather than having to work with an extensive network of OEM partners (Windows, commercial Linux vendors) or relatively loose collaborations of individuals and organisations (community Linux distributions).

Modern mobile platforms are generally in a better position than desktop operating systems, mostly by virtue of being newer, and hence defined after Unicode was better understood. However, the UTF-8 vs UTF-16-LE distinction for text handling exists even there, thanks to the Java inspired Dalvik VM in Android (plus the cloud-backed nature of modern smartphones means you're even more likely to encounter files from multiple machines when working on a mobile device).

Why Python 4.0 won't be like Python 3.0

Newcomers to python-ideas occasionally make reference to the idea of "Python 4000" when proposing backwards incompatible changes that don't offer a clear migration path from currently legal Python 3 code. After all, we allowed that kind of change for Python 3.0, so why wouldn't we allow it for Python 4.0?

I've heard that question enough times now (including the more concerned phrasing "You made a big backwards compatibility break once, how do I know you won't do it again?"), that I figured I'd record my answer here, so I'd be able to refer people back to it in the future.

What are the current expectations for Python 4.0?

My current expectation is that Python 4.0 will merely be "the release that comes after Python 3.9". That's it. No profound changes to the language, no major backwards compatibility breaks - going from Python 3.9 to 4.0 should be as uneventful as going from Python 3.3 to 3.4 (or from 2.6 to 2.7). I even expect the stable Application Binary Interface (as first defined in PEP 384) to be preserved across the boundary.

At the current rate of language feature releases (roughly every 18 months), that means we would likely see Python 4.0 some time in 2023, rather than seeing Python 3.10.

Update: After this post was originally written back in 2014, subsequent discussions on the core python-dev mailing list led to the conclusion that the release after 3.9 will probably just be 3.10. However, a 4.0 will presumably still happen some day, and the premise of this article is expected to hold for that release: it will be held to the same backwards compatibility obligations as a Python 3.X to 3.X+1 update.

So how will Python continue to evolve?

First and foremost, nothing has changed about the Python Enhancement Proposal process - backwards compatible changes are still proposed all the time, with new modules (like asyncio) and language features (like yield from) being added to enhance the capabilities available to Python applications. As time goes by, Python 3 will continue to pull further ahead of Python 2 in terms of the capabilities it offers by default, even if Python 2 users have access to equivalent capabilities through third party modules or backports from Python 3.

Competing interpreter implementations and extensions will also continue to explore different ways of enhancing Python, including PyPy's exploration of JIT-compiler generation and software transactional memory, and the scientific and data analysis community's exploration of array oriented programming that takes full advantage of the vectorisation capabilities offered by modern CPUs and GPUs. Integration with other virtual machine runtimes (like the JVM and CLR) is also expected to improve with time, especially as the inroads Python is making in the education sector are likely to make it ever more popular as an embedded scripting language in larger applications running in those environments.

For backwards incompatible changes, PEP 387 provides a reasonable overview of the approach that was used for years in the Python 2 series, and still applies today: if a feature is identified as being excessively problematic, then it may be deprecated and eventually removed.

However, a number of other changes have been made to the development and release process that make it less likely that such deprecations will be needed within the Python 3 series:

  • the greater emphasis on the Python Package Index, as indicated by the collaboration between the CPython core development team and the Python Packaging Authority, as well as the bundling of the pip installer with Python 3.4+, reduces the pressure to add modules to the standard library before they're sufficiently stable to accommodate the relatively slow language update cycle
  • the "provisional API" concept (introduced in PEP 411) makes it possible to apply a "settling in" period to libraries and APIs that are judged likely to benefit from broader feedback before offering the standard backwards compatibility guarantees
  • a lot of accumulated legacy behaviour really was cleared out in the Python 3 transition, and the requirements for new additions to Python and the standard library are much stricter now than they were in the Python 1.x and Python 2.x days
  • the widespread development of "single source" Python 2/3 libraries and frameworks strongly encourages the use of "documented deprecation" in Python 3, even when features are replaced with newer, preferred, alternatives. In these cases, a deprecation notice is placed in the documentation, suggesting the approach that is preferred for new code, but no programmatic deprecation warning is added. This allows existing code, including code supporting both Python 2 and Python 3, to be left unchanged (at the expense of new users potentially having slightly more to learn when tasked with maintaining existing code bases).

From (mostly) English to all written languages

It's also worth noting that Python 3 wasn't expected to be as disruptive as it turned out to be. Of all the backwards incompatible changes in Python 3, many of the serious barriers to migration can be laid at the feet of one little bullet point in PEP 3100:

  • Make all strings be Unicode, and have a separate bytes() type. The new string type will be called 'str'.

PEP 3100 was the home for Python 3 changes that were considered sufficiently non-controversial that no separate PEP was considered necessary. The reason this particular change was considered non-controversial was because our experience with Python 2 had shown that the authors of web and GUI frameworks were right: dealing sensibly with Unicode as an application developer means ensuring all text data is converted from binary as close to the system boundary as possible, manipulated as text, and then converted back to binary for output purposes.

Unfortunately, Python 2 doesn't encourage developers to write programs that way - it blurs the boundaries between binary data and text extensively, and makes it difficult for developers to keep the two separate in their heads, let alone in their code. So web and GUI framework authors have to tell their Python 2 users "always use Unicode text. If you don't, you may suffer from obscure and hard to track down bugs when dealing with Unicode input".
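
Here's a minimal sketch of the kind of obscure failure being described: in Python 2, mixing a byte string that contains non-ASCII data with a unicode string triggers an implicit ASCII decode, and the resulting error shows up a long way from wherever the improperly handled data actually entered the program:

>>> data = "caf\xc3\xa9"          # a Python 2 str (bytes) holding UTF-8 encoded text
>>> data + u"!"                   # implicit coercion assumes ASCII and fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)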

Python 3 is different: it imposes a much greater separation between the "binary domain" and the "text domain", making it easier to write normal application code, while making it a bit harder to write code that works with system boundaries where the distinction between binary and text data can be substantially less clear. I've written in more detail elsewhere regarding what actually changed in the text model between Python 2 and Python 3.

This revolution in Python's Unicode support is taking place against a larger background migration of computational text manipulation from the English-only ASCII (officially defined in 1963), through the complexity of the "binary data + encoding declaration" model (including the C/POSIX locale and Windows code page systems introduced in the late 1980's) and the initial 16-bit only version of the Unicode standard (released in 1991) to the relatively comprehensive modern Unicode code point system (first defined in 1996, with new major updates released every few years).

Why mention this point? Because this switch to "Unicode by default" is the most disruptive of the backwards incompatible changes in Python 3 and unlike the others (which were more language specific), it is one small part of a much larger industry wide change in how text data is represented and manipulated. With the language specific issues cleared out by the Python 3 transition, a much higher barrier to entry for new language features compared to the early days of Python and no other industry wide migrations on the scale of switching from "binary data with an encoding" to Unicode for text modelling currently in progress, I can't see any kind of change coming up that would require a Python 3 style backwards compatibility break and parallel support period. Instead, I expect we'll be able to accommodate any future language evolution within the normal change management processes, and any proposal that can't be handled that way will just get rejected as imposing an unacceptably high cost on the community and the core development team.

Some Suggestions for Teaching Python

I recently had the chance to attend a Software Carpentry bootcamp at the University of Queensland (as a teaching assistant), as well as seeing a presentation from one of UQ's tutors at PyCon Australia 2014.

While many of the issues they encountered were inherent in the complexity of teaching programming, a few seemed like things that could be avoided.

Getting floating point results from integer division

In Python 2, integer division copies C in truncating the answer by default:

    $ python -c "print(3/4)"
    0

Promoting to floating point requires type coercion, a command line flag or a future import:

    $ python -c "print(float(3)/4)"
    0.75
    $ python -Qnew -c "print(3/4)"
    0.75
    $ python -c "from __future__ import division; print(3/4)"
    0.75

Python 3 just does the right thing by default, so one way to avoid the problem entirely is to teach Python 3 instead of Python 2:

    $ python3 -c "print(3/4)"
    0.75

(In both Python 2 and 3, the // operator explicitly requests floor division when that behaviour is actually wanted)
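
For the cases where floor division really is what's wanted, the shared // spelling behaves identically in both versions:

    $ python -c "print(3//4)"
    0
    $ python3 -c "print(3//4)"
    0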

Common Python 2/3 syntax for printing values

I've been using Python 2 and 3 in parallel for more than 8 years now (while Python 3.0 was released in 2008, the project started in earnest a couple of years earlier than that, while Python 2.5 was still in development).

One essential trick I have learned in order to make regularly switching back and forth feasible is to limit myself to the common print syntax that works the same in both versions: passing a single argument surrounded by parentheses.

$ python -c 'print("Hello world!")'
Hello world!
$ python3 -c 'print("Hello world!")'
Hello world!

If I need to pass multiple values, I'll use string formatting to build a single argument, rather than passing multiple arguments to print (which behaves differently in Python 2 and Python 3).

$ python -c 'print("{} {}{}".format("Hello", "world", "!"))'
Hello world!
$ python3 -c 'print("{} {}{}".format("Hello", "world", "!"))'
Hello world!

Unfortunately, the Software Carpentry material used at the bootcamp I attended instead relied extensively on the legacy Python 2 only print syntax, causing examples that would otherwise have worked fine in either version to fail for students who happened to be running Python 3. Adopting the shared syntax for printing values could be enough to make the course largely version independent.

Distinguishing between returning and printing values

One problem noted both at the bootcamp and by presenters at PyCon Australia was the challenge of teaching students the difference between printing and returning values. The problem is the "Print" part of the Read-Eval-Print-Loop provided by Python's interactive interpreter:

>>> def print_arg(x):
...     print(x)
...
>>> def return_arg(x):
...     return x
...
>>> print_arg(10)
10
>>> return_arg(10)
10

There's no obvious difference in output at the interactive prompt, especially for types like numbers where the results of str and repr are the same. Even when they're different, those differences may not be obvious to a student:

>>> print_arg("Hello world")
Hello world
>>> return_arg("Hello world")
'Hello world'

While I don't have a definitive answer for this one, an experiment that seems worth trying to me is to teach students how to replace sys.displayhook. In particular, I suggest demonstrating the following change, and seeing if it helps explain the difference between printing output for display to the user and returning values for further processing:

>>> def new_displayhook(obj):
...     if obj is not None:
...         print("-> {!r}".format(obj))
...
>>> import sys
>>> sys.displayhook = new_displayhook
>>> print_arg(10)
10
>>> return_arg(10)
-> 10

Understanding the difference between printing and returning is essential to learning to use functions effectively, and tweaking the display of results this way may help make the difference more obvious.

Addendum: IPython (including IPython Notebook)

The initial examples above focused on the standard CPython runtime, including the default interactive interpreter. The IPython interactive interpreter, including the IPython Notebook, has a couple of interesting differences in behaviour that are relevant to the above comments.

Firstly, it does display return values and printed values differently, prefacing results with an output reference number:

In [1]: print 10
10

In [2]: 10
Out[2]: 10

Secondly, it has an optional "autocall" feature that allows a user to tell IPython to automatically add the missing parentheses to a function call if the user leaves them out:

$ ipython3 --autocall=1 -c "print 10"
-> print(10)
10

This is a general purpose feature that allows users to make their IPython sessions behave more like languages that don't have first class functions (most notably, IPython's autocall feature closely resembles MATLAB's "command syntax" notation for calling functions).

It also has the side effect that users that use IPython, have autocall enabled, and don't use any of the more esoteric quirks of the Python 2 print statement (like stream redirection or suppressing the trailing newline) may not even notice that print became an ordinary builtin in Python 3.