Correcting ignorance: learning a bit about Ruby blocks

Gary Bernhardt pointed out at PyCodeConf that I didn't know Ruby even half as well as I should if I wanted to really understand why Ruby programmers rave about blocks so much (I started this before his talk, but it touches on his key point about the centrality of blocks to Ruby's design, and Python's lack of a similarly endemic model for code interleaving). So I set about trying to fix that (at least, to the extent I can in 24 hours or so). Unsurprisingly (since I'm not interested in becoming a Ruby programmer at this point in time), I approached this task more in terms of what it could teach me about Python (and its limitations) rather than in figuring out the full ins and outs of idiomatic Ruby. So feel free to bring it up in the comments if you think I've fundamentally mischaracterised some aspect of Ruby here.

The first distinction: two kinds of function

It turns out the first distinction shows up at quite a fundamental level. Ruby has two kinds of function: named methods and anonymous procedures. The semantics of these are quite different, most notably that named methods create their own local namespace, while anonymous procedures just use the namespace of the method that created them (so they're almost like ordinary local code).

Python also has two kinds of function: ordinary functions and generator functions. The name binding semantics are identical, but the invocation style and semantics are very different. Lambda expressions and generator expressions provide syntax for defining these inside an expression, but under the hood the semantics are still the same as those of the statement versions.
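A quick illustration of the difference in invocation semantics (the function names are just for the example):

    def squares_list(n):
        return [i * i for i in range(n)]   # ordinary function: runs to completion when called

    def squares_gen(n):
        for i in range(n):
            yield i * i                    # generator function: calling it just creates a suspended generator

    gen = squares_gen(3)   # no code in the body has run yet
    print(next(gen))       # execution proceeds to the first yield: prints 0
    print(list(gen))       # resumes and runs to exhaustion: prints [1, 4]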

The closest you can get to a Ruby style anonymous procedure in Python is to create a named inner function and declare every otherwise local variable explicitly 'nonlocal' (in Python 3 - nonlocal declarations aren't available in Python 2). Then all name binding operations in the inner scope would also affect those names in the outer scope.
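A minimal sketch of that approach (Python 3 only, given the nonlocal declaration):

    def tally(items):
        total = 0
        def add(item):
            # Without this declaration, 'total += item' would create a new local
            # variable; 'nonlocal' makes the rebinding affect the enclosing scope,
            # which is as close as Python gets to a Ruby-style anonymous procedure
            nonlocal total
            total += item
        for item in items:
            add(item)
        return total

    print(tally([1, 2, 3]))   # 6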

Actually, make that three kinds of function

The named method vs anonymous procedure distinction actually doesn't fully capture Ruby's semantics. Blocks (which is what I was most interested in learning about) add a new set of semantics that don't apply to the full object versions: they not only use the namespace of the defining method for their local variables, but their parameters are pass-by-reference (so they can rebind names in the calling namespace) and their control flow can affect the calling method (i.e. a return from a block will cause the calling method to return, not just the block itself). While somewhat interesting, I don't think these are actually all that significant - the core semantic difference is the one between Ruby's anonymous closures and Python's generators, not the dynamic binding behaviour of blocks.

The implications: blocks versus coroutines

This initial difference in the object model for code execution has created a fundamental difference in the way the two languages approach the problem of interleaving distinct pieces of code. The Ruby way is to define a separate piece of the current function that can be passed to other code and invoked as if it were still inline in its original location, with execution resuming once the called operation is complete. The Python way is to suspend execution, hand control back to the invoking piece of code, and then resume execution of the current code block at a later time (as determined by the invoking code).

Hence, where Ruby has specific syntactic sugar for passing a block of code to another method (do-end), Python instead has syntactic sugar for various invocation styles for coroutines (iteration via for loops, transactional code via with statements).

It's also the case that coroutines are not (yet) as deeply bound into Python's semantics as blocks are into Ruby's. Whereas Ruby had blocks from the beginning and defined key programming constructs in terms of them (such as iteration and transactional style code), Python is instead built around various task specific protocols that may *optionally* be implemented in terms of coroutines (e.g. for loops and the iterator protocol, which generators plug into, or the with statement and the context manager protocol, which a generator can implement via the contextlib.contextmanager decorator).
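For example, a with statement body can be interleaved with setup and cleanup code via a generator and contextlib.contextmanager (the connection API here is invented purely for the sake of the sketch):

    from contextlib import contextmanager

    @contextmanager
    def transaction(conn):
        # Runs up to the yield when the with statement starts, resumes afterwards
        conn.begin()          # hypothetical connection API
        try:
            yield conn        # the body of the with statement executes here
        except Exception:
            conn.rollback()
            raise
        else:
            conn.commit()

    # Usage (assuming some 'conn' object with begin/commit/rollback methods):
    # with transaction(conn) as c:
    #     c.execute("...")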

Callback programming and hidden control flow

One interesting outgrowth of the Ruby approach is that callback programming actually becomes a fairly natural extension of the way the language works - since programming with blocks is callback style programming, the invoking code doesn't really care if the called method runs the passed in block immediately or at some later time. Whether you consider this a good thing or a bad thing is going to depend on how you feel about the merits and dangers of hidden control flow.

During the discussions that led to the introduction of the with statement in Python 2.5, Guido made a clear, conscious design decision: he wanted the possible flows of control through the function body to be visible locally inside a function, without being dependent on the definitions of other methods (raising exceptions, of course, being an exception - catching them, though, largely obeys this guideline). Most code is run immediately, code in if statements and exception handlers is run zero or one times, code in loops is run zero or more times, code in nested function definitions is executed at some later time when the function is called. The Ruby blocks design is the antithesis of this: your control flow is entirely dependent on the methods you call. The downside of wanting visible control flow, of course, is that iteration, transactional code and callback programming all end up looking different at the point of invocation. (If you read PEP 340, Guido's original proposal for what eventually became the with statement, and contrast it with PEP 343, the version that we finally implemented, you'll see that his original idea was a fair bit closer to Ruby's blocks in power and scope).

So Ruby's flexibility comes at a price: when you pass a block to a method, you need to know what that method does in order to know how it affects your local control flow. Naming conventions can help reduce that complexity (such as the .each convention for iteration), but it does move control flow into the domain of programming conventions rather than the language definition.

On the other hand, Python's choice of explicit control flow comes at a price in flexibility: callback programming looks starkly different to ordinary programming as you have to construct explicit closures in order to pass chunks of code around.

Two way data flow

With their functional API, blocks natively supported two-way data flow from the beginning: data was passed in by calling them, and results were passed back either as the value of the block or by manipulating the passed in name bindings.

By contrast, Python's generators were originally output only, reflecting their target use case of iteration. You could input some initial data via parameters, but couldn't readily supply data to a running calculation. This has started to change in recent years, as generators now provide send() and throw() methods to pass data back in, and yield became an expression in order to provide access to the 'send()' argument. However, these features do not, at this stage, have deep syntactic support - there's a fairly obvious mapping from continue to send() and break to throw() that would tie them into the for loop syntax, but this capability has not garnered significant support when it has been brought up (I believe because it doesn't really help with the last major code execution model that Python doesn't provide nice native support for: callback programming).
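For completeness, here's what the yield expression and send() look like in practice:

    def running_average():
        total = 0.0
        count = 0
        average = None
        while True:
            value = yield average   # 'yield' as an expression receives the sent value
            total += value
            count += 1
            average = total / count

    avg = running_average()
    next(avg)             # advance to the first yield ("priming" the coroutine)
    print(avg.send(10))   # 10.0
    print(avg.send(20))   # 15.0
    avg.close()           # uses the same machinery as throw(), raising GeneratorExit inside the generator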

In Python 3.3, generators will gain the ability to return values, and better syntax for invoking them and getting that value, moving the language even further towards full coroutine support (see PEP 380 for details). However, that is merely the next step along the path rather than arrival at the destination.

Reinventing blocks

I think the folks who accuse us of (slowly) reinventing blocks have a valid point - Python really is on the path of devising ways to handle tasks neatly with coroutines (i.e. functions that can be suspended and transparently resumed later without losing any internal state) that Ruby handles via blocks (i.e. extracting arbitrary fragments of a function body and passing them to other code). The fact that generators were not built into Python from the outset but instead have been added later to make certain kinds of code easier to write does show through in a variety of ways - coroutine based code often doesn't play nicely with ordinary imperative code and vice-versa.

Ruby's way has a definite elegance to it (despite the hidden control flow). I think aspiring to that kind of elegance for callback programming in Python would be a good thing, even if the semantic model is completely different (i.e. coroutine based rather than block based). The addition of actual block functionality remains unlikely, however - if they were as powerful as Ruby blocks, then it would create two ways to do too many things (with no obvious criteria to choose between the current technique and the block based technique), but if they were strictly less powerful, then reusing Ruby's block terminology would likely be confusing rather than enlightening. For better or for worse, Python is now well down the path of coroutine based programming and we likely need to see how far we can take that model rather than trying to shoehorn in yet another approach.

Spinning up the pulpdist project

One novel aspect of the pulpdist project is that it is starting with an almost completely blank slate from a technology point of view (aside from the decision to use Pulp as the main component of the mirroring network). Red Hat does have development standards for internal projects, of course (especially in the messaging space), but they're fairly flexible, leaving the individual tool development teams with a lot of options. If something ships with Fedora and/or RHEL, or is available under licensing terms that would be acceptable for inclusion in Fedora (and subsequently RHEL), then it's fair game.

This post focuses on the design of the management server. I'll write up a separate post looking at the currently planned design for the Pulp data transfer plugins.

Source Control

Unsurprisingly, Red Hat's internal processes are heavily influenced by Linux kernel processes. Accordingly, the source control tool of choice for new projects is Git. While I have a slight preference for Mercurial (due mainly to familiarity), I'm happy enough with any DVCS, so Git it is.

Primary Development Language

Python, of course. You don't hire a CPython core developer to get them to work on a Ruby or Perl project (although the current system I'm replacing was written in Perl). As a web application, there will naturally be some Javascript and CSS involved as well.

Web Framework

The main management application for pulpdist is going to be a full-scale web application. User profiles and authentication, database storage, communication with other web services, provision of a REST API, integration with the engineering tools messaging bus. Basically, micro-frameworks need not apply.

While I expect Pyramid/Pylons would also have been able to do the job, I decided to go with Django 1.3. This was heavily influenced by social factors: I know a lot of Django devs that I can bug for advice, but the same is not true for Pyramid. The complexity of the whole Pyramid/Pylons/TurboGears setup is also not appealing - while veteran web developers may find the "you decide" approach a selling point, Django's batteries included approach makes it far simpler to get started quickly, and decide as I go along which pieces I should keep, discard or replace.

I've heard some experienced Django developers muttering complaints about the class based views design in 1.3, but to someone coming in as an experienced Python developer and a relatively noobish web developer, the CBV approach seems eminently sensible, while the old function based approach looks repetitive and insane. Object oriented programming was invented for a reason!

I'll admit that my perception may be biased by knowing exactly how to make multiple inheritance work the way I want it to, though :)

Web Server

The management server doesn't actually have that much work to do, so the basic Apache+mod_wsgi configuration will serve as an adequate starting point (any heavy lifting will be done by the individual Pulp instances, and the main data traffic on those doesn't run through their web service). WSGI provides the flexibility to revisit this later if needed.

I've also punted on any web caching questions for now - the management server is low traffic and once the access to the Pulp sites is pushed out to a backend service, it should be fast enough at least for the early iterations.

Authentication & Authorisation

The actual user authentication task will be handed off to Apache and all management application access will be restricted to Kerberos authenticated users over SSL. Django's own permissions systems will be used to handle authorisation restrictions. (The experimental prototype will use Basic Auth instead, since it is the Apache/Django integration the prototype needs to cover, not the Apache configuration for SSL and Kerberos authentication)

Integration with Pulp's user access controls is via OAuth, but the design for configuration of user permissions in the Pulp servers is still TBD.

Database and ORM

Again, the management server isn't doing the heavy lifting in this application. The Pulp instances use MongoDB, but for the management server I currently plan to use the standard Django ORM backed by PostgreSQL. For the prototype instance, the database is actually just an SQLite3 file. I'm not quite sold on this one as yet - it's tempting to start playing with SQLAlchemy, since I've already had to hack around some of the limitations in the native ORM in order to store encrypted fields. OTOH, I already have a ton of things to do on this project, so messing with this is a long way down the priority list.

Schema and data maintenance is handled using South.

HTML Templating

The standard Django templating engine should be sufficient for my needs. As with the ORM, it's tempting to look into upgrading it to something like Jinja2, but once again 'good enough' is likely to be the deciding factor.

For data table display, I'm using Django Tables 2 and form display will use Django Uni-Form.

REST API

The REST API for the service is currently there primarily as a development aid - it lets me publish the full data model to the web as soon as it stabilises (and even while it's still in flux), even if the UI for end users hasn't been fully defined. This is particularly useful for the metadata coming back from the Pulp server, since it doesn't need much post-processing to be included as raw data in the management server's own REST API. The JSON interface will also allow much of the backend processing to be fully exercised by the test suite without worrying about web UI details.

The design of the REST API was heavily influenced by this Lessons Learned piece from the RHEV-M developers. The Django Rest Framework means I can just define the data I want to display as a list or dictionary and the framework takes care of formatting it nicely, including rendering URLs as hyperlinks.

AMQP Messaging

I haven't actually started on this aspect in any significant way, but the two main contenders I've identified are python-qpid (which is what Pulp uses) and django-celery (which would also give me an internal task queue engine, which the management server is going to need - the prototype just does everything in the Django process, which is OK for experimentation on the LAN, but clearly inadequate long term when talking to multiple sites distributed around the planet). At this early stage, I expect the internal task management aspect is going to tip the decision in favour of the latter.

Testing Regime

As the foundation for the automated testing, I'm going with Django Sane Testing (mainly based on the example of other internal Django projects). Michael Foord's mock module lets me run at least some of the tests without relying on an external Pulp instance (fortunately, the namespace conflict with Fedora's RPM building utility 'mock' was recently resolved with the latter's support library being renamed to 'mockbuild').

Continuous integration is an open question at this point. Pulp uses Jenkins for CI and I'm inclined to follow their lead. The other main possibility is to use Beaker, Red Hat's internal test system originally set up for kernel testing (one key attraction Beaker offers is the ability to set up multi-server multi-site testing in a test recipe so I can run tests over the internal WAN).

Packaging

Tito is a tool for generating SRPMs and RPMs directly from a Git repository. For my own packages, this is the approach I'm using (with handcrafted spec files). For some strange reason, the sysadmins around here like it when internal devs provide things as pre-packaged RPMs for deployment :)

Packaging of upstream PyPI dependencies that aren't available as Fedora or RHEL packages is still a work in progress. I experimented with Tito and git submodules (which doesn't work) and git subtrees (which does work, but is seriously ugly). My next attempt is likely to be based on py2pack, so we'll see how that goes (I actually discovered that project by searching for 'cpanspec pypi' after hearing some of the Perl folks here extolling the virtues of cpanspec for easily packaging CPAN modules as RPMs).

I also need to switch to using virtualenv to get a clearer distinction between Fedora packages I added via yum install and stuff I picked up directly from PyPI with pip.


Mirror All The Things!

After describing the project I'm working on to a few people at PyConAU and BrisPy, I decided it might be a good idea to blog about it here. I do have a bit of an ulterior motive in doing so, though - I hope people will point out when I've missed useful external resources or applications, or when something I'm planning to do doesn't make sense to the assorted Django developers I know. Yes, that's right - I'd like to make being wrong on the internet work in my favour :)

The project is purely internal at this stage, but I hope to be able to publish it as open source somewhere down the line. Even being able to post these design concepts is pretty huge for me personally, though - before starting with Red Hat a few months ago, I spent the previous 12 and a half years working in the defence industry, which is about as far from Red Hat's "Default to Open" philosophy as it's possible to get.

Mirror, Mirror, On The Wall

The project Red Hat hired me to implement is the next generation of their internal mirroring system, which is used for various tasks, such as getting built versions of RHEL out to the hardware compatibility testing labs (and, when they're large enough, returning the generated log files to the relevant development sites), or providing internal Fedora mirrors at the larger Red Hat offices (such as the one here in Brisbane).

There are various use cases and constraints that mean the mirroring system needs to operate at the filesystem level without making significant assumptions about the contents of the trees being mirrored (due to various details of the use cases involved, block level replication and approaches that rely on the transferred data being laid out in specific ways aren't viable alternatives for this project). The current incarnation of this system relies almost entirely on that venerable workhorse of the mirroring world, rsync.

However, the current system is also showing its age and has a few limitations that make it fairly awkward to work with. Notably, there's no one place to go to get an overview of the entire internal mirroring setup, and the direct use of rsync means it isn't particularly friendly with other applications when it comes to sharing WAN bandwidth, while the servers involved waste quite a few cycles recalculating the same deltas for multiple clients. Hence the project I'm working on: it's intended to replace the existing system with something a bit more efficient and easier to manage, while also providing a better platform for adding new features.

Enter Pulp

Pulp is an open source (Python) project created by Red Hat to make it easier to manage private yum repositories. Via Katello, Pulp is one of the upstream components for Red Hat's CloudForms product.

The Pulp project is currently in the process of migrating from their original yum-specific architecture to a more general purpose Generic Content plugin architecture. It's that planned plugin architecture that makes Pulp a useful basis for the next generation internal mirroring system, which, at least for now, I am imaginatively calling pulpdist (referring to both "distribution with Pulp", since that's what the system does, and "distributed Pulp instances", since that's how the system will work).

The main components of the initial pulpdist architecture will be:
  • a front-end (Django 1.3) web app providing centralised management of the entire distribution network
  • custom importer and distributor plugins for Pulp to handle distribution of tree changes within the distribution network
  • custom importer plugins to handle the import of trees from their original sources and generation of any additional metadata needed by the internal distribution plugins
  • generic (and custom, if needed) plugins to make the trees available to the applications that need them

I'll be writing more on various details that I consider interesting as I go along. Initially, that will include my plan for the mirroring protocol to be used between the sites, as well as various decisions that need to be made when spinning up a Django project from scratch (while many of my specific answers are shaped by the target environment for internal deployment, the questions I needed to consider should be fairly widely applicable).

Open Source, Windows and Teaching Python to New Developers

A few questions and incidents recently prompted me to reflect on why I don't help with CPython support on Windows, even though I use Windows happily enough on my gaming system. Since this ended up being a rather pro-Linux article and upfront disclosure is a good thing, I'll note that while I do work for Red Hat now, that's a very recent thing - my adoption of Linux as my preferred development platform dates back to 2004 or so. I work for Red Hat because I like Linux, not the other way around :)

The Availability of Professional Development Tools

I don't make a secret of my dislike of Windows as a hobbyist development platform. While Microsoft have improved things in recent years (primarily by releasing the Express editions of Visual Studio), there's still a huge difference between an operating system like GNU/Linux, which was built by developers for developers based on a foundation that was built by academics for academics, and Windows, which was built by a company that used deals with computer manufacturers to get it into end users' hands regardless of technical merit. Developers were forced to follow in order to reach that large installed user base. Those different histories are reflected in the different development cultures that surround the respective platforms.

To get the same tool chain that professional Linux companies use, you don't need to do anything special - Linux distributions include the tools used to create them. If you have a distribution, you have everything you need to build applications for that distribution, including documentation. With the open source nature of the platform and almost all of the software (the occasional binary driver notwithstanding), there's a vast range of tools out there to help you get things done (although sorting through the mass can be a little tricky sometimes, since it can be hard to tell the difference between stuff that doesn't exist and stuff that exists, but hasn't been uncovered by your research).

As far as I'm aware, Mac OS X isn't quite as generous with freely available development utilities, but isn't all that far off the Linux approach (I'm not a Mac user or developer though, so there may be more hurdles than I am aware of - I recall some muttering about Apple beginning to charge a small fee for XCode. My opinion is based mostly on the fact that it seems pretty easy to find open source devs that use Macs). With the POSIX-ish underpinnings, many of the utilities from the *nix world also work in this environment.

The minimum realistic standard for professional Windows development, though, is an MSDN subscription (to get full access to the OS documentation and various utilities), along with a professional copy of Visual Studio. The tools available for free (including the Express editions of Visual Studio) are clearly second rate. Even when the tools themselves are OK, the licensing restrictions on the applications they create may make them practically useless (and MS have the gall to call the GPL viral - at least the gcc team don't restrict how you license and distribute the binaries it creates). So why should a hobbyist develop for a system that thinks they should pay substantial sums for the privilege of developing for it, instead of one that welcomes all contributors, providing not only the end product, but the ingredients and recipes all for free?

At the recent PyConAU sprints, one of the contributors (an existing Linux user that happened to have a Windows only laptop with them) became frustrated with getting all the necessary tools set up to work properly on Windows (configuring git+ssh for read/write access to a GitHub repo was one key point of irritation), and decided to dual boot Ubuntu on the machine instead. Twenty minutes later, she was up and running and hacking on the project she originally wanted to hack on. Granted, she already knew how to use Linux, but seriously, there's something fundamentally wrong with a platform when installing and dual-booting to a different OS is the easiest way to get a decent development environment up and running.

All that ends up putting cross-platform languages like Python in an interesting position: when developing with Python, you can often get away with not understanding the underlying details of your operating system, because the language runtime tries to provide a largely standardised interface on all platforms. However, many open source developers either don't use Windows at all, or genuinely dislike programming for it, so the burden of making things work properly on Windows falls on the shoulders of a comparatively small number of people, either those who genuinely like programming for the platform (yes, such people exist, I'm just not one of them), or those that are looking for any niche where they can usefully contribute and are happy enough to take on the task of improving Windows compatibility and support.

I don't have particularly hard numbers to back this up (other than the skew in core developer numbers vs overall OS popularity), but my intuition is that, at least for CPython, the user:core developer ratio is orders of magnitude higher for Windows than it is for Linux or Mac OS X.

The Implications for Teaching Python on Windows

Something cool that is going on at the moment is that a lot of folks are interested in the idea of teaching more people how to program with Python as the language used. However, the potential students (young and old) that they want to teach often don't have any development experience at all and are using the most common consumer operating system (i.e. Windows). So good Windows support and an easy installation experience are important considerations for these instructors. A request that is frequently made (with varying levels of respect and politeness) is that the official python.org Windows installer be updated to automatically adjust the PATH (or at least provide the option to do so), so that Python can be launched from the command line by typing "python" instead of something like "C:\Python27\python".

If educators want that right now, their best bet is actually to direct their students towards the Windows versions of ActiveState's ActivePython Community Edition. ActiveState add a few things to the standard installer, like PATH manipulation and additional packages (such as pywin32). They also bundle PyPM, which is a decent tool for getting PyPI packages on to Windows machines (at least, I've heard good things about it - I haven't actually used it myself). (That said, I believe I may need to caveat that recommendation a bit: as near as I can tell from their website, PyPM has been deliberately disabled for their 64-bit Windows Community Edition installer. Still, even in that case, you can easily grab additional packages direct from PyPI via "pip install" on the command line)

Brian Curtin is working on adding optional PATH manipulation to the python.org installer for 3.3, and there's a chance such a change might be backported to the next maintenance releases for 3.2 and 2.7 (no promises, though). Even if it does make it in, it will still be a while before the change is part of a binary release (especially given that Brian has only just started tinkering with it).

This is clearly a nice thing for beginners, especially those that aren't in the habit of tinkering with their OS settings, but I do honestly wonder how much of a difference it will make in the long run. In many ways, software development is one long exercise in frustration. You decide you want to fix bug X. But it turns out bug X is really due to bug Y. You could work around Y just to fix X, but the bigger bug would still be there. But then you discover that fixing bug Y properly requires feature Z, which doesn't exist yet, so a workaround (even an ugly one) starts to sound pretty attractive. "Yak shaving" (the highly technical term for things that you're working on solely because they're a prerequisite for what you actually want to be working on) is so common it's almost the norm rather than the exception. The many and varied frustrations of trying to use Windows as a hobbyist open source developer also won't magically go away just because the python.org installer starts automating one environment variable update - as soon as people are introduced to sites like GitHub and BitBucket, they'll get to discover the joy that is SSH and source control on Windows. If they get past that hurdle, they'll likely start to encounter the multitude of open source projects that don't even offer Windows installers (if the project supports Windows at all), because their Windows developer count stands at a grand total of zero.

Final Thoughts

I hope the people teaching Python to beginners on Windows and the folks working on improving Windows support don't take this article as an attack on their efforts. I find both goals to be quite admirable, and wish those involved all the success they can find. But there are reasons I abandoned Windows as a personal development platform ~7 years ago and taught myself to use Linux instead. As far as I can tell, most of those reasons remain valid today, even after Microsoft started releasing the Express versions of Visual Studio in an attempt to stem the flood of hobbyist developers jumping ship.

The other day I called the relative lack of Windows developers in open source a vicious cycle and I stand by that. If someone can learn to program, mastering Linux is going to be comparatively easy. For anyone seriously interested in open source development, using Linux (even in a virtual machine, the way I do on my gaming laptop) is by far the path of least resistance. Getting more Windows developers in open source requires that people care sufficiently about Windows as a platform that they don't just switch to Linux, but care about open source enough to start contributing at all, and that seems to be a genuinely rare combination.

Scripting languages and suitable complexity

Steven Lott is a Python developer and blogger that I first came across via his prolific contributions to answering questions on Stack Overflow, and then later by reading his blog posts that appeared on Planet Python. His Building Skills in Python book is a resource I've recently started suggesting newcomers to Python take a look at to see if his style works for them.

Some time ago, he posted about what he called the "Curse of Procedural Design": the fact that, beyond a certain point, purely procedural code typically starts drowning in complexity and becomes an unmaintainable mess. Based on that, he then started questioning whether or not he was doing the right thing by teaching the procedural aspects of Python first and leaving the introduction of object oriented concepts until later in the book.

Personally, I think starting with procedural programming is still the right thing to do. However, one of the key points of object-oriented design is that purely procedural programming doesn't scale well. Even in a purely procedural language, large programs almost always define essential data structures, and functions that work on those data structures, effectively writing object oriented code by convention. Object support built into the language makes this easier but it isn't essential (as large C projects like the Linux kernel and CPython itself demonstrate).

Where languages like Java can be an issue as beginner languages is that by requiring that all code be compiled in advance and written in an object oriented style, they set a minimum level of complexity for all programs written in that language. Understanding even the most trivial program in Java requires that you grasp the concepts of a module, a class, an instance, a method and an expression. Using a compiled procedural language instead at least lets you simplify that a bit, as you only need to understand modules, functions and expressions.

But the scripting languages? By means of a read-eval-print loop in an interactive interpreter, they let you do things in the right order, starting with the minimum level of complexity possible: a single expression.

For convenience, you may then introduce the concept of a 'script' early, but with an appropriate editor, scripts may be run directly in the application (giving an experience very similar to a REPL prompt) rather than worrying about command line invocation.

More sophisticated algorithms can then be introduced by discussing conditional execution and repetition (if statements and loops), but still without any need to make a distinction between "definition time" and "execution time".

Then, once the concept of algorithms has been covered, we can start to modularise blocks of execution as functions and introduce the idea that algorithms can be stored for use in multiple places so that "definition time" and "execution time" may be separated.

Then we start to modularise data and the associated operations on that data as classes, and explore the ways that instances allow the same operations to readily be performed on different data sets.
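A compact way to see that progression (the names here are invented purely for illustration):

    # Step 1: a bare expression at the interactive prompt
    3 * 3 + 4 * 4

    # Step 2: the algorithm captured as a reusable function
    def hypotenuse_squared(a, b):
        return a * a + b * b

    # Step 3: data and the operations on it bundled together as a class
    class Vector2D:
        def __init__(self, x, y):
            self.x = x
            self.y = y

        def length_squared(self):
            return self.x * self.x + self.y * self.y

    print(hypotenuse_squared(3, 4))          # 25
    print(Vector2D(3, 4).length_squared())   # 25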

Then we start to modularise collections of classes (and potentially data and standalone functions) as separate modules (and, in the case of Python, this can be a good time to introduce the idea of "compilation time" as separate from both "definition time" and "execution time").

Continuing up the complexity scale, modules may then be bundled into packages, and packages into frameworks and applications (introducing "build time" and "installation time" as two new potentially important phases in program execution).

A key part of the art of software design is learning how to choose an appropriate level of complexity for the problem at hand - when a problem calls for a simple script, throwing an entire custom application at it would be overkill. On the other hand, trying to write complex applications using only scripts and no higher level constructs will typically lead to an unmaintainable mess.

In my opinion, the primary reason that scripting languages are easier to learn for many people is that they permit you to start immediately with code that "does things", allowing the introduction of the "function" and "class" abstractions to be deferred until later.

Starting with C and Java, on the other hand, always requires instructors to say "Oh, don't worry about that boilerplate, you'll learn what it means later" before starting in with the explanation of what can go inside a main() function or method. The "compilation time" vs "execution time" distinction also has to be introduced immediately, rather than being deferred until some later point in the material. There's also the fact that such languages are usually at least two languages in one: the top level "compile time" language that you use to define your data structures, functions and modules and the "run time" language that you actually use inside functions and methods to get work done. Scripting languages don't generally have that distinction - the top level language is the same as the language used inside functions (in fact, that's my main criterion for whether or not I consider a language to be a scripting language in the first place).
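Python shows what that lack of a distinction looks like in practice - def and class are just ordinary runtime statements, so the same language works everywhere (a tiny, contrived sketch):

    # 'class' and 'def' are runtime statements, so they work anywhere a
    # statement is allowed - there's no separate "compile time" language
    def make_adder(n):
        def add(x):          # a function defined inside another function
            return x + n
        return add

    print(make_adder(2)(3))  # 5

    for label in ("A", "B"):
        class Plugin:        # a class defined inside a loop
            name = label
        print(Plugin.name)   # A, then B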

Of Python and Road Maps (or the lack thereof)

I gave my first ever full conference talk at PyCon AU this year, on the topic of how Python evolves over time.

One notable absence from that talk is a discussion of any kind of specific 'road map' for the language as a whole. There's a reason for that: no such document exists. Guido van Rossum (as the language creator and Benevolent Dictator for Life) is the only person who could write an authoritative one and he's generally been happy to sit back and let things evolve organically since guiding the initial phases of the Py3k transition (which is now ticking along quite nicely and the outlook for 2.x being relegated to 'supported-but-now-legacy' status within the 2013/14 time frame is promising). PEP 398, the release PEP for Python 3.3, comes close to qualifying, but is really just a list of PEPs and features, with no context for why those ideas are even on the list.

However, despite that, it seems worthwhile to make the effort to try to write a descriptive road map for where I personally see current efforts going, and where I think the Python core has a useful role to play. Looking at the Python Enhancement Proposal index for the lists of Accepted and Open PEPs can be informative, but it requires an experienced eye to know which proposals are currently being actively championed and have a decent chance of being accepted and included in a CPython release.

When reading the following, please keep in mind that the overall Python community is much bigger than the core development team, so there are plenty of other things going on out there. Others may also have differing opinions on which changes are most likely to come to fruition within the next few years.

Always in Motion is the Future

Unicode is coming. Are you ready?

The Python 2.x series lets many developers sit comfortably in a pure ASCII world until the day when they have to debug a bizarre error caused by their datastream being polluted with non-ASCII data.

That is only going to be a comfortable place to be for so long. More and more components of the computing environment are migrating from ASCII to Unicode, including filesystems, some network protocols, user interface elements, other languages and various other operating system interfaces. With the modern ease of international travel and the rise of the internet, even comfortable "we only speak ASCII in this office, dammit!" environments are eventually going to have to face employees and customers that want to use their own names in their own native, non-Latin character sets rather than being constrained by a protocol invented specifically to handle English.

The Python 3.x series was created largely in response to that environmental transition - in many parts of the world where English isn't the primary spoken language, needing to deal with text in multiple disparate character encodings is already the norm, and having to deal regularly with non-ASCII data is only going to become more common even in English dominant areas.

However, supporting Unicode correctly on this scale is something of a learning experience for the core development team as well. Several parts of the standard library (most notably the email package and wsgiref module) really only adjusted to the new Unicode order of things in the recent 3.2 release, and (in my opinion) we've already made at least one design mistake in leaving various methods on bytes and bytearray objects that assume the underlying data is text encoded as ASCII.
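A small illustration of that ASCII assumption:

    data = "münster".encode("utf-8")
    print(data)                          # b'm\xc3\xbcnster'
    print(data.upper())                  # b'M\xc3\xbcNSTER' - only the ASCII letters change
    print(data.upper().decode("utf-8"))  # 'MüNSTER', not the 'MÜNSTER' a text API would give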

The conversion to Unicode everywhere also came at a measurable cost in reduced speed and increased memory usage within the CPython interpreter. PEP 393 ("Flexible String Representation") is an attempt to recover some of that ground by using a smarter internal representation for string objects.

Easier and more reliable distribution of Python code

Tools to package and distribute Python code have been around in various stages of usability since before the creation of distutils. Currently, there is a concerted effort by the developers of tools like pip, virtualenv, distutils2, distribute and the Python Package index to make it easier to create and distribute packages, to provide better metadata about packages (without requiring execution of arbitrary Python code), to install packages published by others, and to play nicely with distribution level package management techniques.

The new 'packaging' module that will arrive in 3.3 (and will be backported to 3.2 and 2.x under the 'distutils2' name) is a visible sign of that, as are several Python Enhancement Proposals related to packaging metadata standards. These days, the CPython interpreter also supports direct execution of code located in a variety of ways.

There's a significant social and educational component to this effort in addition to the technical underpinnings, so this is going to take some time to resolve itself. However, the eventual end goal of a Python packaging ecosystem that integrates nicely with the various operating system software distribution mechanisms without excessive duplication of effort by developers and software packagers is a worthy one.

Enhanced support for various concurrency mechanisms

The addition of the 'concurrent' package, and its sole member, 'concurrent.futures' to the Python 3.2 standard library can be seen as a statement of intent to provide improved support for a variety of concurrency models within the standard library.
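A minimal example of the new 3.2 API (the work function here is just a stand-in):

    from concurrent import futures

    def crunch(n):
        return sum(i * i for i in range(n))

    with futures.ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns Future objects; result() blocks until each one is done
        fs = [pool.submit(crunch, n) for n in (10, 100, 1000)]
        print([f.result() for f in fs])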

PEP 3153 ("Asynchronous IO support") is an early draft of an API standardisation PEP that will hopefully provide a basis for interoperability between components of the various Python asynchronous frameworks. It's a little hard to follow for non-Twisted developers at this stage, since there's no specific proposed stdlib API, nor any examples showing how it would allow things like plugging a Twisted protocol or transport into the gevent event loop, or vice-versa.

While developers are still ultimately responsible for understanding the interactions between threading and multiprocessing on their platform when using those modules, there are things CPython could be doing to make it easier on them. Jesse Noller, the current maintainer of the multiprocessing package, is looking for someone to take over the responsibility for coordinating the related development efforts (between cross-platform issues and the vagaries of os.fork interactions with threading, this is a non-trivial task, even with the expertise already available amongst the existing core developers).

PEP 3143 ("Standard daemon process library") may be given another look for 3.3, although it isn't clear if sufficient benefit would be gained from having that module in the stdlib rather than published via PyPI as it is currently.

Providing a better native Python experience on Windows

Windows is designed and built on the basis of a mindset that expects most end users to be good little consumers that buy software from developers rather than writing it themselves. Anyone writing software is thus assumed to be doing so for a living, and hence paying for professional development tools. Getting a decent development environment set up for free can be so tedious and painful that many developers (like me!) would rather just install and use Linux for development on our own time, even if we're happy enough to use Windows as a consumer (again, like me - mostly for games in my case). The non-existent "BUNDLE ALL THE THINGS!" approach to dependency management and the lack of any kind of consistent software updating interface just makes this worse, as does the fact that any platform API handling almost always has to special-case Windows, since their libc implementation is one of the more practically useless things on the face of this Earth.

This creates a vicious cycle in open source, where most of the active developers are using either Linux or Mac OS X, so their tools and development processes are focused on those platforms. They're also typically very command line oriented, since the diverse range of graphical environments and the widespread usage of remote terminal sessions makes command line utilities the most reliable and portable. Quite a few new developers coming into the fold run into the Windows roadblocks and either give up or switch to Linux for development (e.g. by grabbing VirtualBox and running a preconfigured Linux VM to avoid any hardware driver issues), thus perpetuating the cycle.

Personally, I blame the cross-platform hobbyist developer hostile nature of the platform on the Windows ecosystem itself and generally refuse to write code on it unless someone's paying me to do so. Fortunately for single-platform Windows users and developers, though, there are other open source devs that don't feel the same way. Some of them are working on tools (such as the script launcher described in PEP 397) that will make it easier to get usable development environments set up for new Python users.

Miscellaneous API improvements in the standard library

There are some areas of the standard library that are notoriously clumsy to use. HTTP/HTTPS comes to mind, as does date/time manipulation and filesystem access.

It's hard to find modules that strike the sweet spot of being well tested, well documented and proven in real world usage, with mature, stable APIs that aren't overly baroque and complicated (yes, yes, I realise the irony of that criticism given the APIs I'm talking about replacing or supplementing). It's even harder to find such modules in the hands of maintainers that are willing to consider supporting them as part of Python's standard library (potentially with backports to older versions of Python) rather than as standalone modules.

PEP 3151 ("Reworking the OS and IO exception hierarchy") is an effort to make filesystem and OS related exceptions easier to generate and handle correctly that is almost certain to appear in Python 3.3 in one form or another.
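To see the appeal (assuming the PEP lands in roughly its proposed form), compare the errno inspection needed today with the finer-grained exceptions it proposes:

    import errno

    # Today: catch the broad exception and inspect errno by hand
    try:
        f = open("config.ini")
    except IOError as err:
        if err.errno != errno.ENOENT:
            raise
        f = None

    # Under PEP 3151 (if it lands as proposed), the intent reads directly:
    # try:
    #     f = open("config.ini")
    # except FileNotFoundError:
    #     f = None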

In the HTTP/HTTPS space, I have some hopes regarding Kenneth Reitz's 'requests' module, but it's going to take more real world experience with it and stabilisation of the API before that idea can be given any serious consideration.

The regex module (available on PyPI) is also often kicked around as a possible candidate for addition. Given the benefits it offers over and above the existing re module, it wouldn't surprise me to see a genuine push for its inclusion by default in 3.3 or some later version.

A basic stats library in the stdlib (to help keep people from implementing their own variants, usually badly) is also a possibility, but again, depends on a suitable candidate being identified and put forward by people willing to maintain it.

The RFC treadmill

Many parts of the Python standard library implement various RFCs and other standards. While our release cycle isn't fast enough to track the newer, still evolving protocols, it's good to provide native support for the latest versions of mature, widely supported ones.

For example, support for sendmsg(), recvmsg() and recvmsg_into() methods will be available on socket objects in 3.3 (at least on POSIX-compatible platforms). Additions like PEP 3144 ("IP Address Manipulation Library for the Python Standard Library") are also a possibility (although that particular example currently appears to be lacking a champion).

Defining Python-specific interoperability standards

Courtesy of the PEP process, python-dev often plays a role in defining and documenting interfaces that allow various Python frameworks to play nicely together, even if third party libraries are needed to make full use of them. WSGI, the Web Server Gateway Interface, is probably the most well-known example of this (the current version of that standard is documented in PEP 3333). Other examples include the database API, cryptographic algorithm APIs, and, of course, the assorted standards relating to packaging and distribution.
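WSGI is also a good illustration of how lightweight these interoperability standards can be - under PEP 3333, a complete (if trivial) application is just a callable with the right signature:

    def application(environ, start_response):
        # PEP 3333 conventions: status and headers are native strings,
        # the response body is an iterable of bytes
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Hello from a WSGI application\n"]

    # Any compliant server (wsgiref, mod_wsgi, gunicorn, ...) can host this callable.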

PEP 3333 was a minimalist update to WSGI to make it Python 3 compatible, so, as of 3.2, it's feasible for web frameworks to start considering Python 3 releases (previously, such releases would have been rather challenging, since it was unclear how they should talk to the underlying web server). It likely isn't the final word on the topic though, as the web-sig folks still kick around ideas like PEP 444 and WSGI Lite. Whether anything actually happens on that front or if we keep chugging along with an attitude of "eh, WSGI gets the job done, quit messing with it" is an open question (and far from my area of expertise).

One area that is being actively worked on, and will hopefully improve significantly in Python 3.3, is the low level buffer interface protocol defined by PEP 3118. It turned out that there were a few flaws in the way that PEP was implemented in CPython, so even though it did achieve the basic goal of allowing projects like NumPy and PIL to interoperate without needing to copy large data buffers around, there are still some rather rough edges in the way the protocol is exposed to Python code via memoryview objects, as well as in the respective obligations of code that produces and consumes these buffers. More details can be found on the issue tracker if you're interested in the gory details of defining conventions for reliable management of shared dynamically allocated memory in C and C++ code :)
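At the Python level, the protocol is exposed through memoryview objects, which support slicing and in-place modification without copying the underlying buffer (a small illustration of the basic idea):

    data = bytearray(b"large binary payload")
    view = memoryview(data)

    chunk = view[6:12]       # a view of a slice - no copy of the underlying bytes
    chunk[:] = b"BINARY"     # writes through to the original buffer
    print(data)              # bytearray(b'large BINARY payload')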

Enhancing and refining the import system

The importer protocol defined by PEP 302 was an ambitious attempt to decouple Python's import semantics from the underlying filesystem. It never fully succeeded - there's still a lot of Python code (including one or two components in the standard library!) that assumes the classical directories-on-a-disk layout for Python packages. That package layout, with the requirement for __init__.py files and restriction to a single directory for each package, is itself surprising and unintuitive for many Python novices.

There are a few suggestions in the works ultimately aimed not only at cleaning up how all this is implemented, but also at further decoupling the module hierarchy from the on-disk file layout. There are significant challenges in doing this in a way that makes life easier for developers writing new packages, while also allowing those writing tools that manipulate packages to more easily do the right thing, but it's an area that can definitely do with some rationalisation.

Improved tools for factoring out and reusing code

Python 3.3 will at least bring with it PEP 380's "yield from" expression which makes it easy to take part of a generator and split it out into a subgenerator without affecting the overall semantics of the original generator (doing this correctly in the general case is currently significantly harder than you might think - see the PEP for details).
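A sketch of the planned syntax (this requires the as-yet-unreleased 3.3, so treat it purely as an illustration):

    def chunk(values):
        for value in values:
            yield value
        return len(values)      # generators returning values is also new in 3.3

    def chained(*groups):
        counts = []
        for group in groups:
            # 'yield from' forwards each item, plus any send()/throw() calls,
            # and evaluates to the subgenerator's return value
            counts.append((yield from chunk(group)))
        return counts

    # list(chained([1, 2], [3])) produces [1, 2, 3]; the final counts, [2, 1],
    # become the StopIteration value of the outer generator.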

I suspect that the next few years may bring some more tweaks to generators, generator expressions and context managers to make for loops and with statements even more powerful utilities for factoring out code. However, any adjustments in this area will be carefully balanced against the need for language stability and keeping things reasonably easy to learn.

Not Even on the Radar

In addition to the various suggestions mentioned in PEP 3099 ("Things that Will Not Change in Python 3000"), there are a couple of specific items I think are worth calling out as explicitly not being part of any realistic road map:

Major internal architectural changes for CPython

Questions like 'Why not remove the GIL?', 'Why not switch to a register based VM?', 'Why not add a JIT to CPython?' and 'Why not make PyPy the reference interpreter?' don't really have straightforward obvious answers other than "that's harder than you might think and probably less beneficial than you might hope", so I expect people to continue asking them indefinitely. However, I also don't see any significant changes coming on any of these fronts any time soon.

One of the key advantages of the CPython interpreter as a reference implementation is that it is, fundamentally, quite a simple beast (despite a few highly sophisticated corners). If we can do things, chances are that the other implementations are also going to be able to support them. By contrast, the translation stage in their toolchain means that PyPy can contemplate features like the existing JIT, or their current exploration of Software Transactional Memory for removing their Global Interpreter Lock in a useful way that simply aren't feasible with less sophisticated tools (or a much bigger development team).

Personally, I think the status quo in this space is in a pretty good place, with python-dev and CPython handling the evolution of the language specification itself, as well as providing an implementation that will work reasonably well on almost any platform with a C compiler (and preferably some level of POSIX compliance), while the PyPy crew focus on providing a fast, customisable implementation for the major platforms without getting distracted by arguments about possible new language features.

More feature backports to the 2.x series

Aside from the serious backwards compatibility problems accompanying the Unicode transition, the Py3k transition was also strongly influenced by the concept of paying down technical debt. Having legacy cruft lying around made it harder to introduce language improvements, since the additional interactions created more complexity to deal with, to the point where people just didn't want to bother any more.

By clearing out a bunch of that legacy baggage, we created a better platform for new improvements, like more pervasive (and efficient!) Unicode support, even richer metaclass capabilities, exception chaining, more intuitive division semantics and so on.

The cost, borne mostly by existing Python developers, is a long, slow transition over several years as people deal with the process of not only checking whether the automated tools can correctly handle their own code, but also waiting for all of their dependencies to be available on the updated platform. This actually seems to be going fairly well so far, even though people can be quite vocal in expressing their impatience with the current rate of progress.

Now, all of the major Python implementations are open source, so it's certainly possible for motivated developers to fork one of those implementations and start backporting features of interest from 3.x that aren't already available as downloads from PyPI (e.g. it's easy to download unittest2 to get access to 3.x unittest enhancements, so there's no reason to fork just for that kind of backport). However, anyone doing so will be taking up the task in full knowledge of the fact that the existing CPython development team found that process annoying and tedious enough that we got tired of doing it after two releases (2.6 and 2.7).

So, while I definitely expect a certain level of ongoing griping about this point, I don't expect it to rise to the level of anyone doing the work to backport things like function annotations or exception chaining to the 2.x series.

Update 1: Added missing section about RFC support
Update 2: Fix sentence about role of technical debt in Py3k transition
Update 3: Added a link to a relevant post from Guido regarding the challenges of maintaining official road maps vs one-off "where are we now?" snapshots in time (like this post)
Update 4: Added missing section on defining Python-specific standards

Sharing vs broadcasting

Ever since I started playing with Google+ and its Circle mechanic, I've been trying to figure out why I don't like it. I thought it sounded great when I heard about it, but as soon as I started using it... meh :P

I still haven't quite figured it out (although I mostly think it's the neither-chicken-nor-fowl aspect, along with it adding back ~500 useless "suggestions" from random Python mailing list contacts that I had already purged once), but it's certainly helped me see the way I communicate online in a different light.

For a long time, my only real online presence was regular postings to Python mailing lists, commenting on various sites and the very occasional post here on B&L.

After enough of my friends joined, Facebook was added to the mix, but I do everything I can to keep that locked down so only acquaintances can see most things I post there.

After going to PyconAU last year, and in the leadup to PyCon US earlier this year, I started up my @ncoghlan_dev account on Twitter, got the "python" category tag on here added to Planet Python, and started to develop a bit more of an online public presence.

Here on the blog, I can, and do, tag each post according to areas of interest. As far as I know, the 'python' tag is the only one with any significant readership (due to the aggregation via Planet Python), but people could subscribe to the philosophy or metablogging tags if they really wanted to.

When it comes to sharing information about myself, there's really only a few levels based on how much I trust the people involved: Friends&Family, Acquaintances, General Public pretty much covers it. Currently I handle that via a locked down FB for Friends, Family & Acquaintances (with a "Limited Access" list for people that I exclude from some things, like tagged photos) and completely public material on Twitter and Blogger.

The public stuff is mostly Python related, since that's my main online presence, but I'm fairly open about political and philosophical matters as well. FB, by contrast, rarely sees any mention of Python at all (and I'm often a little more restrained on the political and philosophical front).

Where I think Circles goes wrong is that it conflates Access Control with Topic Tagging. When I publish Python stuff, I'm quite happy for it to be public. However, I'd also be happy to tag it with "python", just as I do here on the blog, to make it easier for my friends to decide which of my updates they want to see.

This is classic Publish/Subscribe architecture thinking. When Publishing, I normally want to be able to decide who *can* access stuff. That is limited by closeness, but typically unrelated to the topic. However, tagging my posts as a service to my subscribers is something I'm quite happy to do. When Subscribing, I want to be able to filter by topic.

If I publish something to, say, my Pythonistas circle, then that does more than I want. Yes, it publishes it to them, but it also locks out everybody else. The ways I know people and the degree to which I trust them do not align well with the topics I like to talk about. I've already seen quite a few cases where the reshared versions of public statuses that I've received have been limited access.

The more I think about it, the more I think I'm going to stick with my current policy of using it as a public micro-blog and pretty much ignore the fact that the Circles page exists.

Effective communication, brain hacking and diversity

Jesse Noller recently posted an interesting quote on Twitter:
"One who feels hurt while listening to harsh language may lose his mindfulness and not hear what the other person is really saying."
When you think about it, human language is a truly awe inspiring tool. By the simple act of creating certain vibrations in the air, marks on a page or electromagnetic patterns in a storage system, we're able to project our thoughts and feelings across space and time, using them to shape the thoughts and feelings of others.

While this ability to communicate is so thoroughly natural to most of us as humans that we typically take it for granted, it is actually an amazing world shaping capability deserving of our respect and attention.

And once we start giving it the attention it deserves, then we realise that we can judge the effectiveness of our own communication by looking at the communication that is subsequently reflected back at us. How well do those reflections mirror the thoughts and feelings that our own words were intended to create? It's the linguistic equivalent of running our code and seeing if it does what we wanted it to do.

All communication is a form of brain hacking, even when the only target is ourselves. We write lists of goals - formulating for our own benefit concrete plans of action that we can then tackle one step at a time. We write polemics, trying to engender in others some dim sense of the joy or outrage we feel with respect to certain topics. Sometimes we succeed, sometimes we fail. Sometimes we assume certain shared beliefs and understanding, so the point completely fails to come across to those without that common background.

For those closest to us, those with the most shared history, we have a rich tapestry of common knowledge to draw from. Movies we've all seen, books we've all read, events we all attended, discussions we were all part of - outsiders attempting to follow a transcript of our conversations would likely soon be utterly lost due to the shared subtext that isn't explicitly articulated (and the same is true for any group of close friends).

As groups get larger, the amount of truly common knowledge decreases, but there's still plenty of unwritten subtext that backs up whatever is explicitly articulated. In a certain sense, that unwritten subtext can be seen as the very definition of culture - it's the things you don't have to say because they're assumed. My past musings on the culture of python-dev are an example of this.

And that brings us to the point of considering the opening quote, diversity and questions of common courtesy. When speaking to friends, I can share truly awful jokes without offence because of the shared background information as to what is and isn't acceptable (and what things should be taken seriously). As the group being addressed gets larger, then the valid assumptions I can make about shared views of the world become fewer and fewer, so I have to start explicitly articulating things I would otherwise assume, and simply not say some things because I know (or at least strongly suspect) that they won't come across correctly to the audience I'm attempting to reach. Sometimes even addressing similar groups of people in a different context can change the assumptions as to what is a reasonable way to phrase things.

If a member of my target audience gets hung up on my wording or my choice of examples to the point where they miss the underlying message, then to a large degree, the responsibility lies with me as the originator of the communication. Now, I'm not a saint and make no pretence of being one. The rich fields of metaphors in English include many relating to subjects that are truly quite horrific or otherwise offensive to various groups of people. Sometimes I'm going to use that kind of phrasing without thinking about it, especially when talking rather than writing (my own innate tact filter is definitely set up to filter incoming communication, so applying tact in the outwards direction is a conscious process rather than something I do automatically). If such a miscommunication happens and someone points it out, then the onus is on me to admit that yes, my choice of words was poor and obscured my meaning rather than illuminating it. That's life, I make mistakes, and hopefully we can move on.

It's not entirely a one way street, though. Just as we apply contextual analysis to our understanding of historical writings, so it can be useful to apply the same approach to things that are said by current figures. Richard Dawkins recently made some ill-advised comments in relation to Skepchick's advice to men to avoid certain actions that make them look creepy (that's all she said, "Don't do this, it's creepy" and she copped flack for it, as if she'd said people doing it should be sent to prison or castrated or something equally extreme). Does the fact that Dawkins clearly didn't get why he was in the wrong make him a horrible human being or devalue his extensive contributions to our understanding of evolutionary biology*? No, it doesn't, any more than Isaac Newton's obsession with alchemy devalued his contributions to physics and mathematics. It just makes him a product of the culture that raised him. Hopefully he'll eventually realise this and publicly apologise for failing to give the matter due consideration before weighing in.

However, what really surprised me is the number of people indicating that they're shocked by his words, or questioning their support for his other activities, just because he so vividly demonstrated his cluelessness on this particular topic. The world is a complicated place, and the social dynamics of privilege, cultural blindspots and effectively encouraging diversity are far from the easiest of things to comprehend. Hell, as a middle-class, 30-something, white, English-speaking, straight, cisgendered male living in Australia, I'm quite certain that my own grasp of the topic is heavily coloured by the fact that on pretty much any of the typical grounds for discrimination I'm in the favoured majority (being an atheist is arguably the only exception, but that's far less of a problem here in Australia than it is in the US. Our Prime Minister is an acknowledged atheist and even the Murdoch media machine didn't really try to make much of an issue out of that before the last election). I do my best to understand the topic of diversity based on the experiences of those that actually have to deal with it on a daily basis, but it's still a far cry from seeing things first hand.

So, since I don't believe I can speak credibly to the topic of diversity directly, I instead prefer to encourage people to reflect on the value and nature of communication and community in general. Martin Fowler wrote an excellent piece about the challenge of creating communities that are welcoming to a diverse audience without rendering them bland and humourless (as the benign violations of expectations and assumptions that are at the heart of most humour often depend on the shared context that welcoming communities can't necessarily assume). This excellent video highlights the importance of focusing on actions (e.g. "This thing you said was inappropriate and you should consider apologising for saying it") rather than attributes (e.g. "You are a racist/misogynist/whatever"). If the latter is actually true, you're unlikely to change their mind and if it *isn't* true, you're likely to miss an opportunity to educate them as they get defensive and stop listening (refer back to that opening quote!).

I don't pretend to have all (or even any of) the answers, I just believe the entire topic of effective communication and all it entails is one worthy of our collective consideration, since effective communication is almost always a necessary precursor to taking effective action (e.g. on matters such as mitigating and coping with climate change).

In many respects though, the entire topic is really quite simple. To quote Abe Lincoln in one of my all time favourite movies:
Be excellent to each other.

----
* Seriously, read the popular science books on biology that Dawkins has written, especially "The Greatest Show on Earth". They're orders of magnitude better than "The God Delusion", which is far too laden with angry and aggressive undertones to be an effective tool for communicating with anyone that doesn't already agree with the thesis of the book. In his biology books, his obvious love and passion for the subject matter comes to the fore and they're by far the better for it.

Note for Planet Python: even though this post is about communication rather than code, I have included the python tag since a couple of different diversity related issues have come up recently on python-dev and psf-members. It is no coincidence that "communication" and "community" share a common(!) root in "communis".

Sure it's surprising, but what's the alternative?

Armin Ronacher (aka @mitsuhiko) did a really nice job of explaining some of the behaviours of Python that are often confusing to developers coming from other languages.

However, in that article, he also commented on some of the behaviours of Python that he still considers surprising and questions whether or not he would retain them in the hypothetical event of designing a new Python-inspired language that had the chance to do something different.

My perspective on two of the behaviours he listed is that they're items that are affected by fundamental underlying concepts that we really don't explain well even to current Python users. This reply is intended to be a small step towards correcting that. I mostly agree with him on the third, but I really don't know what could be done as an alternative.

The dual life of "."

Addressing the case where I mostly agree with Armin first, the period has two main uses in Python: as part of floating point and complex number literals and as the identifier separator for attribute access. These two uses collide when it comes to accessing attributes directly on integer literals. Historically that wasn't an issue, since integers didn't really have any interesting attributes anyway.

However, with the addition of the "bit_length" method and the introduction of the standardised numeric tower (and the associated methods and attributes inherited from the Complex and Rational ABCs), integers now have some public attributes in addition to the special method implementations they have always provided. That means we'll sometimes see things like:
1000000 .bit_length()
to avoid this confusing error:
>>> 1000000.bit_length()
  File "<stdin>", line 1
    1000000.bit_length()
                     ^
SyntaxError: invalid syntax
This could definitely be avoided by abandoning Guido's rule that parsing Python shall not require anything more sophisticated than an LL(1) parser and require the parser to backtrack when float parsing fails and reinterpret the operation as attribute access instead. (That said, looking at the token stream, I'm now wondering if it may even be possible to fix this within the constraints of LL(1) - the tokenizer emits two tokens for "1.bit_length", but only one for something like "1.e16". I'm not sure the concept can be expressed in the Grammar in a way that the CPython parser generator would understand, though)
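For completeness, a quick sketch of the usual spellings that sidestep the ambiguity today (nothing new here, just the standard workarounds: bind the literal to a name, parenthesise it, or leave a space so the tokenizer doesn't greedily consume the period as part of a float literal):

x = 1000000
print(x.bit_length())           # 20 - bind to a name first

print((1000000).bit_length())   # 20 - parenthesise the literal

print(1000000 .bit_length())    # 20 - the space keeps the int and the "." apart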

Decorators vs decorator factories

This is the simpler case of the two where I think we have a documentation and education problem rather than a fundamental design flaw, and it stems largely from a bit of careless terminology: the word "decorator" is used widely to refer not only to actual decorators but also to decorator factories. Decorator expressions (the bit after a "@" on a line preceding a function header) are required to produce callables that accept a single argument (typically the function being defined) and return a result that will be bound to the name used in the function header line (usually either the original function or else some kind of wrapper around it). These expressions typically take one of two forms: they either reference an actual decorator by name, or else they will call a decorator factory to create an appropriate decorator at function definition time.

And that's where the sloppy terminology catches up with us: because we've loosely used the term "decorator" for both actual decorators and decorator factories since the early days of PEP 318, decorator implementers get surprised at how difficult the transition can be from "simple decorator" to "decorator with arguments". In reality, it is just as hard as any transition from providing a single instance of an object to instead providing a factory function that creates such instances, but the loose terminology obscures that.

I actually find this case to be somewhat analogous to the case of first class functions. Many developers coming to Python from languages with implicit call semantics (i.e. parentheses optional) get frustrated by the fact that Python demands they always supply the (to them) redundant parentheses. Of course, experienced Python programmers know that, due to the first class nature of functions in Python, "f" just refers to the function itself and "f()" is needed to actually call it.

The situation with decorator factories is similar. @classmethod is an actual decorator, so no parentheses are needed and we can just refer to it directly. Something like @functools.wraps on the other hand, is a decorator factory, so we need to call it if we want it to create a real decorator for us to use.
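To make the distinction concrete, here's a minimal sketch (the trace and repeat names are just made up for the example): trace is a genuine decorator that can be referenced directly, while repeat is a factory that has to be called to produce one. Note that functools.wraps is itself a decorator factory, which is exactly why it needs those parentheses.

import functools

def trace(f):
    # An actual decorator: accepts the function, returns a replacement
    @functools.wraps(f)
    def wrapper(*args, **kwds):
        print("calling " + f.__name__)
        return f(*args, **kwds)
    return wrapper

def repeat(n):
    # A decorator factory: accepts configuration, returns a decorator
    def decorator(f):
        @functools.wraps(f)
        def wrapper(*args, **kwds):
            result = None
            for _ in range(n):
                result = f(*args, **kwds)
            return result
        return wrapper
    return decorator

@trace        # referenced directly - it's already a decorator
def greet():
    print("Hello")

@repeat(3)    # called at definition time to create the real decorator
def beep():
    print("Beep")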

Evaluating default parameters at function definition time

This is another case where I think we have an underlying education and documentation problem and the confusion over mutable default arguments is just a symptom of that. To make this one extra special, it lies at the intersection of two basic points of confusion, only one of which is well publicised.

The immutable vs mutable confusion is well documented (and, indeed, Armin pointed it out in his article in the context of handling of ordinary function arguments) so I'm not going to repeat it here. The second, less well documented point of confusion is the lack of a clear explanation in the official documentation of the differences between compilation time (syntax checks), function definition time (decorators, default argument evaluation) and function execution time (argument binding, execution of function body). (Generators actually split that last part up even further)
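For anyone who hasn't been bitten by the combination of the two, a minimal illustration of the symptom and the usual sentinel idiom (the function names are mine):

def append_to(item, seq=[]):
    # The default list is created once, when the "def" statement executes
    seq.append(item)
    return seq

print(append_to(1))   # [1]
print(append_to(2))   # [1, 2] - same list as the first call!

def append_to_fixed(item, seq=None):
    # The usual idiom: use a sentinel and create the default at call time
    if seq is None:
        seq = []
    seq.append(item)
    return seq

print(append_to_fixed(1))   # [1]
print(append_to_fixed(2))   # [2]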

However, Armin clearly understands both of those distinctions, so in his particular case I can't chalk the objection up to a simple gap in education. Instead, I'm going to consider the question of "Well, what if Python didn't work that way?".

If default arguments aren't evaluated at definition time in the scope defining the function, what are the alternatives? The only alternative that readily presents itself is to keep the code objects around as distinct closures. As a point of history, Python had default arguments long before it had closures, so that provides a very practical reason why deferred evaluation of default argument expressions really wasn't an option. However, this is a hypothetical discussion, so we'll continue.

Now we get to the first serious objection: the performance hit. Instead of just moving a few object references around, in the absence of some fancy optimisation in the compiler, deferred evaluation of even basic default arguments like "x=1, y=2" is going to require multiple function calls to actually run the code in the closures. That may be feasible if you've got a sophisticated toolchain like PyPy backing you up but is a serious concern for simpler toolchains. Evaluating some expressions and stashing the results on the function object? That's easy. Delaying the evaluation and redoing it every time it's needed? Probably not too hard (as long as closures are already available). Making the latter as fast as the former for the simple, common cases (like immutable constants)? Damn hard.

But, let's further suppose we've done the work to handle the cases that allow constant folding nicely and we still cache those on the function object so we're not getting a big speed hit. What happens to our name lookup semantics from default argument expressions when we have deferred evaluation? Why, we get closure semantics of course, and those are simple and natural and never confused anybody, right? (if you believe I actually mean that, I have an Opera House to sell you...)
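As a reminder of how "simple and natural" closure semantics already are, consider the classic late binding surprise (my example, not Armin's) - and note that the standard workaround exploits precisely the definition time evaluation of default arguments that we're hypothetically removing:

# All three lambdas close over the *variable* i, not its value at
# the time each lambda was created
callbacks = [lambda: i for i in range(3)]
print([f() for f in callbacks])     # [2, 2, 2], not [0, 1, 2]

# The usual fix: capture the current value via an eagerly evaluated default
callbacks = [lambda i=i: i for i in range(3)]
print([f() for f in callbacks])     # [0, 1, 2]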

While Python's default argument handling is the way it is at least partially due to history (i.e. the lack of closures when it was implemented meant that storing the expression results on the function object was the only viable way to do it), the additional runtime overhead and the extra complexity in the implementation and semantics involved in delaying the evaluation make me suspect that going the runtime evaluation path wouldn't necessarily be the clear win that Armin suggests it would be.

Updated to do list for Python 3.3

I wrote a to-do list for 3.3 back in January. It's already obsolete, so inspired by Brett's recent update I have a new list of my own:

Finish PEP 394 (The "python" command on *nix platforms)

This is the PEP clarifying that the current collective recommendation from python-dev to Linux distributions (et al) is to stick with Python 2.x as the system python for the moment. It also involves adding a python2 symlink to the next release of 2.7 so that we follow our own advice.

PEP 3118 buffer API and memoryview fixes

On issue #10181 I've been doing the software architect thing in devising a way to address the flaws in the design of memoryview objects. Absent further contributions, I may resort to coding it myself; otherwise I'll at least review whatever patches are put forward.

PEP 380 (yield from expressions)

Guido gave his approval to PEP 380 recently. This is almost ready to go in, but some work needs to be done on incorporating the tests neatly into the regression test suite.
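For anyone who hasn't read the PEP, a quick sketch of the delegation syntax it adds (the example is mine, and obviously requires 3.3 to run):

def chunks(data, size):
    for start in range(0, len(data), size):
        yield data[start:start+size]

def in_groups_of_three(data):
    # "yield from" delegates to the subgenerator, forwarding sent values
    # and thrown exceptions as well as the yielded items
    yield from chunks(data, 3)

print(list(in_groups_of_three("abcdefg")))   # ['abc', 'def', 'g']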

Import engine

Import engine is a GSoC project I am mentoring that consolidates the scattered state for the import system into methods and attributes on a single object. While the "real" import system state will remain in the current locations for compatibility reasons, the import engine API will provide a more coherent interface to the whole mechanism. It will also allow creation of more limited engine instances for targeted imports (e.g. from plugin directories) that won't be confused by name conflicts with other libraries.

I'm hopeful Greg will get as far as writing the PEP itself before the end of the summer, otherwise that will become a follow-up activity.

CPython compiler enhancements

Eugene Toder put together a nice patch that refactors the CPython compiler to include an AST optimisation step as well as cleaning up several 2.x holdovers in the AST. Getting the AST compiler ready for Python 2.5 is one of the first things I ever hacked on as a core developer and it's still one of my main interest areas.

PEP 395 (module aliasing)

After starting to work with Django and discovering it commits the sin of putting a package subdirectory on sys.path, I have even more motivation to provide a module aliasing mechanism that doesn't run the risk of accidentally getting multiple copies of the same module loaded in a process. That kind of thing should only happen when you do it on purpose.

I need to review my proposal in PEP 395 to make sure it also covers the Django use case (and modify the proposal if it doesn't).
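As a self-contained illustration of the trap (the pkg/settings names are mine, not Django's actual layout): once both a package directory and its parent end up on sys.path, the same file can be imported under two different names, producing two distinct module objects with two copies of all their state.

import os, sys, tempfile

base = tempfile.mkdtemp()
pkg_dir = os.path.join(base, "pkg")
os.mkdir(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "settings.py"), "w") as f:
    f.write("MARKER = object()\n")

# Both the parent directory *and* the package directory go on the path
sys.path.insert(0, base)
sys.path.insert(0, pkg_dir)

import settings        # loaded as the top level module "settings"
import pkg.settings    # loaded again, this time as "pkg.settings"

print(settings is pkg.settings)                 # False: two module objects
print(settings.MARKER is pkg.settings.MARKER)   # False: two copies of the state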

Core mentoring

There are still some coverage patches from the Pycon US sprints to be reviewed and committed, along with a couple from the core mentoring mailing list. I'm also likely to be leading the CPython sprints following PyconAU, so there'll be some more patches to review and commit as a result of that.

importlib bootstrapping

While the migration to importlib is definitely Brett's project, it's also one I keep a close eye on in case he wants any reviews or feedback on elements of the design.

PEP process tinkering

I'd still like to clean up PEP 0 by adding the "Consensus" state to PEP 1. However, it has dropped below the above items on the to-do list.

With statement tinkering

Screwing up @contextmanager for 3.2 gives me even more motivation to want implicit context managers and Alex Gaynor tells me he has a use case for a context manager that can skip the with statement body. As with the PEP process tinkering, these are still on the "yes, please" list, but I'm unlikely to get a chance to work on them any time soon.
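For context, a minimal @contextmanager sketch: the underlying generator has to yield exactly once, the with statement body runs at that yield point, and (short of raising an exception from __enter__) there's currently no supported way for the context manager to skip the body - which is the capability Alex is after.

from contextlib import contextmanager

@contextmanager
def logged(name):
    print("entering " + name)
    try:
        yield               # the with statement body runs here
    finally:
        print("leaving " + name)

with logged("example"):
    print("body runs here")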

Other miscellaneous features

Resurrecting the post-import hooks PEP would be nice and I'd also like to add some of the ABC registration viewing and monitoring features proposed on the issue tracker.

So, how long until 3.3 again? :)