The benefits (and limitations) of PYC-only Python distribution

This Stack Overflow question hit my feed reader recently, prompting the usual discussion about the effectiveness of PYC-only distribution as a mechanism for obfuscating Python code.

PYC-Only Distribution

In case it isn't completely obvious from the name, PYC-only distribution is a matter of taking your code base, running "compileall" (or an equivalent utility) over it to generate the .pyc files, and then removing all of the original .py source files from the distributed version.

Plenty of Python programmers (especially the pure open source ones) consider this practice an absolute travesty and would be quite happy to see it disallowed entirely. Early drafts of PEP 3147 (PYC Repository Directories) in fact proposed exactly that - in the absence of the associated source file, a compiled PYC file would have been ignored.

However, such blatant backwards incompatibility aroused protests from several parties (including me), and support for PYC-only distribution was restored in later versions of the PEP (although "compileall" now requires a command line switch in order to generate the files in the correct location for PYC-only distribution).
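As a rough sketch of that workflow (not a polished build script): in current CPython the command line switch in question is "-b", and the library equivalent is the legacy=True argument to compileall.compile_dir(). Both write foo.pyc next to foo.py rather than into __pycache__, which is where the import system looks when the source file is missing. The helper name build_pyc_only and the pyc_only_demo module are made up for this example.

```python
import compileall
import importlib
import pathlib
import sys
import tempfile


def build_pyc_only(tree):
    """Compile every .py file under `tree` in place, then delete the sources."""
    tree = pathlib.Path(tree)
    # legacy=True mirrors "python -m compileall -b": the .pyc files land
    # beside the sources instead of inside __pycache__ directories.
    if not compileall.compile_dir(str(tree), legacy=True, quiet=1):
        raise RuntimeError("compilation failed, leaving sources in place")
    for source_file in list(tree.rglob("*.py")):
        source_file.unlink()


# Demonstration on a throwaway directory (hypothetical module name).
demo_dir = pathlib.Path(tempfile.mkdtemp())
(demo_dir / "pyc_only_demo.py").write_text('MESSAGE = "imported from bytecode"\n')
build_pyc_only(demo_dir)

remaining = sorted(p.name for p in demo_dir.iterdir())

# The module still imports, even though only the .pyc file is left.
sys.path.insert(0, str(demo_dir))
importlib.invalidate_caches()
demo_module = importlib.import_module("pyc_only_demo")
print(remaining, demo_module.MESSAGE)
```

Run this against a copy of your tree rather than your working checkout, since it deletes the .py files it finds.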

Use Cases

As I see it, there are a couple of legitimate use cases for PYC-only distribution:
  • Embedded firmware: If your code is going onto an embedded system where space is at a premium, there's no point including both your source code and the PYC files. Better to just include the compiled ones, as that is all you really need.
  • Cutting down on support calls (or at least making the ones you do get more comprehensible): Engineers and scientists like to tinker. It's in their nature. When they know just enough Python to be a danger to themselves and others, you can get some truly bizarre tickets if they've been fiddling with things and failed to revert their changes correctly (or didn't revert them at all). Shipping only the PYC files can help make sure the temptation to fiddle never even arises.

Of the two, the former is by far the stronger use case. The latter attempts a technical solution to a social problem, and those rarely work out well in the long run. Still, however arguable its merits, I personally consider deterrence of casual modifications a valid use case for the feature.

Drawbacks

Stripping the source code out of the distribution does involve some pretty serious drawbacks. The main one is the fact that you no longer have the ability to fall back to re-compilation if the embedded magic cookie doesn't match the execution environment.

This restricts practical PYC-only distribution to comparatively constrained environments that can ensure a matching version of Python is available to execute the PYC files, such as:
  • Embedded systems
  • Corporate SOEs (Standard Operating Environments)
  • Bundled interpreters targeting a specific platform

Cross-platform compatibility of PYC files (especially for 32-bit vs 64-bit and ARM vs x86) is also significantly less robust than the cross-platform compatibility of Python source code.
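The magic cookie check mentioned above is easy to reproduce by hand, which can be useful as a sanity check before shipping. The sketch below assumes nothing beyond the standard library: the first four bytes of a .pyc file are compared against importlib.util.MAGIC_NUMBER for the running interpreter. The function name pyc_matches_interpreter and the file names are invented for the example.

```python
import importlib.util
import pathlib
import py_compile
import tempfile


def pyc_matches_interpreter(pyc_path):
    """Return True if the .pyc file was built for this interpreter's bytecode."""
    with open(pyc_path, "rb") as pyc_file:
        return pyc_file.read(4) == importlib.util.MAGIC_NUMBER


work_dir = pathlib.Path(tempfile.mkdtemp())
source_path = work_dir / "example.py"
source_path.write_text("ANSWER = 42\n")

# A .pyc produced by the running interpreter carries the matching cookie.
good_pyc = work_dir / "example.pyc"
py_compile.compile(str(source_path), cfile=str(good_pyc), doraise=True)

# Simulate a file built by some other Python version by corrupting the cookie.
stale_pyc = work_dir / "stale.pyc"
stale_pyc.write_bytes(b"\x00\x00\r\n" + good_pyc.read_bytes()[4:])

good_result = pyc_matches_interpreter(good_pyc)
stale_result = pyc_matches_interpreter(stale_pyc)
print(good_result, stale_result)
```

With the source present, a mismatch just triggers a silent recompile. Without it, the import fails, which is why the target interpreter has to be pinned.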

Limitations

Going back to the SO question that most recently got me thinking about this topic, the big limitation to keep in mind is this: shipping only PYC files will not reliably keep anyone from reading your code. While comments do get thrown away by the compilation process, and docstrings can be stripped with the "-OO" option, Python still needs the names of your functions, classes, attributes and variables at runtime, so that information will always be present in the compiled bytecode. Given both the code structure and the original names, most decent programmers are going to be able to understand what the code was doing, even if they don't have access to the comments and docstrings.
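To make that concrete, here is a small sketch using only the standard library. It compiles a made-up function with optimize=2 (the in-process equivalent of "-OO"), pushes the result through marshal to approximate what ends up in a PYC file, and then inspects what survived. The function and its names are invented for illustration.

```python
import marshal
import types

source = '''
def compute_invoice_total(unit_price, quantity):
    """Internal pricing rule: never show this to customers."""
    # Apply the secret discount
    discounted_price = unit_price * 0.9
    return discounted_price * quantity
'''

# optimize=2 mirrors "-OO": docstrings (and asserts) are dropped.
module_code = compile(source, "<example>", "exec", optimize=2)

# A .pyc file is a short header followed by the marshalled code object,
# so a marshal round trip approximates what actually gets shipped.
shipped = marshal.loads(marshal.dumps(module_code))

# Pull the function's code object out of the module's constants.
func_code = next(
    const for const in shipped.co_consts if isinstance(const, types.CodeType)
)

print(func_code.co_name)      # the function name is still there
print(func_code.co_varnames)  # so are the parameter and local variable names
print(func_code.co_consts)    # the docstring and the comment are gone
```

The comment and docstring are gone, but the function name, parameter names and local variable names all come back out intact, and dis.dis(func_code) will show the full sequence of operations.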

While there aren't any currently active open source projects that provide full decompilation of CPython bytecode, such projects have existed in the past and could easily exist again in the future. There are also companies which provide Python decompilation as a paid service (decompyle and depython are the two that I am personally aware of).

Alternatives

You can deter casual tinkering reasonably well by placing your code in a zip archive with a non-standard extension (even .py!). If you prepend an appropriate shebang line, you can even mark it as executable on POSIX-based systems (see this post for more information).
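One way to build such an archive is the standard library's zipapp module (available in Python 3.5 and later), which zips up a directory containing a __main__.py and prepends the shebang line for you. The sketch below is only an illustration: the app_src directory, the app.py target name and the interpreter line are all assumptions made for the example.

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp

with tempfile.TemporaryDirectory() as tmp:
    # A stand-in "application": the archive needs a __main__.py entry point.
    src = pathlib.Path(tmp) / "app_src"
    src.mkdir()
    (src / "__main__.py").write_text('print("hello from the archive")\n')

    # Give the archive a misleading .py extension and a POSIX shebang line.
    target = pathlib.Path(tmp) / "app.py"
    zipapp.create_archive(src, target, interpreter="/usr/bin/env python3")

    # The file now starts with the shebang, followed by ordinary zip data.
    first_line = target.read_bytes().splitlines()[0]

    # The interpreter detects the zip archive and runs its __main__.py.
    result = subprocess.run(
        [sys.executable, str(target)], capture_output=True, text=True
    )
    output = result.stdout.strip()

print(first_line, output)
```

Anyone who opens app.py in a text editor sees a shebang followed by binary noise, though unzipping it recovers the original files, so this deters tinkering rather than reading.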

You could also write your code in Cython or RPython instead of vanilla Python and ship fully compiled executable binaries.

There are minifier projects for Python (such as mnfy) that could be fairly readily adapted to perform obfuscation tricks (such as replacing meaningful variable names with uninformative terms like "_id1").
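The renaming trick itself can be sketched with the standard ast module (ast.unparse needs Python 3.9 or later). This is not how mnfy works internally; it is a deliberately naive illustration that renames the parameters and assigned names of a single simple function. It ignores nested scopes, globals, names that are read before their first assignment, and callers that pass keyword arguments, all of which a real tool would have to handle. The class name LocalRenamer and the sample function are invented for the example.

```python
import ast


class LocalRenamer(ast.NodeTransformer):
    """Replace parameter and assigned variable names with _id0, _id1, ..."""

    def __init__(self):
        self.mapping = {}

    def _new_name(self, old_name):
        # Hand out one uninformative name per original name.
        if old_name not in self.mapping:
            self.mapping[old_name] = f"_id{len(self.mapping)}"
        return self.mapping[old_name]

    def visit_arg(self, node):
        node.arg = self._new_name(node.arg)
        return node

    def visit_Name(self, node):
        # Rename assignment targets, plus later uses of already-renamed names.
        # Untracked names (builtins, imports) are left alone.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._new_name(node.id)
        return node


source = """
def compute_total(unit_price, quantity):
    discounted_price = unit_price * 0.9
    return discounted_price * quantity
"""

tree = LocalRenamer().visit(ast.parse(source))
obfuscated = ast.unparse(tree)
print(obfuscated)

# The renamed function still behaves the same as the original.
original_ns, obfuscated_ns = {}, {}
exec(source, original_ns)
exec(obfuscated, obfuscated_ns)
same_result = (
    original_ns["compute_total"](10, 2) == obfuscated_ns["compute_total"](10, 2)
)
```

The function name itself is left alone here, since renaming anything that other modules import would mean rewriting every caller as well.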
