
Spinning up the pulpdist project

One novel aspect of the pulpdist project is that it is starting with an almost completely blank slate from a technology point of view (aside from the decision to use Pulp as the main component of the mirroring network). Red Hat does have development standards for internal projects, of course (especially in the messaging space), but they're fairly flexible, leaving the individual tool development teams with a lot of options. If something ships with Fedora and/or RHEL, or is available under licensing terms that would be acceptable for inclusion in Fedora (and subsequently RHEL), then it's fair game.

This post focuses on the design of the management server. I'll write up a separate post looking at the currently planned design for the Pulp data transfer plugins.

Source Control

Unsurprisingly, Red Hat's internal processes are heavily influenced by Linux kernel processes. Accordingly, the source control tool of choice for new projects is Git. While I have a slight preference for Mercurial (due mainly to familiarity), I'm happy enough with any DVCS, so Git it is.

Primary Development Language

Python, of course. You don't hire a CPython core developer to get them to work on a Ruby or Perl project (although the current system I'm replacing was written in Perl). Since it's a web application, there will naturally be some JavaScript and CSS involved as well.

Web Framework

The main management application for pulpdist is going to be a full-scale web application: user profiles and authentication, database storage, communication with other web services, provision of a REST API, and integration with the engineering tools messaging bus. Basically, micro-frameworks need not apply.

While I expect Pyramid/Pylons would also have been able to do the job, I decided to go with Django 1.3. This was heavily influenced by social factors: I know a lot of Django devs that I can bug for advice, but the same is not true for Pyramid. The complexity of the whole Pyramid/Pylons/TurboGears setup is also not appealing - while veteran web developers may find the "you decide" approach a selling point, Django's batteries included approach makes it far simpler to get started quickly, and decide as I go along which pieces I should keep, discard or replace.

I've heard some experienced Django developers muttering complaints about the class based views design in 1.3, but to someone coming in as an experienced Python developer and a relatively noobish web developer, the CBV approach seems eminently sensible, while the old function based approach looks repetitive and insane. Object oriented programming was invented for a reason!

I'll admit that my perception may be biased by knowing exactly how to make multiple inheritance work the way I want it to, though :)
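To make the comparison concrete, here's the sort of thing I mean - a minimal Django 1.3 class based view, with behaviour tweaked by overriding a single method (the names and data below are invented for illustration, not taken from the actual project):

    # views.py - a minimal class based view (hypothetical names and data)
    from django.views.generic import TemplateView

    class SiteListView(TemplateView):
        """Display the list of mirrored sites."""
        template_name = "pulpdist/site_list.html"

        def get_context_data(self, **kwargs):
            context = super(SiteListView, self).get_context_data(**kwargs)
            context["sites"] = ["brisbane", "raleigh", "brno"]  # placeholder data
            return context

    # urls.py
    #   url(r'^sites/$', SiteListView.as_view(), name='site_list'),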

Web Server

The management server doesn't actually have that much work to do, so the basic Apache+mod_wsgi configuration will serve as an adequate starting point (any heavy lifting will be done by the individual Pulp instances, and the main data traffic on those doesn't run through their web service). WSGI provides the flexibility to revisit this later if needed.
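For the record, the WSGI entry point Apache needs is only a few lines of Python (the module path here is illustrative rather than the real deployment settings):

    # wsgi.py - the glue mod_wsgi loads to hand requests to Django 1.3
    import os
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pulpdist.settings")

    import django.core.handlers.wsgi
    application = django.core.handlers.wsgi.WSGIHandler()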

I've also punted on any web caching questions for now - the management server is low traffic, and once access to the Pulp sites is pushed out to a backend service, it should be fast enough, at least for the early iterations.

Authentication & Authorisation

The actual user authentication task will be handed off to Apache, and all management application access will be restricted to Kerberos authenticated users over SSL. Django's own permissions system will be used to handle authorisation restrictions. (The experimental prototype will use Basic Auth instead, since it is the Apache/Django integration the prototype needs to cover, not the Apache configuration for SSL and Kerberos authentication.)
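Whichever authentication mechanism Apache is using, the Django side looks much the same: trust the REMOTE_USER variable that Apache sets. A sketch of the relevant settings, using Django's stock remote user support (the final configuration may differ in the details):

    # settings.py - accept the user Apache has already authenticated
    MIDDLEWARE_CLASSES = (
        'django.contrib.sessions.middleware.SessionMiddleware',
        'django.contrib.auth.middleware.AuthenticationMiddleware',
        'django.contrib.auth.middleware.RemoteUserMiddleware',
        # ... other middleware ...
    )

    AUTHENTICATION_BACKENDS = (
        'django.contrib.auth.backends.RemoteUserBackend',
    )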

Integration with Pulp's user access controls is via OAuth, but the design for configuration of user permissions in the Pulp servers is still TBD.

Database and ORM

Again, the management server isn't doing the heavy lifting in this application. The Pulp instances use MongoDB, but for the management server I currently plan to use the standard Django ORM backed by PostgreSQL. For the prototype instance, the database is actually just an SQLite3 file. I'm not quite sold on this one as yet - it's tempting to start playing with SQLAlchemy, since I've already had to hack around some of the limitations in the native ORM in order to store encrypted fields. OTOH, I already have a ton of things to do on this project, so messing with this is a long way down the priority list.

Schema and data maintenance is handled using South.
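Wiring South in is pleasantly low effort - it's just another installed app, plus a couple of management commands in the day to day workflow (the app name below is hypothetical):

    # settings.py
    INSTALLED_APPS = (
        # ... the usual django.contrib apps ...
        'south',
        'pulpdist',   # hypothetical app name
    )

    # Typical South workflow, run from the project directory:
    #   python manage.py schemamigration pulpdist --initial
    #   python manage.py migrate pulpdist
    #   # ... later, after changing the models ...
    #   python manage.py schemamigration pulpdist --auto
    #   python manage.py migrate pulpdist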

HTML Templating

The standard Django templating engine should be sufficient for my needs. As with the ORM, it's tempting to look into upgrading it to something like Jinja2, but once again 'good enough' is likely to be the deciding factor.

For data table display, I'm using Django Tables 2 and form display will use Django Uni-Form.
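As a rough illustration of why I like django-tables2, declaring a data table is about as simple as declaring a form (the column names here are made up for the example):

    # tables.py - a data table definition with django-tables2
    import django_tables2 as tables

    class TreeSyncTable(tables.Table):
        tree = tables.Column()
        last_sync = tables.Column()
        status = tables.Column()

    # In the view:  table = TreeSyncTable(sync_data)
    # In the template, the table is rendered via the render_table template tag.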

REST API

The REST API for the service is currently there primarily as a development aid - it lets me publish the full data model to the web as soon as it stabilises (and even while it's still in flux), even if the UI for end users hasn't been fully defined. This is particularly useful for the metadata coming back from the Pulp server, since it doesn't need much post-processing to be included as raw data in the management server's own REST API. The JSON interface will also allow much of the backend processing to be fully exercised by the test suite without worrying about web UI details.

The design of the REST API was heavily influenced by this Lessons Learned piece from the RHEV-M developers. The Django Rest Framework means I can just define the data I want to display as a list or dictionary and the framework takes care of formatting it nicely, including rendering URLs as hyperlinks.
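Something along these lines is all a basic read-only view needs (this is only a sketch of the style, based on the current 0.x series of the framework - the class name and data are illustrative, not the real pulpdist API):

    # A rough illustration only - names and data are made up
    from djangorestframework.views import View

    class RepoIndex(View):
        """Publish the raw repo metadata retrieved from a Pulp server."""
        def get(self, request):
            # The framework takes care of rendering this as JSON or as a
            # browsable HTML page, with the URLs displayed as hyperlinks
            return [
                {"repo_id": "rhel-6-x86_64", "url": "/api/repos/rhel-6-x86_64/"},
                {"repo_id": "fedora-16-x86_64", "url": "/api/repos/fedora-16-x86_64/"},
            ]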

AMQP Messaging

I haven't actually started on this aspect in any significant way, but the two main contenders I've identified are python-qpid (which is what Pulp uses) and django-celery (which would also give me an internal task queue engine, something the management server is going to need). The prototype just does everything inline in the Django process, which is OK for experimentation on the LAN, but clearly inadequate in the long term when talking to multiple sites distributed around the planet. At this early stage, I expect the internal task management aspect is going to tip the decision in favour of the latter.
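To give an idea of what the django-celery option would look like, a background task is just a decorated function that the web app can queue asynchronously (the task below is hypothetical):

    # tasks.py - the sort of background task django-celery would give me
    from celery.task import task

    @task
    def sync_tree(site_id, tree_id):
        """Ask the Pulp server at the given site to sync one tree."""
        # ... call out to the site's Pulp REST API here ...
        return {"site": site_id, "tree": tree_id, "status": "requested"}

    # Queued from the web app (or a periodic schedule) with:
    #   sync_tree.delay("brisbane", "fedora-16-x86_64")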

Testing Regime

As the foundation for the automated testing, I'm going with Django Sane Testing (mainly based on the example of other internal Django projects). Michael Foord's mock module lets me run at least some of the tests without relying on an external Pulp instance (fortunately, the namespace conflict with Fedora's RPM building utility 'mock' was recently resolved with the latter's support library being renamed to 'mockbuild').
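A sketch of how the mocking works in practice - the PulpServerClient class and its get_repos method are placeholders standing in for whatever the real Pulp client interface ends up being:

    # Illustrative test only - the patched class and URL are hypothetical
    from mock import patch
    from django.test import TestCase

    class RepoViewTests(TestCase):
        def test_repo_index_without_live_pulp_server(self):
            fake_repos = [{"repo_id": "demo-tree"}]
            with patch("pulpdist.pulpapi.PulpServerClient") as mock_client:
                mock_client.return_value.get_repos.return_value = fake_repos
                response = self.client.get("/api/repos/")
            self.assertEqual(response.status_code, 200)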

Continuous integration is an open question at this point. Pulp uses Jenkins for CI and I'm inclined to follow their lead. The other main possibility is to use Beaker, Red Hat's internal test system originally set up for kernel testing (one key attraction Beaker offers is the ability to set up multi-server multi-site testing in a test recipe so I can run tests over the internal WAN).

Packaging

Tito is a tool for generating SRPMs and RPMs directly from a Git repository. For my own packages, this is the approach I'm using (with handcrafted spec files). For some strange reason, the sysadmins around here like it when internal devs provide things as pre-packaged RPMs for deployment :)

Packaging of upstream PyPI dependencies that aren't available as Fedora or RHEL packages is still a work in progress. I experimented with Tito and git submodules (which don't work) and git subtrees (which do work, but are seriously ugly). My next attempt is likely to be based on py2pack, so we'll see how that goes (I actually discovered that project by searching for 'cpanspec pypi' after hearing some of the Perl folks here extolling the virtues of cpanspec for easily packaging CPAN modules as RPMs).

I also need to switch to using virtualenv to get a clearer distinction between Fedora packages I added via yum install and stuff I picked up directly from PyPI with pip.


Mirror All The Things!

After describing the project I'm working on to a few people at PyConAU and BrisPy, I decided it might be a good idea to blog about it here. I do have a bit of an ulterior motive in doing so, though - I hope people will point out when I've missed useful external resources or applications, or when something I'm planning to do doesn't make sense to the assorted Django developers I know. Yes, that's right - I'd like to make being wrong on the internet work in my favour :)

The project is purely internal at this stage, but I hope to be able to publish it as open source somewhere down the line. Even being able to post these design concepts is pretty huge for me personally, though - before starting with Red Hat a few months ago, I spent the previous 12 and a half years working in the defence industry, which is about as far from Red Hat's "Default to Open" philosophy as it's possible to get.

Mirror, Mirror, On The Wall

The project Red Hat hired me to implement is the next generation of their internal mirroring system, which is used for various tasks, such as getting built versions of RHEL out to the hardware compatibility testing labs (and, when they're large enough, returning the generated log files to the relevant development sites), or providing internal Fedora mirrors at the larger Red Hat offices (such as the one here in Brisbane).

There are various use cases and constraints which mean the mirroring system needs to operate at the filesystem level, without making significant assumptions about the contents of the trees being mirrored (block level replication, and approaches that rely on the transferred data being laid out in specific ways, aren't viable alternatives for this project). The current incarnation of this system relies almost entirely on that venerable workhorse of the mirroring world, rsync.

However, the current system is also showing its age and has a few limitations that make it fairly awkward to work with. Notably, there's no one place to go to get an overview of the entire internal mirroring setup, the direct use of rsync means it isn't particularly friendly with other applications when it comes to sharing WAN bandwidth, and the servers involved waste quite a few cycles recalculating the same deltas for multiple clients. Hence the project I'm working on: it is intended to replace the existing system with something a bit more efficient and easier to manage, while also providing a better platform for adding new features.

Enter Pulp

Pulp is an open source (Python) project created by Red Hat to make it easier to manage private yum repositories. Via Katello, Pulp is one of the upstream components for Red Hat's CloudForms product.

The Pulp project is currently in the process of migrating from their original yum-specific architecture to a more general purpose Generic Content plugin architecture. It's that planned plugin architecture that makes Pulp a useful basis for the next generation internal mirroring system, which, at least for now, I am imaginatively calling pulpdist (referring to both "distribution with Pulp", since that's what the system does, and "distributed Pulp instances", since that's how the system will work).

The main components of the initial pulpdist architecture will be:
  • a front-end (Django 1.3) web app providing centralised management of the entire distribution network
  • custom importer and distributor plugins for Pulp to handle distribution of tree changes within the distribution network
  • custom importer plugins to handle the import of trees from their original sources and generation of any additional metadata needed by the internal distribution plugins
  • generic (and custom, if needed) plugins to make the trees available to the applications that need them

I'll be writing more on various details that I consider interesting as I go along. Initially, that will include my plan for the mirroring protocol to be used between the sites, as well as various decisions that need to be made when spinning up a Django project from scratch (while many of my specific answers are shaped by the target environment for internal deployment, the questions I needed to consider should be fairly widely applicable).