After describing the project I'm working on to a few people at PyConAU and BrisPy, I decided it might be a good idea to blog about it here. I do have a bit of an ulterior motive in doing so, though - I hope people will point out when I've missed useful external resources or applications, or when something I'm planning to do doesn't make sense to the assorted Django developers I know. Yes, that's right - I'd like to make being wrong on the internet work in my favour :)

The project is purely internal at this stage, but I hope to be able to publish it as open source somewhere down the line. Even being able to post these design concepts is pretty huge for me personally, though - before starting with Red Hat a few months ago, I spent the previous 12 and a half years working in the defence industry, which is about as far from Red Hat's "Default to Open" philosophy as it's possible to get.

Mirror, Mirror, On The Wall

The project Red Hat hired me to implement is the next generation of their internal mirroring system, which is used for various tasks, such as getting built versions of RHEL out to the hardware compatibility testing labs (and, when they're large enough, returning the generated log files to the relevant development sites), or providing internal Fedora mirrors at the larger Red Hat offices (such as the one here in Brisbane).

Various use cases and constraints mean the mirroring system needs to operate at the filesystem level, without making significant assumptions about the contents of the trees being mirrored: block-level replication, and approaches that rely on the transferred data being laid out in specific ways, aren't viable alternatives for this project. The current incarnation of the system relies almost entirely on that venerable workhorse of the mirroring world, rsync.
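For readers who haven't driven rsync in anger, here's a rough sketch of the kind of filesystem-level synchronisation involved: a thin Python wrapper around an rsync invocation. The host, paths and option choices are purely illustrative, not taken from the actual internal configuration.

import subprocess

def mirror_tree(remote, local, dry_run=False):
    """Mirror a remote tree into a local directory via rsync (illustrative only)."""
    cmd = [
        "rsync",
        "-a",          # archive mode: preserve permissions, times, symlinks
        "--delete",    # remove local files that no longer exist remotely
        "--compress",  # trade CPU cycles for WAN bandwidth
        "--partial",   # keep partially transferred files so they can be resumed
    ]
    if dry_run:
        cmd.append("--dry-run")
    cmd.extend([remote, local])
    return subprocess.call(cmd)

if __name__ == "__main__":
    # e.g. pull a hypothetical tree from a central server down to this site
    mirror_tree("rsync://mirror.example.com/some/tree/", "/srv/mirror/some/tree/")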

However, the current system is also showing its age and has a few limitations that make it fairly awkward to work with. Notably, there's no single place to go for an overview of the entire internal mirroring setup, and the direct use of rsync means the system doesn't share WAN bandwidth gracefully with other applications, while the servers involved waste quite a few cycles recalculating the same deltas for multiple clients. Hence this project, which is intended to replace the existing system with something a bit more efficient and easier to manage, while also providing a better platform for adding new features.

Enter Pulp

Pulp is an open source (Python) project created by Red Hat to make it easier to manage private yum repositories. Via Katello, Pulp is one of the upstream components for Red Hat's CloudForms product.

The Pulp project is currently in the process of migrating from their original yum-specific architecture to a more general-purpose Generic Content plugin architecture. It's that planned plugin architecture that makes Pulp a useful basis for the next-generation internal mirroring system, which, at least for now, I am imaginatively calling pulpdist (referring to both "distribution with Pulp", since that's what the system does, and "distributed Pulp instances", since that's how the system will work).

The main components of the initial pulpdist architecture will be:

I'll be writing more on various details that I consider interesting as I go along. Initially, that will include my plan for the mirroring protocol to be used between the sites, as well as various decisions that need to be made when spinning up a Django project from scratch (while many of my specific answers are shaped by the target environment for internal deployment, the questions I needed to consider should be fairly widely applicable).