TripleO and Golden Images

I spend a good deal of my time these days working upstream on OpenStack TripleO. TripleO has made a lot of great progress recently at being able to deploy and manage a production-quality OpenStack cloud. Lately I’ve seen a lot of growth in community activity and interest.

I figured a good way to provide some background on TripleO’s architectural concepts would be to do a series of blog posts going into a little more detail about these points and the reasoning behind them.

First off though, I don’t take credit for the ideas I’m going to cover :). A lot of hard-working folks have helped shape TripleO into what it is today.

Secondly, there are lots of other great resources out there on TripleO. A YouTube search will pull up a lot of videos and presentations from different folks, including some from community members who bootstrapped the project. These are excellent resources, so if you have further interest, check those videos out!


So, jumping into the first topic I’d like to talk about: Golden Images.

TripleO’s goal is to deploy OpenStack using OpenStack itself wherever possible. Given that the unit of deployment in the typical cloud model is an image (qcow2, raw, OVF, etc.), it’s not surprising that TripleO deploys to physical baremetal nodes using pre-built images. The deployment process itself of course uses Nova, which has two available baremetal drivers: nova-baremetal and Ironic. They both work roughly the same way: when launching an image on a baremetal node, the qcow2 image is converted to raw and dd is used to write the bits to the physical disk of the baremetal node over iSCSI.
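To make that concrete, here’s a rough sketch of what the deploy boils down to; the driver automates all of this over an iSCSI session, and the file and device names below are just placeholders:

```bash
# Roughly what the baremetal deploy does under the hood (names are placeholders;
# the driver handles the iSCSI export and the target device automatically).
qemu-img convert -O raw overcloud-compute.qcow2 overcloud-compute.raw
dd if=overcloud-compute.raw of=/dev/sdX bs=1M oflag=direct
```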

The images contain all of the operating system bits, the OpenStack software, and initial bootstrap configuration. They are “installed” images, as opposed to installation media. This can be a bit confusing at first because it’s not necessarily typical when deploying to baremetal. You might instead be used to doing baremetal provisioning by actually running an operating system installer such as Anaconda with a kickstart file. The TripleO process is more akin to baremetal imaging than to baremetal provisioning in the traditional sense.

This model makes good sense for cloud. When you boot a VM in a cloud, you’re booting an image that is already installed, i.e., you don’t have to run an installer once the VM is up. So it makes sense that TripleO would take this model and apply it to baremetal deployments as well. Remember, the whole point of TripleO is to prefer and use OpenStack itself.

The deployment process is also much quicker than provisioning via installers. You’re typically only bound by network speed and disk I/O. A rack of baremetal servers with a gigabit switch and all SSDs can be provisioned in just a few minutes.
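As a rough back-of-the-envelope check on that claim (the 4 GB image size is just an assumption for illustration), the image transfer itself is on the order of seconds per node:

```bash
# ~4 GB raw image at gigabit wire speed (~125 MB/s): roughly half a minute
# per node for the transfer alone. Numbers are illustrative.
echo $(( 4 * 1024 / 125 ))   # => 32 (seconds)
```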

To that end, TripleO provides an image building tool, diskimage-builder, which supports building images for many well-known Linux distros. The output from diskimage-builder is a qcow2 image that can be written directly to a baremetal node’s physical disk. diskimage-builder customizes images by applying what it calls elements during the build process. At their core, elements are just scripts. The script-based nature of elements provides a practically universal entry point for any customization method you choose. You can write elements to apply Puppet modules, install distribution packages, or even run custom scripts.
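As a quick sketch of what an element looks like in practice (the element name, script name, and packages here are made up for illustration, not taken from any real element):

```bash
# A made-up element, "my-tools": just a directory of scripts grouped by build
# phase. Scripts under install.d/ run inside the image chroot during the build.
mkdir -p elements/my-tools/install.d
cat > elements/my-tools/install.d/50-install-tools <<'EOF'
#!/bin/bash
set -eux
# install-packages is diskimage-builder's distro-agnostic package helper
install-packages vim tmux
EOF
chmod +x elements/my-tools/install.d/50-install-tools

# Build a qcow2 image from a base distro element plus the custom element
export ELEMENTS_PATH=$PWD/elements
disk-image-create -o my-image fedora my-tools
```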

There’s an existing repository of elements for setting up OpenStack software called tripleo-image-elements. For TripleO’s purposes, this is where all of the logic for installing and configuring OpenStack lives.
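A build that pulls in those elements might look roughly like this; the element names are placeholders, so check the repository for the actual elements your release uses:

```bash
# Hypothetical build using OpenStack install logic from tripleo-image-elements
# (assumes the repository has already been cloned locally; element names are
# placeholders, not an exact list).
export ELEMENTS_PATH=$PWD/tripleo-image-elements/elements
disk-image-create -o overcloud-compute fedora nova-compute
```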

It’s a valid concern that it could be difficult to build a qcow2 image that will boot on all available hardware out there. But that concern shouldn’t be overstated. Practically every major Linux distribution can produce a live ISO variant (some in fact *only* produce live ISOs) that will boot on 99% of commodity hardware, so the same can be accomplished here.

That sounds great, but what if you aren’t using commodity hardware, or have a heterogeneous environment where, for instance, some hardware uses specialized network cards and some doesn’t? That’s a problem that would be solved during the image build process itself. You could write an element that adds driver support for the hardware. If you didn’t want this driver in all your images, you could build a set of images just for the special hardware, and make sure the correct images are used for that hardware in any number of ways (a custom Nova flavor, for instance).
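A hypothetical sketch of that approach, with the element, package, and flavor names all invented for illustration:

```bash
# Invented element that adds an out-of-tree driver, built only into the images
# meant for the specialty hardware.
mkdir -p elements/special-nic/install.d
cat > elements/special-nic/install.d/60-special-nic <<'EOF'
#!/bin/bash
set -eux
install-packages vendor-nic-dkms   # placeholder package name
EOF
chmod +x elements/special-nic/install.d/60-special-nic

# Build a separate image just for those nodes...
export ELEMENTS_PATH=$PWD/elements:$PWD/tripleo-image-elements/elements
disk-image-create -o overcloud-compute-specialnic fedora nova-compute special-nic

# ...and give them their own flavor so the right image lands on the right
# hardware (old nova CLI; values are illustrative: RAM MB, disk GB, VCPUs).
nova flavor-create baremetal.specialnic auto 65536 1024 16
```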

This is really no different conceptually from what you might have to do if you were using an operating system installer for baremetal provisioning instead. Let’s say you’re installing RHEL throughout your environment and you need to add support for some set of specialty hardware because that support is not available in base RHEL. You likely host a yum repository internally, or use a custom Red Hat Satellite channel, to host the third-party packages. You then enable that repository and install the packages in your kickstart file.
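The kickstart equivalent might look something like this (the repository URL and package name are placeholders):

```
repo --name=internal-extras --baseurl=http://repo.example.com/extras/

%packages
@core
vendor-nic-dkms
%end
```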

In either case, you’re enabling support for specialty hardware by installing additional drivers or packages. It doesn’t really matter if you do that at image build time or system installation time.

What about the “golden” part of the images? The “golden” implies that the images you plan to deploy in production are “known good”: the exact same images that you have tested and that have passed your CI process. One of the things that makes this attractive is that there is much less room for drift across your environment. If all your systems are deployed from the same set of images, that should eliminate questions about which package sets are on which systems, what updates have been applied where, etc.

To abuse a software engineering term, in a way deploying via images is more “idempotent” than running installers. Which is more likely to produce the same bit-for-bit result every time? Converting a qcow2 file to raw and dd’ing it to a physical disk, or yum/apt-get installing hundreds of packages, many of which run scripts?

The idempotent nature of this deployment model is especially important in CI/CD environments, where you want to be sure you’re deploying what you’ve actually tested. And that’s what I plan to highlight in my next post: CI/CD and how it fits in with TripleO. Stay tuned.

 


3 Responses to “TripleO and Golden Images”

  • Duncan Thomas says:

    A nice analysis and explanation of golden images, thanks.

    What you don’t mention is their downsides: upgrades. If you update the golden image then you need to reboot; you can’t use (pure) golden images and avoid rebooting (unless you do some really smart, probably hard-to-get-right pivot-root magic). If you /do/ want to support live upgrades, then you need to support something installer- or package-based as well, and so now you’ve got two things to manage and similar risks of drift, etc.

    • slagle says:

      Duncan, apologies on a late reply. I definitely need to set up email notifications for comments on this blog :).

      You are absolutely correct about the downsides around upgrades that you have pointed out. And of course there are other downsides as well, such as build/test times and image management.

      I think it’s about trade-offs. We can’t honestly expect people (or large enterprises) to abandon entrenched technologies like package-based updates or installers overnight. If we can deliver tooling that allows users to test out potentially disruptive deployment methodologies like golden images without adopting them wholesale, I think we would see more willingness to try these methods.

      It’s hard to argue the merits of forcing a reboot to apply a small bug fix or security update. I would suggest we offer a compromise there: allow users to update systems deployed from golden images using their existing update tooling.

      Where golden images start to gain some ground is when you start considering continuous deployment and integration. It’s difficult to CI test a package upgrade if you have thousands of nodes with a scattering of different package sets or one-off changes. Systems management and devops tools can help, but if the goal of those tools is just to ensure that deployed systems stay in sync in terms of running software versions, it makes sense to me to ensure that consistency on the front end by building, deploying, and upgrading via golden images.

      • If you do CI the way GitHub, Facebook, etc. do CI (100+ changes a day), then the need for a reboot starts to become an absolute blocker. Assume it takes 10 minutes to reboot and resync a node; that’s over 16 hours a day the node is not providing useful service, which is not really useful.

        I absolutely agree on the advantages of golden images; I’ve plenty of experience using them in an HPC cluster context. But providing continuity of service during upgrades on a golden-image-based system is a really, really hard problem. The pivot-root-style switching between two golden images on a live, running system that is being looked at in TripleO has potential, I think, though the test matrix is going to be fun to get right. I suspect a mixture of that approach and a hybrid ‘push this fix now’ toolset of some form is going to be the way forward.

