automation

Archived Posts from this Category

making updates easier

Posted by rhelmer on 30 Jul 2008 | Tagged as: automation, mozilla, releng

For a few months now, I’ve been working in my spare time on a way to make configuring and serving updates to Mozilla-based applications easier.

Mozilla updates are MAR files, which are linked to by the Automatic Update Service (aka AUS2). Several tools are involved in making updates for production releases, chiefly Patcher, which is driven by the release automation framework. Nightly updates use a simpler script which automatically determines where builds should be updated to; Patcher needs every update path to be explicitly specified in its config file.

Both Patcher and the nightly script call the update-packaging tools to do the work of generating MAR files, which in turn use the “mar” utility (supports tar-like arguments to manipulate MAR files, e.g. “mar -t file.mar”, “mar -x file.mar”, etc.) and the “mbsdiff” utility, which generates binary patches using a modified version of bsdiff.

The update-packaging tools are in need of a makeover too, but that is a story for another day.

Getting back to how updates are served – Patcher’s other job is to generate thousands of text files, which are used to configure AUS. Every possible update path, like this one for 3.0b3, is actually generated dynamically from two text files (partial.txt and complete.txt) which reside in a directory layout that holds the same information as that URL (…/product/version/buildid/buildTarget/locale/channel/update.xml), but in a slightly different order. These complete.txt and partial.txt files have gone through two revisions of their file format. In the first, variables for the generated XML (updateType, URL to the MAR file, etc.) each sit on a specific line number. In the second (“version=1”), key/value pairs are used.
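To make the URL layout above concrete, here is a minimal sketch that splits an update URL path into the six fields named in the post. The function and class names are made up for illustration, and the example path (including its prefix and values) is an assumption, not taken from a real server.

```python
from typing import NamedTuple


class UpdateQuery(NamedTuple):
    """The six fields of an update URL, in the order given in the post."""
    product: str
    version: str
    build_id: str
    build_target: str
    locale: str
    channel: str


def parse_update_path(path: str) -> UpdateQuery:
    """Split '.../product/version/buildid/buildTarget/locale/channel/update.xml'.

    Anything before the six fields (host, prefix) is ignored.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) < 7 or parts[-1] != "update.xml":
        raise ValueError("not an update.xml path: %r" % path)
    # The six components immediately before 'update.xml'.
    return UpdateQuery(*parts[-7:-1])


# Hypothetical example path; the values are illustrative only.
query = parse_update_path(
    "/update/1/Firefox/3.0b3/2008020514/WINNT_x86-msvc/en-US/beta/update.xml"
)
print(query.product, query.version, query.channel)
```

Using a named tuple keeps the field order explicit, which matters here because the on-disk directory layout uses a different order than the URL.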

AUS2 configuration files only reflect the current state of the system; for releases the history is in Patcher config files (Config::General). The release automation scripts automatically update and check this file into CVS, so it’s not too painful to deal with in most situations. There are some outstanding bugs but overall it does what it is supposed to do.

However, it took me a very long time to get a handle on the above, and I think the separation between Patcher and the AUS server is not very useful. In fact, the method of explicit updates for all is downright unhelpful; for every single release (e.g. 2.0.0.15), the following happens:

  1. partial updates are generated from 2.0.0.14->2.0.0.15
  2. every previous release (2.0.0.[1,2,3,4,...]) is pointed to the same 2.0.0.15 update

That means generating and publishing two text files for each (release * platform * locale) combination, which all contain exactly the same data. Also I think that taking a hint from the way the nightly system works would be useful here; 2.x should automatically point to the latest *unless* explicitly overridden, it should not require explicit configuration to do the norm. Finally, the nightly and production system should not be so different; every nightly update is a lost opportunity to test pre-releases of the production system, and having forked systems is bad for bugfixing and feature porting (note that there are no nightly updates for locales other than en-US, for example).
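To give a feel for the file count described above, here is a back-of-the-envelope sketch. The numbers of releases, platforms and locales are illustrative assumptions, not actual figures for any Firefox release.

```python
def config_file_count(releases: int, platforms: int, locales: int,
                      files_per_path: int = 2) -> int:
    """Number of text files needed when every update path is explicit.

    files_per_path is 2 because each path gets a partial.txt and a
    complete.txt.
    """
    return releases * platforms * locales * files_per_path


# Hypothetical: 15 earlier releases, 3 platforms, 40 locales.
print(config_file_count(releases=15, platforms=3, locales=40))
```

With those made-up inputs the result is 3600 files for a single new release, nearly all of them carrying identical data.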

So, I’ve been thinking for a long time about how to make tools that are easier to use, understand and extend. One idea is to have the AUS server configuration be a database, not a giant tree of text files, and have the data in one place (not stored in a config file which is expanded to a giant tree of text files by a separate app). Another is to provide a simple API, and a few command line tools which use this API to modify update data and export it.

The conceptual model right now is that each release contains one update, which contains two patches (one partial, one complete). Both the database schema and the API reinforce this model.
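One way to picture that model is the sketch below. It is not the prototype's actual schema or API; the class and field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Patch:
    patch_type: str      # "partial" or "complete"
    url: str             # where the MAR file lives
    hash_function: str   # e.g. "SHA1"
    hash_value: str
    size: int            # bytes


@dataclass(frozen=True)
class Update:
    update_type: str     # e.g. "minor"
    version: str         # version this update leads to
    partial: Patch
    complete: Patch

    def __post_init__(self) -> None:
        # Enforce "one partial, one complete" from the conceptual model.
        if self.partial.patch_type != "partial":
            raise ValueError("partial slot needs a partial patch")
        if self.complete.patch_type != "complete":
            raise ValueError("complete slot needs a complete patch")


@dataclass(frozen=True)
class Release:
    product: str
    version: str
    update: Optional[Update] = None  # None: nothing newer is available
```

Making the two patch slots explicit fields, rather than a list, is what "reinforces" the one-partial-one-complete rule: there is simply nowhere to put a third patch.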

Here’s what I have working so far. In case it’s not obvious, this is most definitely an early “throw the first one away” prototype:

The schema is based on Lars’ fine work on the subject, although I did modify it slightly. This schema is not totally done yet either, for example foreign keys aren’t actually hooked up, but there’s enough there to see that it works. There’s a run.py command in that directory that calls the importer and exporter correctly.

This means that you can read existing AUS2 data into a database (if you have it), and create or manipulate update information using the API from Python (or directly with SQL, if you like). You can generate update.xml files and put them straight onto a webserver.
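As a rough illustration of the export step, the sketch below writes an update.xml document using Python's standard library. It is not the prototype's exporter. The element and attribute names follow the update.xml layout as I understand it and should be treated as assumptions to check against a real AUS2 response; the URLs and hashes are placeholders.

```python
import xml.etree.ElementTree as ET


def build_update_xml(update_type, version, build_id, patches):
    """Return update.xml text for one update with its patches.

    patches is a list of dicts with the (hypothetical) keys:
    type, url, hash_function, hash_value, size.
    """
    root = ET.Element("updates")
    update = ET.SubElement(root, "update", {
        "type": update_type,
        "version": version,
        "buildID": build_id,
    })
    for patch in patches:
        ET.SubElement(update, "patch", {
            "type": patch["type"],
            "URL": patch["url"],
            "hashFunction": patch["hash_function"],
            "hashValue": patch["hash_value"],
            "size": str(patch["size"]),  # XML attributes must be strings
        })
    return ET.tostring(root, encoding="unicode")


print(build_update_xml("minor", "2.0.0.15", "2008062001", [
    {"type": "complete", "url": "http://example.invalid/complete.mar",
     "hash_function": "SHA1", "hash_value": "0000", "size": 9000000},
    {"type": "partial", "url": "http://example.invalid/partial.mar",
     "hash_function": "SHA1", "hash_value": "1111", "size": 500000},
]))
```

The returned string can be written to a path/update.xml file on a webserver, or returned directly by a web service.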

What I’ve put together needs quite a lot more work, but I wanted to open it up for comment. Here’s what I think is remaining, at least:

  • database should hold the history of updates, not just the current state
  • need a web service which talks directly to the database, as an alternative to pre-generating all update.xml files.
  • should use existing libs for the DB ORM (SQLAlchemy maybe?), generating XML, etc. not the home-grown things I threw together
  • I think it would be advantageous to make the model/schema/API more sophisticated and normalized (e.g. updates could belong in multiple channels), but I don’t want to go beyond the essentials quite yet.
  • the new update-packaging tools should be able to read data from this system in order to automatically determine the appropriate “from” release to base partial MARs on, and also there should be some way to register that new updates are available, that access would be internal and append-only (e.g. only needs SELECT, INSERT).

I think that to solve the first, update paths should be explicitly configured once, but there needs to be business logic in the server app (or update.xml file generator) which overrides this when a newer release is available. For instance, if a user is on version 1.0 and version 1.1 is available which has a partial for 1.0, then the partial 1.0->1.1 should be served. However, if version 1.2 is available, then the complete 1.0->1.2 update should be served.
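The rule in the 1.0/1.1/1.2 example can be sketched as a small function. The data structures here (an ordered list of releases and a map of which partials exist) are assumptions for illustration, not the prototype's API.

```python
from typing import Dict, Optional, Sequence, Set, Tuple


def choose_patch(current: str,
                 releases: Sequence[str],
                 partials: Dict[str, Set[str]]) -> Optional[Tuple[str, str]]:
    """Pick the update to serve for a user on `current`.

    releases: versions ordered oldest to newest.
    partials: maps a target version to the set of versions that have
              a partial patch leading to it.
    Returns (target_version, patch_type), or None if already up to date.
    """
    latest = releases[-1]
    if current == latest:
        return None
    # Serve the partial only if it goes straight to the newest release.
    if current in partials.get(latest, set()):
        return (latest, "partial")
    # Otherwise fall back to the complete update to the newest release.
    return (latest, "complete")


# The example from the text: 1.1 ships with a partial from 1.0.
print(choose_patch("1.0", ["1.0", "1.1"], {"1.1": {"1.0"}}))
# Once 1.2 is out, a 1.0 user gets the complete 1.0->1.2 update.
print(choose_patch("1.0", ["1.0", "1.1", "1.2"],
                   {"1.1": {"1.0"}, "1.2": {"1.1"}}))
```

The point of the sketch is that the configured path (1.0->1.1) never has to be rewritten when 1.2 ships; the override happens at lookup time.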

The second problem has more to do with the burden inherent in handling tens of thousands of text files (e.g. backing them up or restoring them can take a very long time), although I believe that it is useful to have the option to pregenerate the path/update.xml files, especially for people who don’t have as many updates as mozilla.org pushes each release.

Anyway, comments welcome! Certainly feel free to nudge me if it looks like I’m going off the rails here, but I think this approach could make things a little better in update-land. I’ll take patches too, but if anything serious comes of this I’ll probably clean up and move over to Mozilla’s repo, and rewrite a bunch, so don’t take the current implementation too seriously.

releases on tap

Posted by rhelmer on 10 Jul 2008 | Tagged as: automation, mozilla, releng

One of the things that was pounded into me while working at MoCo is the idea of having a bug tracker and using it. I literally can’t work without one anymore. It’s the first thing I really pushed for at my new job (they were using various ad-hoc systems for project management, but not a real bug tracker for the software dev side). I’ve realized that I just can’t keep everything in my head, various notepads and text files, etc. and expect to get anything done, or let anyone know what my priorities are.

In return, I really tried to hammer in the idea of fast, automated release cycles. We spent a lot of time (and the release engineering team does still spend a lot of time) wrapping the build system and other tools so that they can be run and the output verified automatically, chasing that ideal of the Formula One-style hand-off to QA and to the users.

The way releases work now is incredible, just night and day from when I started at MoCo a little over two years ago. However, there’s one thing that’s always bugged me, and since I just had the opportunity to set up an automated build/release environment, I thought I’d expound a little bit on it.

The one thing is that nightly builds of Firefox just aren’t the same as the release builds. The way updates work is different, branding is turned on, bits are signed (on Windows), the directory structure for files is different. Firefox releases are actually rebuilt from source for each release.

So what? None of these, even added up, are a big deal, right? Obviously releases work fine, and there are a ton of great people (and the tools they’ve made) that make sure that nothing is missed because of this. But wouldn’t it be great if we could just take the nightly updates and builds that have already been put through the wringer by thousands of people, and give those straight to QA? Or if we can’t have that, how about at least having the release builds put through the same tests and available to QA immediately after checkin?

Am I pushing some fanciful, architecture-astronaut utopian vision? I don’t think so, because this is how I’ve done releases in the past, and this is how I do releases now. Let me tell you about it.

I use Hudson, which I can’t recommend highly enough (well, if you’re not allergic to Java, I guess). It makes this kind of process easy. It’s not necessary to use it to achieve this of course, I’m just throwing this out as a data point.

On each checkin:

  • a unique build number is generated
  • a new build is generated (I also have it run unit tests, and install the software to run functional tests)
  • release files and other artifacts like build logs are archived, and checksums of the files are stored
  • if anything goes wrong, the team and the developer who checked in the latest change are notified

The software is available to QA as soon as this automated process is complete. When it’s time to release, I can tag the build via the web UI (although it’s easy enough to do outside of Hudson if you have the build number, which in turn contains the branch/datestamp/revision info needed).
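The archiving step in the list above, together with a build identifier that carries branch/datestamp/revision info, could look roughly like the sketch below. This is not how Hudson does it internally; the label format and function names are invented for illustration.

```python
import hashlib
from pathlib import Path


def build_label(branch: str, datestamp: str, revision: str,
                build_number: int) -> str:
    """Compose a build identifier that carries the info needed to tag later."""
    return "%s-%s-r%s-b%d" % (branch, datestamp, revision, build_number)


def checksum_manifest(artifact_dir: str) -> dict:
    """Map each archived file (relative path) to its SHA-256 hex digest."""
    base = Path(artifact_dir)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(base).as_posix()] = digest
    return manifest
```

Storing the manifest next to the archived files means the exact bits that QA tested can be verified again at release time.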

Having the next release always “on tap” makes it easy for me to largely ignore the build/release side of things, and focus on developing software, writing tests, and tracking down problems.

Now, Mozilla’s situation is way more complicated, which I alluded to a bit earlier. This post isn’t a “see what I can do!” rant as much as a “look what’s possible!” idea. I think that this kind of setup is totally doable for Mozilla’s products, but there are some serious issues:

  • branding is turned on at compile time. having nightly builds not called “Firefox” is a *good* thing, as otherwise end-users would be very confused.
  • “--enable-tests”, needed for unit tests, cannot be used in release builds at the moment (for technical reasons outside the scope of this post; I’m sure there are bugs on this)
  • release builds have a different filename format and directory structure (e.g. “firefox-3.0.pre.en-US.win32.installer.exe” for nightly versus “3.0/win32/en-US/Firefox Setup 3.0.exe”)
  • release builds are cryptographically signed, to assure users that these files really were created by MoCo (regardless of what mirror or download site they may have come from).
  • nightly updates are only for en-US, and use a different set of tools to generate updates, and a different mode of the update server to serve updates (some ideas for fixing these problems are in bug 410806, but again this is outside of the scope of this post)

So all of these are pretty much good things (branding, signing, etc.) or technical issues that could surely be fixed (nightly updates, unit tests). Arguably, nightly users and release users tend to be very different people, with very different needs and expectations, so all of the “intentional problems” here are really good things. This pretty much eliminates the possibility (as far as I can see) that Firefox release engineers could take a nightly build and be able to ship that as a release build.

Even if the branding issue were solved (e.g. repackaging), signing still needs to be done, partial diff files would need to be regenerated, and probably other things that I’m overlooking. The automated tests that were run on the nightlies may not be applicable (you may scoff at the paranoia, but there was a bug regarding the size of the Vista icon in official branding found late in the Fx3 beta cycle which caused a bunch of grief. This situation was improved by making a Minefield version of the same icon, which is a good fix, but I think my point still stands).

Here’s another option – why not create a real, honest-to-god Firefox release build on each checkin (or at least alongside each nightly build)? This at least makes it available to QA as soon as humanly possible, and it could probably be opened up somehow to interested community testers (human-triggered builds are, right now, just put into a special area).

Maybe I’m just spoiled working on little tiny projects, but I think even the already super-fast and extensively tested Firefox releases could be made super-faster and the tests extensivlier, at the cost of freeing up the release engineers of the need to babysit the One and Final Release Build.

openSUSE build service

Posted by rhelmer on 09 Jul 2008 | Tagged as: automation, releng

Looks like the openSUSE build service can package up your software for a bunch of different Linux distributions, cross-compile, track upstream project dependencies (e.g. rebuild your GTK app when GTK changes), and runs on their servers so you don’t have to maintain the thing.

Add in Windows and Mac support and they might have something there :P

This might be a great idea for a higher-level “cloud computing” service; setting up and maintaining this kind of infrastructure is a huge problem for many companies.

rel-o-mation slideware!

Posted by rhelmer on 04 Apr 2008 | Tagged as: automation, mozilla, releng

I put this set of slides together to explain what state the release automation project is in. It probably makes more sense when I am sitting there to explain what each point means, but I figured I’d put it out there anyway :)

The current setup mimics ye olde manual release process, forged by Chase. Over the past few years we’ve worked on wrapping that process in scripts with this Perl framework (aka “bootstrap”), which auto-generates configs for underlying systems like tinderbox and patcher, checks logs for errors, etc.

A lot of the current bugs come from underlying systems, especially the tinderbox client. Reducing some of the complexity here would both make the system more understandable and most likely faster as well. It’s pretty tough to make changes when you’re doing this level of wrapping, too.

Now that everything is driven by Buildbot, it probably makes the most sense to call the build system directly, instead of buildbot->bootstrap->tinderbox->build_system that we have today. There are bugs on all of this already, hopefully the slides and this post will make it clearer how they tie together.

moving 1.8 nightlies to release machines March 5 2008

Posted by rhelmer on 04 Mar 2008 | Tagged as: automation, buildbot, mozilla, releng, tinderbox

As previously announced on Tinderbox and planet, we’re migrating nightly production to running on the same machines as release production.

On the moz1.8 branch, we’ve been running the new nightlies in parallel with the “traditional” nightlies since Feb 15 2008, and are going to switch over live tomorrow.

The new machines:
* production-pacifica-vm
* production-prometheus-vm
* bm-xserve05

The old machines:
* pacifica-vm
* prometheus-vm
* bm-xserve02

Starting tomorrow, the performance machines will begin following the new machines. The new machines will publish updates and nightly builds to the usual location, and the old machines will be disabled (but kept around for a while, just in case).

If there is a reason that we should not proceed, or if you see any problems after the migration, please update bug 417147 or email build@mozilla.org.

Thanks!
Rob

moving nightly Mozilla1.8 Firefox to release automation system

Posted by rhelmer on 14 Feb 2008 | Tagged as: automation, buildbot, mozilla, tinderbox

I’ve just enabled nightly builders from the release automation system on the Mozilla 1.8 tree (see bug 417147 for details).

I’ve blogged on this previously, but just to reiterate some of the reasons:

  • unify the (very fragmented) nightly and final release processes (tools, procedure, etc).
  • move away from Tinderbox client to Buildbot
  • use the same set of machines for both nightly and release

The first point is a really big one for me. Using totally different tools for nightly and release means that we don’t get much testing of our release-only procedures and tools, so we often hit unexpected bugs on release day. It also leaves nightly users without the benefits we provide for releases: automated update verification, updates for all locales, thorough error checking and monitoring of build machines, and automated staging runs before pushing changes live, for a start.

The current setup still uses Tinderbox, it’s just being invoked by Buildbot, so developers should notice no change besides new hostnames. We’re trying this out on 1.8 branch first before we tackle 1.9, so far it has been quite smooth but please let us know if you notice anything out of the ordinary. We have not switched over perf tests yet, but we expect the results to not change (although we may want to merge some graphs for developer convenience, etc). This will happen before the old machines are turned off.

We’re planning on turning off the older 1.8 builders sometime after February 25th, so please do let us know if you see any problems. I’ve left a note with the names of the new builders at the top of the Mozilla1.8 Tinderbox tree.

This is only one tiny step towards improving life both for the build&release group and also developers and nightly testers, but it’s quite significant from an infrastructure point of view, and has been brewing for a long time. I’m not sure what the next steps are going to be, but I’ve written up some thoughts on where I think we should go and why.

perf impact on nightly release automation move

Posted by rhelmer on 28 Dec 2007 | Tagged as: automation, buildbot, mozilla, tinderbox

If you care about the behavior of the Firefox perf test machines, please check out my post moving Mozilla1.8 tinderboxes to Buildbot – perf impact in the mozilla.dev.performance newsgroup.

The big question is whether we can move to a model where we only build on checkin rather than continuously. This change would mean faster build turnaround times for developers, and a reduced load on build machines. It also means that the perf machines cycle less often. Currently, there’s no way to disambiguate the start time of the run from the latest revision in the build (for CVS, revision in this sense is branch+timestamp); Tinderbox and the graph servers all expect build and perf run to be the same thing.

In case you’re wondering why I’m worried about the Mozilla1.8 tree, if all goes well with this rollout we’ll want to do it this way on Firefox tree as well; the Mozilla1.8 branch is stable and already on release automation, so we think it makes sense to start there first.

tinderbox to buildbot: moz18 branch

Posted by rhelmer on 19 Dec 2007 | Tagged as: automation, buildbot, mozilla, releng, tinderbox

I’ve set up the release automation staging server for the Mozilla 1.8 branch (Firefox 2.x) to also generate nightly builds and depend builds on checkin to the branch (using buildbot’s BonsaiPoller). I outlined some of the advantages to this release automation/nightly+depend integration in my previous post.

You can see the results on the Mozilla1.8-Staging Tinderbox tree

The main impediment to taking this live is the performance test machines. These machines currently only cycle when a new build is available, but ideally we’d want them to keep re-testing the same build as many times as possible, to get more stable test results. Since the Tinderbox-driven depend builds currently cycle continuously instead of waiting for checkin, we tend to get several test builds for the same source code as a side effect.

These machines forge their start time to match that of the build they are testing, which allows for easily matching up checkins and build times to performance results, but this doesn’t really make sense if we’re doing multiple test runs per build.

I’ve started a thread in the mozilla.dev.builds newsgroup with the subject “moving Mozilla1.8 tinderboxes to Buildbot” for general discussion about this idea.

tinderbox to buildbot, step 1

Posted by rhelmer on 06 Dec 2007 | Tagged as: automation, buildbot, mozilla, releng

As I mentioned previously, I’ve been working on incrementally moving our Tinderbox client installs over to Buildbot.

The first step is to switch to driving Tinderbox from Buildbot and our release automation system, instead of having it driven on each machine by the multi-tinderbox script. The release automation still calls Tinderbox client underneath, so features like CLOBBER support and all of your other favorites remain.

I have the staging automation builders publishing to the Tinderbox MozillaStage tree. Note that it’s using Mozilla1.8 builders but firing off builds when it sees trunk checkins; this is because I want to make sure the Bonsai polling is working and Mozilla1.8 doesn’t get very many checkins :) Also, I’m trying to stay out of the way of the ongoing trunk automation work that bhearsum is driving (AKA, letting him find and fix the trunk+release_automation bugs before I add nightly support). Expect to see trunk nightlies up there in the next few weeks.

This has several advantages right off the bat:

  • same release process we use for production (currently a very small subset)
  • only build on checkin, should help cycle times
  • same builders used for nightly and production releases (admittedly, this is how it used to be before release automation; but now we can let Buildbot handle the queuing instead of either interrupting depend/nightly builds or running multiple builds on the same machine, which is slow)

As we continue to make the nightly and final release process more alike, we can start taking advantage of things that only final releases have but are missing on nightlies:

  • updates for l10n (only en-US gets updates currently :( )
  • update verification
  • publishing the source tarball used to build
  • using the same timestamp for all platforms

On the administration side, it should let us manage builders centrally, more quickly and easily deploy new builders, and with a little more work will let us parallelize builds (multiple build machines per column, or buildslaves per builder in Buildbot parlance), which should further help cycle time (no having to wait for the current build to finish to get a build started with your fresh checkin).

Comments/questions/concerns welcome! Feel free to email me, the build group, or post in mozilla.dev.builds newsgroup if you’d like to discuss further.

Migrating Tinderbox to Buildbot

Posted by rhelmer on 25 Nov 2007 | Tagged as: automation, buildbot, mozilla

I’ve started working on migrating the Firefox nightly builds to use the same release automation system that we’ve been developing for the past year or so for maintenance releases (Firefox and Thunderbird 1.5.0x and 2.0.0.x). The reason this is important is that each nightly release (installer, update, etc.) is practice for the real thing, and we should be using the same tools and verification processes wherever possible (right now both Nightlies and Releases use Tinderbox client [version 1] for build and l10n repack, all other aspects of the release process are not tested until the first release candidate. Well, we have a staging server that tests the release automation in isolation, but it’s not the same as having real nightly testers looking at the results :) ).

The scope of the current release automation framework (Bootstrap) has been to leave as much of our existing process in place as possible, and not try to simplify or optimize. This kind of low-risk approach is the right thing to do when you’re overhauling the release process on a maintenance branch, but it has created many layers of frameworks:

Buildbot->Bootstrap->TinderboxClient->MozillaBuildSystem

As you can imagine this can be quite nightmarish to debug and add features to. I believe strongly in backwards compatibility and incremental development, but the Bootstrap and TinderboxClient layers are largely invisible to anyone outside of the Mozilla Build&Release team.

I think where we really want to be is:

Buildbot->MozillaBuildSystem

Wherever possible, Buildbot should do the same things a developer would do, and the configuration should be as clear as possible to read and modify.

I have some thoughts on how to get there on the wiki. The first step is to slot Bootstrap into place, which is actually pretty easy as it just calls Tinderbox Client anyway. The larger work here is moving to the more direct “Buildbot->MozillaBuildSystem” scenario, for which I have a working prototype and its configuration, if anyone is interested in seeing more.

Note that I’m not talking about changing Tinderbox Server or any of the existing mechanisms that developers use to clobber builds or check build status. bhearsum added Tinderbox Server and Bonsai support to Buildbot a while back, so builds show up on Tinderbox and we can configure them to be triggered only on checkin (as opposed to the continuous loop that Tinderbox Client currently does).

I have started a newsgroup thread in mozilla.dev.builds (subject: “Migrating Tinderbox to Buildbot”), please follow up there if you’d like to discuss.

EDIT – fix typo
