on moving to buildbot for reals

People are often very confused by the state of where Mozilla is with regard to Tinderbox versus Buildbot. They are both continuous integration systems, and you’d think that just jumping wholesale would be easier than the unholy marriage I’ve described in the past.

The big distinctions are these:

  • server vs. client - Buildbot clients and server are tightly coupled, and communicate through an active TCP connection (managed by Twisted). Tinderbox clients simply send email to the server, one for build start and one for build stop (build stop has the status specified, which changes color on Tinderbox server). The logfile for the build may be attached to the “end” email.
  • Tinderbox server vs. Buildbot server - tinderbox.mozilla.org puts up with a lot of load, and Buildbot server probably cannot handle that much. Also, Tinderbox server has a bunch of features that Mozilla developers depend on, like setting tree status.
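To make the client side concrete, here is a rough sketch of the two mails a Tinderbox-style client sends per build. The field names and status strings are made up for illustration and are not the real Tinderbox mail format; only the shape (one "start" mail, one "end" mail that carries the status and optionally the log) comes from the description above.

```javascript
// Hypothetical sketch: field names and status values are illustrative,
// not the actual Tinderbox mail format.
function buildStatusMail(tree, buildName, startTime, status, log) {
  const lines = [
    "tree: " + tree,
    "build: " + buildName,
    "builddate: " + startTime,
    "status: " + status,
  ];
  // Only the "end" mail carries a log, appended after a blank line.
  const body = log === undefined ? lines : lines.concat(["", log]);
  return body.join("\n");
}

// Mail sent when a build starts.
function startMail(tree, buildName, startTime) {
  return buildStatusMail(tree, buildName, startTime, "building");
}

// Mail sent when a build stops; the status is what changes the color
// on the server, and the log may be attached.
function endMail(tree, buildName, startTime, status, log) {
  return buildStatusMail(tree, buildName, startTime, status, log);
}
```

In the interim setup described below, the Buildbot server would be the one producing both messages, so the clients never talk to Tinderbox directly.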

Personally I feel that Tinderbox is the wrong way to visualize what developers actually need, but I’ll save that for a later and more productive post :) For now, suffice to say that Tinderbox server does a lot more and can handle way more load than Buildbot server.

However, Buildbot server does have some very nice qualities, like being able to see the log in real-time, and being able to stop and force builds. So, an interim solution is to have Buildbot server send email to Tinderbox server on behalf of its clients, so you get Buildbot as an administrative, developer-only interface, and Tinderbox server as the general, public interface.

The 1.8 and 1.9 nightly builders are already exposed to nightly users; there are a couple kinks to work out, so I won’t link to it right now (I’ll let the people that are actually maintaining it do that :P), but the glorious future is that developers can stop and kick builds as well as see real-time logs.

So, that’s all well and good, and I think fairly well understood. Now here’s the hairy part - the 1.8 and 1.9 nightly Buildbot clients are turning around and calling Tinderbox! WTF! (note that the unittest and moz2 buildbots do not do this, only the 1.8/1.9 nightly boxes). This is because Tinderbox client contains code to do a bunch of things:

  • mozilla-specific build process
  • performance testing
  • create updates
  • publish updates (nightly AUS only)
  • rebooting windows 9x between builds (not joking)
  • support for a bajillion products and platforms (mostly through huge “if” blocks)
  • support for hybrid depend/clobber builders
  • support for uploading to various locations on FTP
  • much, much more

Some of these features are very useful and not available elsewhere, and some are obviously not useful anymore. The error and log handling leaves a lot to be desired; it’s not something trivially fixable, unfortunately (lots of people have tried, resulting in not one but two attempted rewrites).

Getting all of the useful bits of this into Buildbot has been a real challenge, but Ben Hearsum has all of the important bits worked out for moz2. I’m hoping to spend some time packaging that up as a BuildFactory, to make it easy to reuse this code for other branches and products (mostly because I’d really like to see bug 421586 get fixed), strictly as a community member of course :)

You can read more about Buildbot process-specific factories (the linked example, which comes with Buildbot, shows what a GNU Autoconf style project could use), but suffice to say it’s a way of encapsulating the basic build process so you don’t need to copy and paste “cvs co client.mk”, “make -f client.mk MOZ_CO_PROJECT=blah” for each builder in your Buildbot master.cfg.
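The idea can be sketched without any Buildbot at all. Real Buildbot configs are Python and use Buildbot's own factory classes; the JS below is not that API, just an illustration of "write the steps once, reuse them per builder". The function name, the builder names, and the "browser"/"mail" project values are assumptions picked for the example.

```javascript
// Hypothetical helper, NOT Buildbot's API: returns the command list for
// one project so each builder does not repeat it by hand.
function mozillaBuildSteps(project) {
  return [
    "cvs co client.mk",
    "make -f client.mk MOZ_CO_PROJECT=" + project,
  ];
}

// One builder per project, all sharing the same encapsulated process.
// The project names here are example values only.
const builders = ["browser", "mail"].map(function (project) {
  return { name: project + "-nightly", steps: mozillaBuildSteps(project) };
});
```

Adding a new branch or product then means adding one entry to the list rather than pasting the whole command sequence again.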

This brings up the other big missing piece, which is that Buildbot’s awesome Source class can’t be used because it doesn’t understand that it can’t just update the whole “mozilla” CVS module, but needs to use the client.mk instead. This means that built-in clobber support and the built-in “tryserver” support can’t be used (the current Mozilla implementations have a lot of custom code).

Bug 414031 suggests a possible way to implement support for it. Although it’s kind of a pain to implement, using a driver script like this is fairly common in Java projects, so I think some kind of generic support might be feasible.

If you’re not sure what I’m talking about here and why Source can’t be used out of the box, the client.mk only does a partial checkout of the “mozilla” CVS module depending on which MOZ_CO_PROJECT is specified. Also, it can and does check out different versions of subdirectories, such as NSPR and NSS.

In other words, this is not your typical “checkout module && ./configure && make” project, although it is deceptively close in some ways :) It’d probably be better to have basic support for this flow, just based on the principle of least surprise. I think that it also has a material effect on tool support and new developers, too.

April 8th, 2008

learning strategies

Since I’m going to be doing GIS programming and working with scientific mumbo jumbo again, I’ve recently started brushing up my rusty (and in some cases non-existent) math skills. I recently stumbled somewhat serendipitously on the Steve Yegge blog rant Math for Programmers.

Warning - his posts tend to be very long. I rather enjoy them, but I see tl;dr in some of your futures :)

Anyway, a nice reminder for me is this little section:

The Right Way To Learn Math

The right way to learn math is breadth-first, not depth-first. You need to survey the space, learn the names of things, figure out what’s what.

This is absolutely true, for me anyway. I’ve started doing math every day in addition to music (time spent on both is of course dwarfed by baby-care duties, but I can carve out a few solid hours per day here and there :P).

I just spent the past week going through:

Introduction to Algorithms

I really enjoyed this (and other algorithm books, but especially this one) last time I read it (several years ago). However, I didn’t really take the time to do most of the exercises, and just skipped the math that went over my head, or that I didn’t recall readily.

This is fine for a first pass, but the difference this time is that I’m taking notes on bits I don’t understand and following those up, and committing to coming back and working through the exercises to make sure that I grok it. JS might be a fun language to use for this :)

I’m finding algorithm analysis way more interesting this time around, for some reason.

Next on my list:

Concrete Mathematics

I tried reading this when getting hung up in parts of “The Art of Computer Programming”. “Introduction to Algorithms” also refers to this book a lot. It’s pretty dense. I’m thinking of jumping back to TAoCP once I do the first pass, and returning to specific areas as-needed, as I find progress in TAoCP very rewarding.

Focusing on the breadth-first strategy is making a lot of sense to me; it’s way easier to find the solution to a problem when you know how to recognize the type of problem, as opposed to the rote method of memorizing algorithms that I’ve found in many (but not all) classrooms.

This is highly motivating stuff:

And I’ll keep getting better at this. I have lots of years left, and lots of books, and articles. Sometimes I’ll spend a whole weekend reading a math book, and sometimes I’ll go for weeks without thinking about it even once. But like any hobby, if you simply trust that it will be interesting, and that it’ll get easier with time, you can apply it as often or as little as you like and still get value out of it.

That’s exactly how I feel about music and computers (programming, server stuff, etc) in general. I don’t know why I’ve always had such a block against math; it has always felt like more work than fun (well, except Geometry. Visuals ftw!).

I think that doing simple games can be my “in” to getting along better with Physics. I can’t remember ever _wanting_ to remember trig before I threw together that little Breakout! demo :) Now I am kicking Past Rob for not paying more attention. I guess the most I can hope to do is spare Future Rob that same trauma.

April 8th, 2008

rel-o-mation slideware!

I put this set of slides together to explain what state the release automation project is in. It probably makes more sense when I am sitting there to explain what each point means, but I figured I’d put it out there anyway :)

The current setup mimics ye olde manual release process, forged by Chase. Over the past few years we’ve worked on wrapping that process in scripts with this Perl framework (aka “bootstrap”), which auto-generates configs for underlying systems like tinderbox and patcher, checks logs for errors, etc.

A lot of the current bugs come from underlying systems, especially the tinderbox client. Reducing some of the complexity here would both make the system more understandable and most likely faster as well. It’s pretty tough to make changes when you’re doing this level of wrapping, too.

Now that everything is driven by Buildbot, it probably makes the most sense to call the build system directly, instead of buildbot->bootstrap->tinderbox->build_system that we have today. There are bugs on all of this already, hopefully the slides and this post will make it clearer how they tie together.

April 4th, 2008

leaving MoCo

My last day at Mozilla Corporation will be April 14th. It’s been an incredibly awesome experience, and I’d like to thank everyone for the support, guidance and encouragement I’ve received.

I’m going to be taking some time off before starting work at CustomWeather, Inc. I won’t be working directly on any Mozilla projects, but I do hope to help out from time to time. I will be available to help with Firefox 3 for sure.

I’ve changed my bugmail address to my personal email address - robert@roberthelmer.com - feel free to contact me.

April 1st, 2008

Breakout!

I was working through some Pygame tutorials last week and thought it’d be fun to see if Canvas/JS was fast enough in Fx3 to do some simple games.

So, I spent a couple evenings last weekend and made a really dumb Sprite class, and stole some reasonable “breakout physics” from this tutorial to make this Breakout clone in JS.

The collision detection for the bricks is a little sloppy (there’s a little damage on bricks from time to time) and I haven’t done any perf work yet, but it seems to work ok in Fx3 nightlies on my MBP. Safari works ok too, just not quite as fast.

Any activity in other tabs seems to have a huge impact on performance, there’s probably a better way to do the sprite maneuvers etc. but I only had a few hours to spend on this so far. Pointers welcome :)
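For anyone curious what "breakout physics" boils down to, here is a minimal sketch. It is not the code from the demo or from the tutorial; the object shapes (x, y, w, h, dx, dy) and function names are assumptions, and it uses a plain axis-aligned bounding-box test, which is about the simplest thing that works.

```javascript
// Axis-aligned bounding-box overlap test between two rectangles.
function overlaps(a, b) {
  return a.x < b.x + b.w && a.x + a.w > b.x &&
         a.y < b.y + b.h && a.y + a.h > b.y;
}

// Advance the ball one frame: bounce off the side and top walls, and
// remove the first brick it overlaps while reversing vertical direction.
function step(ball, bricks, width) {
  ball.x += ball.dx;
  ball.y += ball.dy;
  if (ball.x < 0 || ball.x + ball.w > width) ball.dx = -ball.dx;
  if (ball.y < 0) ball.dy = -ball.dy;
  for (let i = 0; i < bricks.length; i++) {
    if (overlaps(ball, bricks[i])) {
      bricks.splice(i, 1);
      ball.dy = -ball.dy;
      break; // at most one brick per frame
    }
  }
}
```

Always flipping dy on a hit ignores which side of the brick was struck, which is one way collision handling ends up "a little sloppy"; a fuller version would compare the overlap on each axis before choosing which velocity component to reverse.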

March 19th, 2008

moving 1.8 nightlies to release machines March 5 2008

As previously announced on Tinderbox and planet, we’re migrating nightly production to running on the same machines as release production.

On the moz1.8 branch, we’ve been running the new nightlies in parallel with the “traditional” nightlies since Feb 15 2008, and are going to switch over live tomorrow.

The new machines:
* production-pacifica-vm
* production-prometheus-vm
* bm-xserve05

The old machines:
* pacifica-vm
* prometheus-vm
* bm-xserve02

Starting tomorrow, the performance machines will begin following the new machines. The new machines will publish updates and nightly builds to the usual location, and the old machines will be disabled (but kept around for a while, just in case).

If there is a reason that we should not proceed, or if you see any problems after the migration, please update bug 417147 or email build@mozilla.org.

Thanks!
Rob

March 4th, 2008

moving nightly Mozilla1.8 Firefox to release automation system

I’ve just enabled nightly builders from the release automation system on the Mozilla 1.8 tree (see bug 417147 for details).

I’ve blogged on this previously, but just to reiterate some of the reasons:

  • unify the (very fragmented) nightly and final release processes (tools, procedure, etc).
  • move away from Tinderbox client to Buildbot
  • use the same set of machines for both nightly and release

The first point is a really big one for me. Using totally different tools for nightly and release means that we don’t get much testing of our release-only procedure and tools, so we often hit unexpected bugs on release day. It also leaves nightly users without the benefits we provide for releases: automated update verification, updates for all locales, thorough error checking and monitoring of build machines, and automated staging runs before pushing changes live, for a start.

The current setup still uses Tinderbox, it’s just being invoked by Buildbot, so developers should notice no change besides new hostnames. We’re trying this out on 1.8 branch first before we tackle 1.9, so far it has been quite smooth but please let us know if you notice anything out of the ordinary. We have not switched over perf tests yet, but we expect the results to not change (although we may want to merge some graphs for developer convenience, etc). This will happen before the old machines are turned off.

We’re planning on turning off the older 1.8 builders sometime after February 25th, so please do let us know if you see any problems. I’ve left a note with the names of the new builders at the top of the Mozilla1.8 Tinderbox tree.

This is only one tiny step towards improving life both for the build&release group and also developers and nightly testers, but it’s quite significant from an infrastructure point of view, and has been brewing for a long time. I’m not sure what the next steps are going to be, but I’ve written up some thoughts on where I think we should go and why.

February 14th, 2008

volunteering

Thought-provoking post over on isabel wang’s blog:

When you ask for X hours of someone’s time to help put up pre-made signs or read off telemarketing scripts, each volunteer means no more to you than just another undifferentiated source of labor. No one is put to their highest and best use.

Regardless of what you think of the presidential candidates and the general state of American politics, I think that focusing on what the volunteer is bringing to the table just makes sense when your organization depends on volunteers.

A point that a lot of companies miss about people who volunteer their time and energy towards open-source projects like Mozilla is that volunteers are not coming to finish your to-do list for you, they are trying to make their world a better place by using you as the vehicle. They may not all do what you want or represent you the way that you’d like, but hopefully if everyone is coming from the same core set of ideals then everyone ends up richer for it.

January 21st, 2008

tinderboxJsonApi 0.1

Many people have told me that they were excited about the JSON Tinderbox feed, but were quickly discouraged from doing anything fun due to the scary data structure that it presents; it’s a straight dump of what the server uses, and is obviously optimized towards making a waterfall display (plus, it’s just plain weird).

I set up an enhanced waterfall as an example a while back, but it’s really hard to take it further without spending a lot of time digging around inside the tinderbox_data object.

I’ve often wished that I could just sort by column in Tinderbox, so instead of doing yet-another one-off script I put together a little web app that gives you a sortable table of the latest (non-talos) perf data: Analysis paralysis

Click on the headers, and you get data sorted by your criteria. The data is real-time, but does not auto-reload.
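The sorting itself is nothing fancy. As a sketch (the row shape and column names here are invented, not what the app actually uses), clicking a header amounts to something like:

```javascript
// Return a copy of rows sorted by the given column; the original array
// is left untouched so the table can be re-sorted from the same data.
function sortRows(rows, column, descending) {
  const sorted = rows.slice().sort(function (a, b) {
    if (a[column] < b[column]) return -1;
    if (a[column] > b[column]) return 1;
    return 0;
  });
  return descending ? sorted.reverse() : sorted;
}
```

A second click on the same header would simply call it again with descending flipped.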

I started to hit a wall almost immediately due to the machinations required for the tinderbox_data structure, so I stepped back and took some time to write a tboxJsonApi.js instead of dealing directly with the data from Tinderbox. This lets me write code like:

<script src="http://tinderbox.mozilla.org/Firefox/json.js"></script>
<script>
// tinderbox_data is defined by json.js, loaded above
var tree = new Tree(tinderbox_data);
var builds = tree.getBuilds();

for (var i in builds) {
  var build = builds[i];
  build.getName();
  build.getStartTime();
  build.getStatus();
}
</script>

You can get checkins for a particular build, or test results (the scrape data is processed, right now it only supports anchor tags with “key: value” format link text, which is why Talos isn’t yet supported).
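To show what "key: value" link text means in practice, here is a small sketch of that kind of parsing. It is not the code in tboxJsonApi.js; the function name and the sample strings are made up for the example.

```javascript
// Parse link text such as "Tp: 250ms" into { key, value }.
// Returns null when the text is not in "key: value" form.
function parseScrapeText(text) {
  const idx = text.indexOf(":");
  if (idx < 1) return null; // no colon, or nothing before it
  const key = text.slice(0, idx).trim();
  const value = text.slice(idx + 1).trim();
  if (key === "" || value === "") return null;
  return { key: key, value: value };
}
```

Anything that does not follow the pattern comes back as null and is skipped, which is the same reason results that are not reported as simple "key: value" links are not picked up.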

There’s a bunch more stuff I want to do before this will be generally useful to me, e.g. CSV export, merging all build, perf and test data for a checkin into one row, etc. but I think it’s obvious that we could have more useful tools for tracking and analyzing the absolute mountain of data that mozilla.org produces every day.

Let me know if you find this useful, and/or have any questions or ideas for improvements. I was able to throw this all together in a few hours this evening, because I spent so much less time wrestling with data structures and more modeling the kind of app I wanted.

January 17th, 2008

summarizing build-on-checkin feedback

Lots of feedback on the build-on-checkin idea in my blog, the newsgroup, and especially joduinn’s recent post on the subject. The primary concerns seem to be:

  • we need as many performance tests per checkin as possible

I’ve filed bug 410869 to track this. I think the way we do this now is wrong, and we’d get more performance cycles if we fixed this by separating the start time of the test from the revision that the test is for. Also, we should do a separate perf test for each checkin, not just the latest when the perf machine becomes available, to be able to track down regressions to a specific changeset.
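A toy model of the "one run per checkin" idea, assuming a single perf machine and a fixed run length (both simplifications, and all names are invented): every checkin gets its own run, and each result records the revision it tested separately from the time the run happened to start.

```javascript
// Queue one perf run per checkin on a single machine. Each run keeps
// the revision under test separate from the time the run started.
function perCheckinRuns(checkins, runLength) {
  let freeAt = 0; // time at which the perf machine next becomes idle
  return checkins.map(function (c) {
    const startedAt = Math.max(freeAt, c.time);
    freeAt = startedAt + runLength;
    return { revision: c.revision, startedAt: startedAt };
  });
}
```

With three checkins landing at times 0, 1 and 2 and a run length of 10, the third run starts at time 20 but is still attributed to the third checkin, so a regression can be pinned to a specific change even when the machine is running behind.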

  • sometimes the build breaks for non-checkin reasons, and someone needs to be hunted down to correct it if it’s build-on-checkin

I think this is mainly a fault of not having adequate monitoring, auto-recovery, and load-balancing of the server farm, and not giving the right people access to force builds directly. bhearsum is rocking the monitoring side in bug 410019 so we’ll know as soon as anything goes wrong at the machine level, and Buildbot can do the load-balancing and give developers an interface to force/clobber/stop builds as needed, without having to give everyone in the project a shell account or wait til the next checkin to pick up a CLOBBER file.

  • some people will still be stuck waiting for build cycles, this just moves the problem around

I think this is absolutely a valid concern, and the more I think about it, build-on-checkin isn’t really all that valuable until we have multiple buildslaves able to run in parallel, so no one has to wait for the current cycle to finish in order to have their checkin tested. bug 411629 has been filed to track this.
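To see why parallel slaves matter, here is a back-of-the-envelope sketch (a toy model with invented names, not how Buildbot schedules anything): each checkin goes to whichever slave frees up first, and we record how long it waited.

```javascript
// Assign each checkin to the slave that becomes free earliest and
// report how long the checkin waited before its build started.
function assignBuilds(checkins, slaveCount, buildLength) {
  const freeAt = new Array(slaveCount).fill(0);
  return checkins.map(function (c) {
    let slave = 0;
    for (let i = 1; i < slaveCount; i++) {
      if (freeAt[i] < freeAt[slave]) slave = i;
    }
    const startedAt = Math.max(freeAt[slave], c.time);
    freeAt[slave] = startedAt + buildLength;
    return { revision: c.revision, slave: slave, wait: startedAt - c.time };
  });
}
```

With checkins at times 0, 1 and 2 and a build length of 10, one slave gives waits of 0, 9 and 18, while three slaves give no waiting at all. That is the whole argument in miniature: build-on-checkin on a single slave just moves the queue around.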

  • CVS commits are not atomic, what if we pull a partial checkin?

Fortunately this goes away when we switch to hg for Moz2, but even for 1.8 and 1.9 branches we poll Bonsai (and can use the revision, aka branch+timestamp, that it contains), instead of just blindly pulling CVS. I don’t *think* that Bonsai is susceptible to this kind of thing due to the way it groups checkins before reporting them, but please correct me if this is wrong. Also, isn’t this a problem today, since Tinderbox client just blindly picks a timestamp and pulls it?

If I’ve missed or misrepresented anything, please let me know, and check out the dependency tree on bug 401936 for more information.

January 9th, 2008
