God save me from impatient installers
I’ve spent the last couple of months building an all-singing, all-dancing automatic upgrader for our customers’ sites. The process is designed to be totally hands-off - you stick the DVD in the drive and type “upgrade”, and at the end of the theatre day it will convert the main “cms” computer (one per site) and all the “cp” computers (one per projector) from Red Hat 7.3 to CentOS 3.4, upgrade our software from version 3.3 to 3.6, and magically preserve all your settings and configuration. You should come in the next day to find everything ready for the day’s schedule.
For the very first one at a customer site, though, they sent out a technician to babysit it. Unfortunately they sent a technician who’d never done or witnessed one of our many in-house test upgrades.
You probably guessed what happened - she saw the cms come up, didn’t realize that the cps start after the cms is done, and rebooted the cms at the worst possible time - right when all 18 cps were attempting PXE (network) boots and expecting the cms to be there to send them what they needed. And the cms doesn’t start dhcpd by default, so the cps had nothing to talk to all night. And of course everybody was screaming for me when I got in!
September 15th, 2005 at 21:17 GMT
“screaming for” beats “screaming at” any day.
March 7th, 2006 at 19:07 GMT
[...] One of the worst tasks I’ve had at this job is working on the automatic upgrader. I hate doing it, because it’s not so much “programming” as it’s “cobbling together a bunch of system administration stuff”. I got it working as well as I can, but there are enough flaky problems in the way Red Hat/CentOS works, as well as some dodgy Dell hardware, that I can’t make it work 100% of the time. I’ve written about it before. I get called in whenever something fails to try to forensically engineer what went wrong. Today’s fuckup was very similar to the one in that linked article - somebody started the upgrade before they went home at night, and somebody else came in the next morning and started it again. That left some things half installed and half upgraded, and some of the “cp” machines decided that they were being “plex built” (built from scratch in the manufacturing area) rather than upgraded, so they all made themselves into FRUs (field replaceable units) and shut down. Of course it took me nearly an hour to figure out what the idiots had done and how to fix it. And the upshot is that because these machines are now “bare” and physically powered down, somebody has to go out to the site and set them up. Oh, did I mention that the fuckup also caused all copies of the saved configuration for the entire site to be lost? [...]