Can’t always be the hero

I am mostly responsible for the scripts that upgrade our software from one version to the next. For the most part, it’s pretty straightforward, since we use apt-rpm to take care of upgrading rpms and making sure their dependencies are fulfilled. There is a bit of a hitch in that some of our rpms were made by people who don’t really get rpm, so the rpm just installs a tar file and the %post script unpacks the tar. Trust me, that’s more bizarre than you think, even if you think it’s pretty bizarre. Because of that and because of dependencies, we can’t just do a “dist-upgrade”, but have to upgrade our rpms one at a time.
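To give you an idea of the tar thing, the offending spec files boil down to something like this (the names here are invented, but the shape is accurate):

  %files
  # the only file rpm actually knows about is the tarball itself
  /opt/ourapp/ourapp.tar.gz

  %post
  # unpack the real payload at install time, completely behind rpm's back,
  # so "rpm -V" and "rpm -ql" are useless and an uninstall leaves all of
  # the unpacked files lying around
  cd /opt/ourapp && tar xzf ourapp.tar.gz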

But the biggest twist on the upgrade is that going from 3.3 to 3.6 of our software involves going from RedHat 7.3 to CentOS 3.4. (Don’t ask what happened to versions 3.4 and 3.5 of our software – it’s too painful.) I am not quite proud of the horrible hack I put together to do that, but it took a bunch of work to get it working so that they can just put a DVD in the drive and type /mnt/cdrom/upgrade now and it mostly works. It uses grub on the hard disk to boot the DVD with a custom kickstart file, formats all the partitions but one, installs the new CentOS, and then uses a custom finish script to reinstall our software and restore the backed up configuration off that one partition that wasn’t formatted.
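The guts of it, roughly, look like this – with the caveat that the device names, paths and partition numbers here are illustrative, not the real ones:

  # the upgrade script copies the installer kernel and initrd off the DVD...
  cp /mnt/cdrom/isolinux/vmlinuz     /boot/vmlinuz-upgrade
  cp /mnt/cdrom/isolinux/initrd.img  /boot/initrd-upgrade.img

  # ...and appends a grub entry that kickstarts off the DVD
  # (assuming (hd0,0) is /boot)
  cat >> /boot/grub/grub.conf <<'EOF'
  title Upgrade to CentOS 3.4
          root (hd0,0)
          kernel /vmlinuz-upgrade ks=cdrom:/ks.cfg
          initrd /initrd-upgrade.img
  EOF
  # (the real script also makes this the default entry before rebooting)

  # the interesting bits of ks.cfg: reuse the existing partitions, reformat
  # everything except the one the backup is stashed on
  #   clearpart --none
  #   part /        --fstype ext3 --onpart hda2
  #   part /backup  --onpart hda6 --noformat
  # plus a %post that reinstalls our rpms and restores the backed up config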

Actually, as an aside, all of the upgrades only “mostly work”. Partly that’s because we were stupid enough to put Dell computers on the customer sites, so it’s hit or miss whether they’ll even reboot after a year or two’s continuous use. But mostly it’s because rpm has an intermittent bug in the locking code that RedHat was told about at least 3 years ago, and still hasn’t fixed. Which means that sometimes apt-get fires up rpm -U, and rpm just hangs. And of course I get blamed because the upgrade doesn’t entirely work, although I think they’re starting to realize it’s not my fault.

When the upgrades don’t work, I usually get dragged in by the customer support people to log into the customer site to fix it. And usually it’s pretty straightforward – reboot the machine with the locked rpm database, and then manually step through the steps the upgrade script would have taken on that machine. Although the customer support people have recently learned that they can just re-run the whole upgrade script, because the apt-get install portion won’t do anything if it was already done.
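For the curious, the manual fix-up mostly amounts to this (the package names here are placeholders, not our real ones):

  # the hung rpm holds the database lock; after the reboot (or after killing
  # it), clear out any stale Berkeley DB lock files and rebuild the database
  rm -f /var/lib/rpm/__db.00*
  rpm --rebuilddb

  # then redo the installs the upgrade script would have done -- apt-rpm
  # skips anything that's already at the right version
  apt-get update
  apt-get install our-server our-webui our-reports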

Today I got dragged in because of a bigger problem – a customer site that is running version 6.2 of our software had a hardware crash on the main server. They sent the customer a new server, but for some stupid reason the fulfillment house sent it out with version 3.3 of our software installed on it. I guess customer support sent them a 3.3->3.6 upgrade DVD and tried to upgrade it to 3.6 remotely, and then from 3.6 to 6.2. But at that point they noticed nothing was working.

I investigated, and discovered that most of the rpms weren’t installed. Also, nothing was backed up properly in the partition that doesn’t get reformatted, so it hadn’t been restored correctly. So I decided to go back to the 3.6 version to see if I could get it working there and then go forward. Fortunately they’d left the 3.3->3.6 upgrade DVD in the drive. I ran the /mnt/cdrom/upgrade script and came back a few hours later. And sure enough, only 3 of the 8 rpms that are normally installed were installed. I tried to manually install them, but the first one failed because it had a dependency on the mozilla rpm, and the mozilla rpm was corrupt. It didn’t matter whether I tried the one in the apt repository, or on the DVD itself. It was hosed.

At this point, I gave up and said that they’d have to ship out another replacement unit with the proper version of our software installed. So much for being the hero.

But as I was leaving work, I had a few ideas on how I could have repaired that mozilla rpm and gotten it working. But I was too late. I figured even if I went back, everybody else would be gone, and besides, they’d already ordered the replacement – my reputation as a hero is ruined. Sigh.
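For what it’s worth, a corrupt rpm isn’t necessarily a dead end. Something along these lines might have worked (untested, and the path is a placeholder):

  # first find out what kind of corrupt it is: a bad digest/signature,
  # or a mangled payload
  rpm -K /path/to/mozilla-*.rpm

  # if it's only the digest or signature check that fails, rpm can be told
  # to skip those checks
  rpm -Uvh --nodigest --nosignature /path/to/mozilla-*.rpm

  # if the payload itself is damaged, rpm2cpio will often still extract most
  # of the files; the rpms that depend on mozilla could then go in with --nodeps
  cd / && rpm2cpio /path/to/mozilla-*.rpm | cpio -idmv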

I guess that answers that question.

Yesterday, I asked the musical question “Just how fucked am I?”. I woke up to the answer this morning – sometime around 3:30am my domU stopped working. Good bye web sites, good bye mailing lists, good bye picture gallery, good bye blog. Dammit.

I emailed Dave at Anexxa, and found out that when they’d moved to the new rack, they hadn’t labelled the ports on the managed power, so they couldn’t power cycle my machine. But they were going out to the facility at 1:30pm and could do it then. So I volunteered to come out with them. Probably just as well that I did.

The box power cycled fine, and I thought I’d try the upgrade from there at the terminal, thinking maybe apt turned off the network connections just before it asked a question or something. They’ve done dumber things – remind me to tell you the story about trying to upgrade Morphix from an X console some day. No dice – it got to about the same place, but hung just as hard. I suppose if I really want to do the upgrade, I’ll have to try booting to the non-Xen kernel and do it from there, but for now I’ve marked the kernel packages as “don’t upgrade” and I’ll leave it alone until I have occasion to be back at the colo.
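“Don’t upgrade”, for the curious, just means putting the kernel packages on hold, along these lines (the exact package names may be slightly off):

  # tell dpkg/apt to leave the Xen kernel image alone
  echo "linux-image-2.6.18-4-xen-686 hold" | dpkg --set-selections

  # or, the aptitude way
  aptitude hold linux-image-2.6.18-4-xen-686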

Just how fucked am I?

Unpacking linux-image-2.6.18-4-686 (from …/linux-image-2.6.18-4-686_2.6.18.dfsg.1-11_i386.deb) …
Done.

My colo box consists of xen.xcski.com, the dom0 which controls the others, and then xen1, xen2 and xen3, which are the domUs. Because it was way easier to do it this way, the dom0 is running Debian “etch” (aka “testing”), while the domUs are running Debian “sarge” (aka “stable”). The problem with using “testing” is that there are frequent updates, way more frequent than with “stable”. The problem with remote updates is that if something fucks up, there isn’t any easy way to fix it. Usually that’s not a problem.
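For reference, each domU is just a config file under /etc/xen on the dom0, more or less like this (the disk volumes and sizes here are made up):

  # /etc/xen/xen1.cfg -- a minimal sketch of one domU
  kernel  = '/boot/vmlinuz-2.6.18-4-xen-686'
  ramdisk = '/boot/initrd.img-2.6.18-4-xen-686'
  memory  = 256
  name    = 'xen1'
  vif     = [ 'bridge=xenbr0' ]
  disk    = [ 'phy:vg0/xen1-disk,hda1,w', 'phy:vg0/xen1-swap,hda2,w' ]
  root    = '/dev/hda1 ro'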

Today’s upgrades include a new Xen kernel. But it says it’s installing a new kernel, leaving the existing one there. So it shouldn’t be a problem, right? Well, I was wrong. It downloaded the upgrades, then got to the “unpacking” stage and hung. I can’t ssh to the dom0. I can’t kill the upgrade. It’s not responding to the munin probes. The only thing I can think of is doing a power cycle and maybe scheduling a site visit. But the domUs are running fine. So why would I do anything drastic while the real meat of the colo box is still going fine?

I don’t know what to do. Wait and see, I guess.

Another fun day

My colo facility contacted me on Wednesday to say that this weekend they’d be moving my machine to a new rack, and also that they’d gotten a new IP range and I had to switch over to it soon, but they’d let me keep addresses in both ranges during the switch-over.

So today my system suddenly went off the air. I was sort of expecting it, but I didn’t see any shutdown messages because they just three-finger-saluted it. After a couple of hours, I phoned for an update, and was told that they’d just powered it up. But it still wasn’t responding to pings, until I mentioned to them that eth0 and eth1 are in the opposite order from what you’d expect.

Once it came up, I tried to configure an eth0:1 using the new IP. That actually seemed to work on the dom0, so then I tried to do it on my domU. That seemed to work too. I was able to ssh into both IPs on both the dom0 and the domU. So I thought I’d swap the domU IPs around, so the new one was eth0, and the old one was eth0:1, which would make it easier to get rid of the old one when I don’t need it any more. So I changed it in /etc/network/interfaces and rebooted.
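For the record, the domU’s /etc/network/interfaces ended up looking roughly like this, with made-up addresses standing in for the real ones:

  # the new address as the primary interface...
  auto eth0
  iface eth0 inet static
          address 203.0.113.10
          netmask 255.255.255.0
          gateway 203.0.113.1

  # ...and the old address demoted to an alias until DNS catches up
  auto eth0:1
  iface eth0:1 inet static
          address 198.51.100.10
          netmask 255.255.255.0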

But then suddenly things started going pear-shaped. The domU was refusing to boot with an error about being unable to find /dev/hda1. On the dom0, “ifconfig” would just hang. And then it stopped responding at all. Now I was in full panic mode. I called Annexa and Dave called the guys who were doing the rack move and convinced them to go back to the facility. I met them there, and found that my poor box wasn’t even responding on the KVM. We power cycled it, and found that it wasn’t starting the domUs, and also that while it started up eth0 and eth0:1, it didn’t start the virtual bridge interfaces (peth0, vif0.0, vif7.0, vif8.0, vif9.0, xenbr0). That’s not good. It appears that Xen doesn’t like the extra interface or something. So I got rid of eth0:1, changed eth0 to the new IP, and rebooted. This time, it started up and so did the domUs.

I was still having a bit of a problem with my personal domU – it didn’t want to resolve. Evidently somewhere along the way I’d decided to remove this program “resolvconf” that is supposed to maintain your name resolution for you, and when I did, it had replaced my resolv.conf with one that looked like it had been copied from my home machine. So I fixed that and things sort of worked, but in spite of the fact that I had the old IP on eth0:1, it wasn’t answering on it.
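Fixing it just meant putting sensible resolvers back into /etc/resolv.conf by hand, i.e. something like this (the addresses are made up):

  # /etc/resolv.conf on the domU
  search xcski.com
  nameserver 203.0.113.53
  nameserver 203.0.113.54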

So it looks like I’m up and running, but I can’t use the old IPs. So you’re not going to see this until your DNS cache updates and you see the updates I made over at zoneedit.com.