Yesterday, I asked the musical question “Just how fucked am I?”. I woke up to the answer this morning – sometime around 3:30am my domU stopped working. Good bye web sites, good bye mailing lists, good bye picture gallery, good bye blog. Dammit.
I emailed Dave at Anexxa, and found out that when they’d moved to the new rack, they hadn’t labelled the ports on the managed power, so they couldn’t power cycle my machine. But they were going out to the facility at 1:30pm and could do it then. So I volunteered to come out with them. Probably just as well that I did.
The box power cycled fine, and I thought I’d try the upgrade from there at the terminal, thinking maybe apt turned off the network connections just before it asked a question or something. They’ve done dumber things – remind me to tell you the story about trying to upgrade Morphix from an X console some day. No dice – it got to about the same place, but hung just as hard. I suppose if I really want to do the upgrade, I’ll have to try booting to the non-Xen kernel and do it from there, but for now I’ve marked the kernel packages as “don’t upgrade” and I’ll leave it alone until I have occasion to be back at the colo.
Unpacking linux-image-2.6.18-4-686 (from …/linux-image-2.6.18-4-686_2.6.18.dfsg.1-11_i386.deb) …
My colo box consists of xen.xcski.com, the dom0 which controls the others, and then xen1, xen2 and xen3 which are the domUs. Because it was way easier to do it this way, the dom0 is running Debian “etch” (aka “testing”), while the domUs are running Debian “sarge” (aka “stable”). The problem with using “testing” is that there are frequent updates, way more frequent than with “stable”. The problem with remote updates is that if something fucks up, there isn’t any easy way to fix it. Usually that’s not a problem.
Today’s upgrades include a new xen kernel. But it says it’s installing a new kernel, leaving the existing one there. So it shouldn’t be a problem, right? Well, I was wrong. It downloaded the upgrades, then got to the “unpacking” stage and hung. I can’t ssh to the dom0. I can’t kill the upgrade. It’s not responding to the munin probes. The only thing I can think of is doing a power cycle and maybe scheduling a site visit. But the domUs are running fine. So why would I do anything drastic while the real meat of the colo box is still going fine?
I don’t know what to do. Wait and see, I guess.