No such thing as a smooth upgrade.

My colo box has started exhibiting this strange behaviour:

  1. My “guest” (aka domU) OS will stop talking to the network. I can still log into it by going to the “host” (dom0) OS and issuing the xm console xen1 command.
  2. The guest still thinks it’s connected to the network. ifdown eth0; ifup eth0 doesn’t accomplish anything.
  3. If I reboot the guest, whether with shutdown -r now "", reboot, or, from the host, xm shutdown xen1; xm create xen1.cfg, it doesn’t come back up. xm gives an error about being unable to reserve enough memory.
  4. If I reboot the host, it doesn’t come back, and I have to either go into the colo or put in a trouble ticket, wait a few hours and then phone them up to ask why they’re ignoring my trouble ticket. They always respond that they’re really swamped right now and they must have missed it in the rush. When I go in, they’re always bored out of their minds and playing games. Oh, and good fucking luck finding a phone number anywhere on their web site. I only found one because I had it in my phone from before they were taken over by Earthlink.

When it was happening every 4 or 5 months, I wasn’t worried. When it happened twice in one month, I got worried. When it happened again 3 days after that “twice in one month”, I got really worried.

Thinking that this might be a Xen problem, I decided to upgrade the host OS from Debian 6 to Debian 7. Mostly, it worked just fine except for two “small” problems:

  1. I couldn’t figure out how to make it boot the Xen stuff automatically and
  2. When I manually booted the Xen stuff, the network wouldn’t come up

The first problem is due to the way they re-arranged the Grub menu: all the Xen stuff is now under a submenu. The recommendation I found was to use dpkg-divert --divert /etc/grub.d/08_linux_xen --rename /etc/grub.d/20_linux_xen to put the Xen stuff ahead of the non-Xen stuff in the Grub menu. That seems like a cheesy hack, but I’ll take it for now.
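For the record, here is roughly how that hack looks as a script. This is a sketch, not gospel: the dpkg-divert line is the one quoted above, the update-grub step is my assumption about what is needed to regenerate the menu afterwards, and the run helper and DRY_RUN variable are made up for illustration so that nothing touches the boot config until you ask it to.

```shell
#!/bin/sh
# Dry-run by default: only prints the commands.
# Run as root with DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command, and run it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Rename 20_linux_xen to 08_linux_xen so the Xen entries are generated
# before the plain Linux entries (10_linux) in the Grub menu.
run dpkg-divert --divert /etc/grub.d/08_linux_xen --rename /etc/grub.d/20_linux_xen

# Assumption: the menu has to be regenerated for the new order to take effect.
run update-grub
```

Given how painful a failed reboot is on a box in a colo, printing the commands first and reading them over before running with DRY_RUN=0 seems like cheap insurance.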

The second problem appears to be caused by changes in the way Xen does bridging. Evidently it brings up eth0 before /etc/network/interfaces does, or something like that, and everybody gets all confused. The extremely dubious hack I found online to fix that is to add a pre-up ip addr del xx.xxx.xxx.xxx/255.255.244.0 dev eth0 || true to the definition of eth0 in /etc/network/interfaces. I suspect a better long-term answer will be to figure out how to set up the proper bridging for the Xen stuff.
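Here is a sketch of what that eth0 definition might look like with the hack in place. Everything specific in it is an assumption: I am guessing at a static configuration, and 192.0.2.10, the /24 netmask, and the 192.0.2.1 gateway are documentation placeholders, not the real addresses.

```
# /etc/network/interfaces (excerpt)
# All addresses below are placeholders; substitute the real ones.
auto eth0
iface eth0 inet static
# Remove any address already sitting on eth0 before ifup configures it.
# "|| true" lets ifup carry on when there is nothing to delete.
    pre-up ip addr del 192.0.2.10/24 dev eth0 || true
    address 192.0.2.10
    netmask 255.255.255.0
    gateway 192.0.2.1
```

The address given to ip addr del has to match whatever actually ended up on eth0, otherwise the delete quietly fails and the || true hides it.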

Now that it’s all hacked together and working, fingers crossed that it actually reduces the freeze-up problem. Meanwhile, all the guest OSes are still running a 2.6.32-5-xen-amd64 kernel and I’d like to switch to a 3.2.0-4-amd64 kernel. Hopefully I can do that without another long night of standing in a hot colo facility.