The continuing saga of replacing disks

Well, after last Monday, I realized that I wasn’t backing up all the files I’d need on the new system, so I had to change my nightly backup. Unfortunately that caused the nightly backup to take so long that it interfered with the next night’s backup, so when I got home from 2 days in Ithaca I discovered my nightly backups were kind of screwed up. I now have that mostly fixed up.

I copied the latest backups over to the new server and started up the three guest OSes, only to find that one doesn’t start right, because it appears to be missing scripts from /lib/init/. I copied them over from another guest and it seems to be starting, as well as I can tell.

I also recently discovered the “-H” option to rsync, which preserves hard links. This might be useful, although it seems to conflict with “--link-dest”.
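For reference, here is a rough sketch of how the two options would sit together in one nightly snapshot command. Both -H and --link-dest are real rsync options; the /backup/xen1 layout with one directory per day is a made-up example, and how well the two options get along may depend on the rsync version, so this only prints the command for review rather than running it.

```shell
#!/bin/sh
# Hypothetical layout: one snapshot directory per night under /backup/xen1,
# with each run hard-linking unchanged files to the previous night's copy.

# Print the rsync command for one snapshot run instead of executing it,
# so the options can be checked before anything touches the disks.
snapshot_cmd() {
    src=$1      # directory to back up
    prev=$2     # previous snapshot, used as the --link-dest reference
    dest=$3     # new snapshot directory
    echo "rsync -aH --numeric-ids --delete --link-dest=$prev $src/ $dest/"
}

snapshot_cmd /mnt/xen1 /backup/xen1/Sat /backup/xen1/Sun
```

Trying the printed command on a scratch copy first seems the safest way to find out whether the conflict is real on a given rsync version.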

Current plan is to make another backup of the three guests, and rsync that over. Then shut down the three guests, and rsync once more to pick up the final changes. Then whip out the disks and go running over to the colo facility and try to get them in. Depending on how long the first step takes, I might wait until tomorrow.

That’s better!

On my test system, with the two new disks and with the three domU guest systems set up and running (but not connected to the outside world, so the load is obviously lower), I run tiobench in the dom0. Executive summary:
read/write rates around 5 times faster, and no awful latency. The worst latency on the Caviar disks was 121588.99 ms, and on these disks it’s 6077.56 ms.
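To put the two worst-case latency figures side by side, a quick bit of arithmetic helps. This uses only the two numbers quoted above; the variable names are just for illustration.

```shell
# Worst-case latencies quoted above, in milliseconds.
old_ms=121588.99   # Caviar disks
new_ms=6077.56     # new disks

# awk does the floating-point division that plain sh arithmetic cannot.
ratio=$(awk -v worst_old="$old_ms" -v worst_new="$new_ms" \
    'BEGIN { printf "%.1f", worst_old / worst_new }')
echo "worst-case latency is about $ratio times lower"
```

That works out to roughly a factor of 20 on worst-case latency, on top of the roughly 5x throughput gain.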

I can’t wait to get these disks over to the colo. If all goes well, it will just be a simple matter of shutting down the domUs, making a final copy from the old disks to the new disks, and then swapping the disks.

Full details below the cut.

New disks set-up

Having established that there is *something* making the disks run slowly on my colo box, I am resolved to fix it. One of my xen “tenants” generously donated two new Hitachi Deathstar^WDeskstar disks. In order to save some downtime, I’m setting up the new disks on the server that I replaced because I thought it was causing hardware problems (but which may or may not have been due to the crappy disks I was using). Setting up essentially a new server means I also have a chance to try out Debian 6, which became the “Stable” release a few weeks ago but which I haven’t had the nerve to upgrade the colo box to.

Fortunately, I have my previous post on Another try at setting up the new server to act as a checklist.

Day 1

  • Downloaded and burned the Debian 6 NetInst disk for AMD64.
  • Installed the new disks in the old box and booted from the NetInst disk
  • Just in case they fixed the problems with lvm and software RAID and grub not playing nice together, tried installing as a two disk software RAID-1 with LVM on top of that
  • Installed with 4GB root partition and 2GB swap on LVM
  • One of the install options was “SSH Server”, and so I chose that one
  • Success! It boots with Grub with that configuration.
  • Discovered that ssh installation dragged in xauth and a bunch of X11 libraries, so removed those.
  • Installed smartmontools and enabled them in /etc/default/smartmontools
  • Installed xen-utils and kernel and all the stuff that drags in.
  • Rebooted and discovered to my relief that it boots the xen kernel.
  • Installed rsync for backups.
  • Installed munin-node and munin-plugins-extra.
  • Installed vim and removed vim-tiny.
  • xm list isn’t working. Tried to manually start xend and got a screen full of errors. Tried to start it with the /etc/init.d/xend start and nothing happened.
  • Discovered it’s not starting xend because Grub is booting the xen kernel without the Hypervisor. If I choose the correct entry off the grub list, I get it. Now to figure out how to change the boot order in this new version of Grub.
  • Took my backup disk and added it to the third drive sled so I’ll have SATA speeds when I restore from it.
  • Edited /etc/default/grub and changed the GRUB_DEFAULT value to 4 (remember they’re numbered from 0) and then ran update-grub.
  • Copy ssh configuration in /etc/ssh/ and ~root/.ssh from backup.
  • Copy munin-node configuration in /etc/munin/
  • Uninstalled exim4 and installed postfix because I know how to configure postfix.
  • Copy postfix configuration from backup.
  • Oops. Need the hostname configuration to match the hostname in postfix.
  • Create lvm volumes with lvcreate -L 150G -n xen1-disk xen-disk.
  • Create file systems on them with mkfs.ext3 /dev/xen-disk/xen1-disk.
  • Create swap with mkswap /dev/xen-disk/xen1-swap.
  • Installed ntp
  • Copy backups with rsync -aSurvx --numeric-ids --delete /mnt/sdc1/mp3s/ /mnt/mp3s/.
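The volume-creation steps in the list above repeat for each guest, so they can be sketched as a small loop. The volume group name xen-disk, the 150G size and the xen1 names come from the commands above; the guest names xen2 and xen3, the idea that every guest gets the same sizes, and the 2G swap size are assumptions for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set RUN=1 (as root) to actually execute them.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

VG=xen-disk          # volume group name used above
DISK_SIZE=150G       # matches the lvcreate example above
SWAP_SIZE=2G         # assumption: the per-guest swap size is not stated

# Create and format the disk and swap volumes for one guest.
make_guest_volumes() {
    guest=$1
    run lvcreate -L "$DISK_SIZE" -n "${guest}-disk" "$VG"
    run lvcreate -L "$SWAP_SIZE" -n "${guest}-swap" "$VG"
    run mkfs.ext3 "/dev/$VG/${guest}-disk"
    run mkswap "/dev/$VG/${guest}-swap"
}

# xen1 appears above; xen2 and xen3 are assumed names for the other guests.
for guest in xen1 xen2 xen3; do
    make_guest_volumes "$guest"
done
```

Printing first makes it easy to spot a wrong size or name before anything gets formatted.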

And at this point, while restoring the data from the backup to the disks, it started throwing SMART errors. Which at least vindicates our purchase of new hardware to replace this box. I was starting to worry that the problems we’d seen on this hardware were entirely due to the same disk problems we were seeing on the new hardware.

Continuing on:

  • Reformat the partitions with mkfs.ext3 -c.
  • Still get the error on restoring the backup.
  • Deleted the lv that was causing the problems, and tried creating a bunch of smaller ones.
  • Made file systems on the smaller (50G) lvs and rsynced about 45GB of data onto each one. Didn’t get any errors, so wondered if the errors were coming from the source disk.
  • Did a tar cvfz /dev/null . of the backup that was throwing the errors. That didn’t give any errors either.
  • Removed the “junk” lvs and created the big one again. Did a mkfs.ext3 -c on it.
  • rsyncing the data over got the error on the same file again. And this time I’m almost sure it’s the backup disk, not the destination.
  • Tried to copy the offending file to /tmp, and got the same error. So yes, it’s the backup disk.
  • At this point, I have enough of the system restored that it’s painless to do the rest of the rsyncs from last night’s backups on my home server. So that’s what I’m doing. I’ve done rsync -aSurvx --numeric-ids --delete xen1/Sun 192.168.1.119:/mnt/xen1 and it transferred about 10 files and deleted a couple of postgresql log files.
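One caveat on the tar test in the list above: GNU tar is documented to minimise I/O when the archive is literally /dev/null, so it may not have read the file contents at all, which could be why it ran clean. A more direct check is to read every file end to end and list the ones that fail. The function name below is made up for this sketch.

```shell
#!/bin/sh
# Print the path of every regular file under a directory that cannot be
# read from start to finish. A failing disk shows up as a read error here.
find_unreadable() {
    find "$1" -xdev -type f -exec sh -c '
        for f in "$@"; do
            cat "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Usage, with the backup disk mounted where it was in the list above:
#   find_unreadable /mnt/sdc1
```

This is slow on a big disk since it reads everything, but it points at the offending files instead of just the first one rsync trips over.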

With that all done, it was time to get serious about setting up xen and running the domUs.

  • Copied the domU configuration files from backup to /etc/xen.
  • Modified them for the new kernel version (hey, is this the version with no global locks? That could be a huge win). Copied the appropriate /lib/modules/ into each of the domU directories
  • Tried to start a domU. It complained about being unable to start the network. Copied a line out of the backup of /etc/xen/xend-config.sxp to the new one.
  • Tried to start a domU. Ran out of memory.
  • Remembered that the live site has 8GB but this only has 4GB, so reduced the size of the memory allocated to each domU.
  • Tried to start a domU. It gave a bunch of errors about being unable to start the raid and the lvm. Thought about it for a while, and realized that since I’m specifying an initrd in the config file, and that initrd is the one I use to start the host OS, it thinks it needs to start a raid and lvm in order to mount any disks. Oh oh.
  • In desperation, installed xen-tools to see what it did when it created a configuration file. It used the same kernel and initrd as I had, but instead of calling the virtual disks “hda1” etc, it called them “xvda1”.
  • Modified all my xen configuration files and fstabs and was able to bring up all three domUs.
  • When I attempted to reboot, the computer threw a bunch of errors and locked up. It appears that it was trying to save the xen configuration in /var/lib/xen/save. I’ve seen that before. So I modified /etc/default/xendomains to change the XENDOMAINS_SAVE variable to prevent it from saving. Now it’s shutting down correctly.
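The hda-to-xvda rename from the list above has to be made in every domU config file and every guest fstab, so a small filter saves some typing. This is only a sketch: the config lines in the example are hypothetical (the real files are not shown here), and the function name is made up.

```shell
#!/bin/sh
# Rewrite hdaN device names to xvdaN in a domU config or fstab read from
# stdin. Only names followed by a partition number are touched.
to_xvda() {
    sed -e 's#,hda\([0-9]\)#,xvda\1#g' \
        -e 's#/dev/hda\([0-9]\)#/dev/xvda\1#g'
}

# Hypothetical config lines, for illustration only.
to_xvda <<'EOF'
disk = ['phy:/dev/xen-disk/xen1-disk,hda1,w', 'phy:/dev/xen-disk/xen1-swap,hda2,w']
root = '/dev/hda1 ro'
EOF
```

Writing the result to a new file and diffing it against the original before replacing anything keeps the change easy to back out.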

Good things

Today, Vicki and I went to Beers of the World to spend her Groupon. I got a variety of different beers, and I’m going to be mentioning them here just so I can look them up for next time.

Today’s beer was “Uerige”. I bought it because I was trying to remember a beer a German co-worker brought me from Germany back in the early 1980s when I was working at the Ministry of Transportation and Communications. It was a really nice beer, and I would probably buy it again. But it turns out that what I was remembering was Lohrer Urtyp 1878 beer. And it’s not listed on the Beers of the World web site. Oh well, maybe they’ll have that some day.

The other good thing was the movie “Battle: Los Angeles”. A few months ago, we went to see “Skyline”, and actually to tell you the truth, I’d originally thought “Skyline” was the movie I’d seen the really cool trailer for, but it turns out that “Battle: Los Angeles” (BLA) was that movie. “Skyline” suuuuuuuucked. It was just awful. I kept thinking “these people are horrible and unlikable, why don’t you show us what the Marines are doing?” BLA was the movie I was hoping for in Skyline. It was all about the Marines. And it wasn’t just better than Skyline, it was good. Yeah, it was escapist, yeah it was cliched in places, yeah you spend the first 20 minutes trying to play “match the noble death to the flawed but fundamentally good character”, yeah the scientific principle that aliens would invade for our water was laughable, but it was still engaging and fun.

Mild disappointment

Bought two new hard drives to add to my Linux box. Could only find one of the two SATA cables that I thought I had, so I went to FrozenCPU.com today to pick up some new ones. Got home, opened up the computer and found the missing SATA cable, and also discovered that there is only one power connector free. So tomorrow I’ll have to stop by FrozenCPU.com again to buy an adaptor. Fortunately they’re in East Rochester, and so is my physiotherapist, so it won’t be a wasted trip. But it does mean another day of failed backup jobs because I don’t have the extra disk space.