Continuing Saga, last (I hope) episode

Continued from here. I got the disks out to the colo facility and swapped them in. At first, things didn’t come up right because the machine had no network. Funny, because I’d remembered to set /etc/network/interfaces and /etc/resolv.conf back to the values they need at the colo facility before I shut down at home. Grepping through the dmesg output showed that for some reason eth0 had been renamed to eth2 and eth1 to eth3. Something tickled my memory about the last time I’d been through this: udev remembers the MAC addresses of the machine you set the system up on, so when it boots on the new machine it thinks “aha, I already know where eth0 and eth1 are, so these new MAC addresses need to be mapped somewhere else”. Unfortunately my own blog was down, so I couldn’t find what I’d written about this before, but after a quick google on my iPhone I removed /etc/udev/rules.d/70-persistent-net.rules, rebooted, and it all came up.
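For anyone else who hits this: the stale rules file pins interface names to the old machine’s MAC addresses, so the new machine’s NICs get shoved to eth2/eth3. It looks something like this (the MACs here are made up):

    # /etc/udev/rules.d/70-persistent-net.rules (illustrative entries)
    # These MACs belong to the old machine, so the new NICs can't claim eth0/eth1.
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:11:22:33:44:55", KERNEL=="eth*", NAME="eth0"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:11:22:33:44:66", KERNEL=="eth*", NAME="eth1"

Deleting the file and rebooting lets udev regenerate it with the MACs it actually finds:

    rm /etc/udev/rules.d/70-persistent-net.rules
    reboot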

Made sure it was talking to the net and that I could ssh into it from home, then started up the guest domains. Made sure they were up and talking to the net as well. Made sure one of my web sites showed up on my iPhone. Buttoned up and went home.

Once I got home, I made some further checks that everything was up. As far as I can tell, it is. Now to run tiobench on the updated system.
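For the record, the invocation is along these lines (a sketch from memory; the file size should comfortably exceed RAM so the page cache doesn’t flatter the numbers):

    # run against a scratch directory on the array under test:
    # 4GB of file data, 4 threads, best-of-3 runs
    tiobench --dir /mnt/scratch --size 4096 --threads 4 --numruns 3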

Let’s have a look at some of these compared to the results I got with the Caviar Green disks running on the same hardware.

Test                         Old             New
Sequential Read best rate    44.49 MB/s      168.26 MB/s
Random Read best rate        0.39 MB/s       1.37 MB/s
Read Max Latency             1036.44 ms      743.22 ms
Sequential Write best rate   10.36 MB/s      82.46 MB/s
Random Write best rate       0.09 MB/s       1.76 MB/s
Write Max Latency            143896.67 ms    1748.55 ms

The huge drop in write latency will be the biggest difference. I don’t know whether the old numbers were caused by the WD Caviar Green “spin down” or by disk errors, but either way, it’s going to be a relief to see some performance again.

Raw results after the cut.

The continuing saga of replacing disks

Well, after last Monday I realized that I wasn’t backing up all the files I’d need on the new system, so I had to change my nightly backup. Unfortunately, the changed backup took so long that it ran into the next night’s run, so when I got home from two days in Ithaca I discovered my nightly backups were kind of screwed up. I now have that mostly fixed up.

I copied the latest backups over to the new server and started up the three guest OSes, only to find that one doesn’t start right, because it appears to be missing scripts from /lib/init/. I copied them over from another guest and it seems to be starting, as far as I can tell.
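The fix itself was nothing fancy. With both guest file systems mounted in the dom0 (mount points here are illustrative), it was essentially:

    # copy the init scripts from a healthy guest into the broken one,
    # preserving permissions and ownership
    rsync -a /mnt/xen2/lib/init/ /mnt/xen1/lib/init/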

I also recently discovered the -H option to rsync, which preserves hard links. This might be useful, although it seems to conflict with --link-dest.
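For the curious, here is roughly how the two options get used (paths and day names are illustrative):

    # -H: preserve hard links among the files being transferred
    rsync -aH --numeric-ids --delete /mnt/xen1/ /backup/xen1/Sun/

    # --link-dest: snapshot-style backup that hard-links files unchanged
    # since the previous night's copy instead of storing them again
    rsync -a --numeric-ids --delete --link-dest=/backup/xen1/Sat \
          /mnt/xen1/ /backup/xen1/Sun/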

Current plan is to make another backup of the three guests and rsync it over; then shut down the three guests and rsync the final changes over; then whip out the disks, run over to the colo facility, and try to get them in. Depending on how long the first step takes, I might wait until tomorrow.

That’s better!

On my test system, with the two new disks and with the three domU guest systems set up and running (but not connected to the outside world, so the load is obviously lower), I ran tiobench in the dom0. Executive summary: read/write rates around 5 times faster, and none of that awful latency. The worst latency on the Caviar disks was 121588.99 ms; on these disks it’s 6077.56 ms.

I can’t wait to get these disks over to the colo. If all goes well, it will just be a simple matter of shutting down the domUs, making a final copy from the old disks to the new disks, and then swapping the disks.

Full details below the cut.

New disks set-up

Having established that there is *something* making the disks run slowly on my colo box, I am resolved to fix it. One of my xen “tenants” generously donated two new Hitachi Deathstar^WDeskstar disks. To save some downtime, I’m setting up the new disks on the server I replaced because I thought it was causing hardware problems (problems which may or may not have been due to the crappy disks I was using). Setting up what is essentially a new server also gives me a chance to try out Debian 6, which became the “Stable” release a few weeks ago but which I haven’t had the nerve to upgrade the colo box to.

Fortunately, I have my previous post on Another try at setting up the new server to act as a checklist.

Day 1

  • Downloaded and burned the Debian 6 NetInst disk for AMD64.
  • Installed the new disks in the old box and booted from the NetInst disk
  • Just in case they’d fixed the problems with lvm, software RAID, and grub not playing nice together, tried installing as a two-disk software RAID-1 with LVM on top of that.
  • Installed with a 4GB root partition and 2GB of swap on LVM.
  • One of the install options was “SSH Server”, so I chose that one.
  • Success! It booted with Grub in that configuration.
  • Discovered that ssh installation dragged in xauth and a bunch of X11 libraries, so removed those.
  • Installed smartmontools and enabled them in /etc/default/smartmontools.
  • Installed xen-utils and the xen kernel and all the stuff those drag in.
  • Rebooted and discovered to my relief that it boots the xen kernel.
  • Installed rsync for backups.
  • Installed munin-node and munin-plugins-extra.
  • Installed vim and removed vim-tiny.
  • xm list isn’t working. Tried to manually start xend and got a screenful of errors. Tried to start it with /etc/init.d/xend start and nothing happened.
  • Discovered it’s not starting xend because Grub is booting the xen kernel without the hypervisor. If I choose the correct entry off the grub menu, I get it. Now to figure out how to change the boot order in this new version of Grub.
  • Took my backup disk and added it to the third drive sled so I’ll have SATA speeds when I restore from it.
  • Edited /etc/default/grub, changed the GRUB_DEFAULT value to 4 (remember they’re numbered from 0), and ran update-grub (spelled out in the sketch after this list).
  • Copied ssh configuration in /etc/ssh/ and ~root/.ssh from backup.
  • Copied munin-node configuration in /etc/munin/.
  • Uninstalled exim4 and installed postfix, because I know how to configure postfix.
  • Copied postfix configuration from backup.
  • Oops. Needed the hostname configuration to match the hostname in postfix.
  • Created lvm volumes with lvcreate -L 150G -n xen1-disk xen-disk.
  • Created file systems on them with mkfs.ext3 /dev/xen-disk/xen1-disk.
  • Created swap with mkswap /dev/xen-disk/xen1-swap.
  • Installed ntp.
  • Copied backups with rsync -aSurvx --numeric-ids --delete /mnt/sdc1/mp3s/ /mnt/mp3s/.
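The grub change from a few items up, spelled out. In this version of Grub you don’t edit menu.lst; you set the default in /etc/default/grub and regenerate the real config (entry 4 is just what matched the hypervisor entry on my menu):

    # /etc/default/grub
    GRUB_DEFAULT=4

    # then rebuild /boot/grub/grub.cfg from it
    update-grub

Grub 2 will also accept the menu entry’s title string as the value, which survives the menu being reordered by kernel updates.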

And at this point, while restoring the data from the backup to the disks, it started throwing SMART errors. Which at least vindicates our purchase of new hardware to replace this box. I was starting to worry that the problems we’d seen on this hardware were entirely due to the same disk problems we were seeing on the new hardware.
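Since smartmontools is installed, it’s worth asking the drive directly what it thinks. A quick check looks something like this (substitute the right device, of course):

    # overall health verdict, then the drive's logged errors
    smartctl -H /dev/sda
    smartctl -l error /dev/sda

    # kick off a short self-test and read the results a few minutes later
    smartctl -t short /dev/sda
    smartctl -l selftest /dev/sda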

Continuing on:

  • Reformatted the partitions with mkfs.ext3 -c.
  • Still got the error on restoring the backup.
  • Deleted the lv that was causing the problems, and tried creating a bunch of smaller ones.
  • Made file systems on the smaller (50GB) lvs and rsynced about 45GB of data onto each one. Didn’t get any errors, so I wondered if the errors were coming from the source disk.
  • Did a tar cvfz /dev/null . of the backup that was throwing the errors. That didn’t give any errors either.
  • Removed the “junk” lvs and created the big one again. Did a mkfs.ext3 -c on it.
  • rsyncing the data over got the error on the same file again. And this time I’m almost sure it’s the backup disk, not the destination.
  • Tried to copy the offending file to /tmp, and got the same error. So yes, it’s the backup disk (see the sketch after this list).
  • At this point, I have enough of the system restored that it’s painless to do the rest of the rsyncs from last night’s backups on my home server. So that’s what I’m doing. I’ve done rsync -aSurvx --numeric-ids --delete xen1/Sun 192.168.1.119:/mnt/xen1 and it transferred about 10 files and deleted a couple of postgresql log files.
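In hindsight, rather than finding bad sectors by tripping over them during rsync after rsync, a read-only surface scan of the suspect disk would have mapped them all out in one pass. A sketch, assuming the backup disk shows up as /dev/sdc:

    # read-only test, safe to run on a disk with data on it;
    # -s shows progress, -v reports each bad block as it's found
    badblocks -sv /dev/sdc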

With that all done, it was time to get serious about setting up xen and running the domUs.

  • Copied the domU configuration files from backup to /etc/xen.
  • Modified them for the new kernel version (hey, is this the version with no global locks? That could be a huge win). Copied the appropriate /lib/modules/ into each of the domU file systems.
  • Tried to start a domU. It complained about being unable to start the network. Copied a line out of the backup of /etc/xen/xend-config.sxp to the new one.
  • Tried to start a domU. Ran out of memory.
  • Remembered that the live site has 8Gb but this only has 4Gb, so reduced the size of the memory allocated to each domU.
  • Tried to start a domU. It gave a bunch of errors about being unable to start the raid and the lvm. Thought about it for a while, and realized that since I’m specifying an initrd in the config file, and that initrd is the one I use to start the host OS, it thinks it needs to start a raid and lvm in order to mount any disks. Oh oh.
  • In desperation, installed xen-tools to see what it did when it created a configuration file. It used the same kernel and initrd as I had, but instead of calling the virtual disks “hda1” etc, it called them “xvda1”.
  • Modified all my xen configuration files and the guests’ fstabs accordingly, and was able to bring up all three domUs (see the sketch after this list).
  • When I attempted to reboot, the computer threw a bunch of errors and locked up. It appears that it was trying to save the xen configuration in /var/lib/xen/save. I’ve seen that before. So I modified /etc/default/xendomains to change the XENDOMAINS_SAVE variable to prevent it from saving. Now it’s shutting down correctly.
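For reference, the working configuration ended up looking something like this (kernel version and volume names are illustrative; the point is xvda instead of hda):

    # /etc/xen/xen1.cfg (relevant bits only)
    kernel  = '/boot/vmlinuz-2.6.32-5-xen-amd64'
    ramdisk = '/boot/initrd.img-2.6.32-5-xen-amd64'
    memory  = 1024
    name    = 'xen1'
    disk    = ['phy:/dev/xen-disk/xen1-disk,xvda1,w',
               'phy:/dev/xen-disk/xen1-swap,xvda2,w']

The guest’s /etc/fstab has to say /dev/xvda1 and /dev/xvda2 to match. As for the shutdown hang, the usual form of the XENDOMAINS_SAVE change is to set it to the empty string in /etc/default/xendomains, which makes the init script shut the domains down instead of trying to save them.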