March 26, 2011 – Rants and Revelations

Having established that there is *something* making the disks run slowly on my colo box, I am resolved to fix it. One of my xen “tenants” generously donated two new Hitachi Deathstar^WDeskstar disks. In order to save some downtime, I’m setting up the new disks on the server that I replaced because I thought it was causing hardware problems (but which may or may not have been due to the crappy disks I was using). Setting up essentially a new server means I also have a chance to try out Debian 6, which became the “Stable” release a few weeks ago but which I haven’t had the nerve to upgrade the colo box to.

Fortunately, I have my previous post, “Another try at setting up the new server”, to act as a checklist.

Day 1

Downloaded and burned the Debian 6 NetInst disk for AMD64.
Installed the new disks in the old box and booted from the NetInst disk.
Just in case they fixed the problems with LVM, software RAID and Grub not playing nice together, tried installing as a two-disk software RAID-1 with LVM on top of that.
Installed with a 4GB root partition and 2GB swap on LVM.
One of the install options was “SSH Server”, so I chose that one.
Success! It boots with Grub in that configuration.
Discovered that ssh installation dragged in xauth and a bunch of X11 libraries, so removed those.
Installed smartmontools and enabled them in /etc/default/smartmontools.
Installed xen-utils and kernel and all the stuff that drags in.
Rebooted and discovered to my relief that it boots the xen kernel.
Installed rsync for backups.
Installed munin-node and munin-plugins-extra.
Installed vim and removed vim-tiny.
xm list isn’t working. Tried to manually start xend and got a screen full of errors. Tried to start it with /etc/init.d/xend start and nothing happened.
Discovered it’s not starting xend because Grub is booting the xen kernel without the Hypervisor. If I choose the correct entry off the grub list, I get it. Now to figure out how to change the boot order in this new version of Grub.
Took my backup disk and added it to the third drive sled so I’ll have SATA speeds when I restore from it.
Edited /etc/default/grub and changed the GRUB_DEFAULT value to 4 (remember they’re numbered from 0) and then ran update-grub.
Copy ssh configuration in /etc/ssh/ and ~root/.ssh from backup.
Copy munin-node configuration in /etc/munin/ from backup.
Uninstalled exim4 and installed postfix because I know how to configure postfix.
Copy postfix configuration from backup.
Oops. Need the hostname configuration to match the hostname in postfix.
Create lvm volumes with lvcreate -L 150G -n xen1-disk xen-disk.
Create file systems on them with mkfs.ext3 /dev/xen-disk/xen1-disk.
Create swap with mkswap /dev/xen-disk/xen1-swap.
Installed ntp.
Copy backups with rsync -aSurvx --numeric-ids --delete /mnt/sdc1/mp3s/ /mnt/mp3s/.
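The GRUB_DEFAULT step above depends on counting menu entries from zero, which is easy to get wrong by eye. Here is a small sketch that prints each entry with its index. It assumes the generated menu is /boot/grub/grub.cfg and that the menu is flat; newer GRUB versions nest entries in submenus, which changes the numbering. list_grub_entries is a hypothetical helper name, not part of GRUB.

```shell
#!/bin/sh
# Print each GRUB menu entry with its zero-based index.
# list_grub_entries is a hypothetical helper, not part of GRUB.
# Assumes a flat menu (no submenus) and titles quoted with ' or ".
list_grub_entries() {
    awk -F"['\"]" '/^[ \t]*menuentry /{ printf "%d: %s\n", n++, $2 }' \
        "${1:-/boot/grub/grub.cfg}"
}

# Usage: list_grub_entries            (reads /boot/grub/grub.cfg)
#        list_grub_entries some.cfg   (reads another file)
```

The number printed next to the hypervisor entry is the value to put in GRUB_DEFAULT before running update-grub.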
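The volume steps above repeat for every domU, so they are worth wrapping up. This is a sketch only: the xen-disk volume group and the 150G disk size come from the list, but the swap size of 2G and the helper names run and make_domu_volumes are assumptions made for illustration. By default it just prints the commands; nothing touches the disks unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run sketch of the per-domU volume setup.
# Prints the commands by default; set RUN=1 (as root) to execute them.
VG=xen-disk   # volume group name used in the post

run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# make_domu_volumes NAME DISK_SIZE SWAP_SIZE  (hypothetical helper)
make_domu_volumes() {
    name=$1 disk_size=$2 swap_size=$3
    run lvcreate -L "$disk_size" -n "$name-disk" "$VG"
    run lvcreate -L "$swap_size" -n "$name-swap" "$VG"
    run mkfs.ext3 "/dev/$VG/$name-disk"
    run mkswap "/dev/$VG/$name-swap"
}

# 150G is from the post; the 2G swap size is an assumed value.
make_domu_volumes xen1 150G 2G
```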

And at this point, while restoring the data from the backup to the disks, it started throwing SMART errors. Which at least vindicates our purchase of new hardware to replace this box. I was starting to worry that the problems we’d seen on this hardware were entirely due to the same disk problems we were seeing on the new hardware.

Continuing on:

Reformat the partitions with mkfs.ext3 -c.
Still get the error on restoring the backup.
Deleted the LV that was causing the problems, and tried creating a bunch of smaller ones.
Made file systems on the smaller (50G) LVs and rsynced about 45GB of data onto each one. Didn’t get any errors, so wondered if the errors were coming from the source disk.
Did a tar cvfz /dev/null . of the backup that was throwing the errors. That didn’t give any errors either.
Removed the “junk” LVs and created the big one again. Did a mkfs.ext3 -c on it.
rsyncing the data over got the error on the same file again. And this time I’m almost sure it’s the backup disk, not the destination.
Tried to copy the offending file to /tmp, and got the same error. So yes, it’s the backup disk.
At this point, I have enough of the system restored that it’s painless to do the rest of the rsyncs from last night’s backups on my home server. So that’s what I’m doing. I’ve done rsync -aSurvx --numeric-ids --delete xen1/Sun 192.168.1.119:/mnt/xen1 and it transferred about 10 files and deleted a couple of postgresql log files.
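One possible reason the tar pass above came up clean: the GNU tar manual notes that tar tries to minimize input and output when the archive is /dev/null, so it may not have read every block (whether that applies with the z flag is an assumption here, not something verified). A sketch that forces a full read of each file and lists the ones that fail follows. find_unreadable is a hypothetical helper name.

```shell
#!/bin/sh
# Read every regular file under a directory and print the paths that
# cannot be read in full (for example because of a disk I/O error).
# find_unreadable is a hypothetical helper, not an existing tool.
find_unreadable() {
    # -xdev keeps find on one file system, like rsync's -x in the post.
    find "$1" -xdev -type f -exec sh -c '
        for f in "$@"; do
            cat "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Usage: find_unreadable /mnt/sdc1
```

An empty result means every file could be read from start to end, which is a stronger check on the source disk than the tar run.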

With that all done, it was time to get serious about setting up xen and running the domUs.

Copied the domU configuration files from backup to /etc/xen.
Modified them for the new kernel version (hey, is this the version with no global locks? That could be a huge win). Copied the appropriate /lib/modules/ into each of the domU directories.
Tried to start a domU. It complained about being unable to start the network. Copied a line out of the backup of /etc/xen/xend-config.sxp to the new one.
Tried to start a domU. Ran out of memory.
Remembered that the live site has 8GB but this only has 4GB, so reduced the memory allocated to each domU.
Tried to start a domU. It gave a bunch of errors about being unable to start the RAID and the LVM. Thought about it for a while, and realized that since I’m specifying an initrd in the config file, and that initrd is the one I use to start the host OS, it thinks it needs to start RAID and LVM in order to mount any disks. Uh oh.
In desperation, installed xen-tools to see what it did when it created a configuration file. It used the same kernel and initrd as I had, but instead of calling the virtual disks “hda1” etc., it called them “xvda1”.
Modified all my xen configuration files and fstabs and was able to bring up all three domUs.
When I attempted to reboot, the computer threw a bunch of errors and locked up. It appears that it was trying to save the xen configuration in /var/lib/xen/save. I’ve seen that before. So I modified /etc/default/xendomains to change the XENDOMAINS_SAVE variable to prevent it from saving. Now it’s shutting down correctly.
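The hda-to-xvda change above is mechanical enough to script. This is a sketch under assumptions: it expects device names of the form hda1, hda2 and so on in the domU configuration files and fstabs, and to_xvda is a hypothetical helper name. It writes to stdout so the result can be checked before anything is replaced.

```shell
#!/bin/sh
# Rewrite hdaN device names to xvdaN and print the result to stdout.
# to_xvda is a hypothetical helper; review the output before replacing
# a real domU config or fstab with it.
to_xvda() {
    # Match hdaN only at line start or after a non-alphanumeric
    # character, so longer names that contain "hda" are left alone.
    sed -e 's/^hda\([0-9]\)/xvda\1/' \
        -e 's/\([^[:alnum:]]\)hda\([0-9]\)/\1xvda\2/g' "$@"
}

# Usage: to_xvda /etc/xen/some-domU.cfg > /tmp/some-domU.cfg.new
```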
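For the shutdown fix above, the relevant fragment of /etc/default/xendomains would look something like this. The post only says the XENDOMAINS_SAVE variable was changed; emptying it is the usual way to disable saving, and the XENDOMAINS_RESTORE line is an assumed companion setting rather than something the post mentions.

```shell
# /etc/default/xendomains (fragment)
# An empty XENDOMAINS_SAVE tells the xendomains init script to shut the
# domUs down instead of saving them to /var/lib/xen/save.
XENDOMAINS_SAVE=""
# With nothing saved there is nothing to restore at boot.
# (Assumed setting, not taken from the post.)
XENDOMAINS_RESTORE=false
```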


New disks set-up