On my test system, with the two new disks and with the three domU guest systems set up and running (but not connected to the outside world, so the load is obviously lower), I ran tiobench in the dom0. Executive summary:
read/write rates around 5 times faster, and no awful latency. The worst latency on the Caviar disks was 121588.99 ms, and on these disks it’s 6077.56 ms.
I can’t wait to get these disks over to the colo. If all goes well, it will just be a simple matter of shutting down the domUs, making a final copy from the old disks to the new disks, and then swapping the disks.
Full details below the cut.
Continue reading “That’s better!”
Having established that there is *something* making the disks run slowly on my colo box, I am resolved to fix it. One of my xen “tenants” generously donated two new Hitachi Deathstar^WDeskstar disks. In order to save some downtime, I’m setting up the new disks on the server that I replaced because I thought it was causing hardware problems (but which may or may not have been due to the crappy disks I was using). Setting up essentially a new server means I also have a chance to try out Debian 6, which became the “Stable” release a few weeks ago but which I haven’t had the nerve to upgrade the colo box to.
Fortunately, I have my previous post on Another try at setting up the new server to act as a checklist.
- Downloaded and burned the Debian 6 NetInst disk for AMD64.
- Installed the new disks in the old box and booted from the NetInst disk.
- Just in case they fixed the problems with LVM and software RAID and GRUB not playing nice together, tried installing as a two disk software RAID-1 with LVM on top of that.
- Installed with 4GB root partition and 2GB swap on LVM.
- One of the install options was “SSH Server”, and so I chose that one.
- Success! It boots with GRUB with that configuration.
- Discovered that ssh installation dragged in xauth and a bunch of X11 libraries, so removed those.
- Installed smartmontools and enabled them in /etc/default/smartmontools
- Installed xen-utils and kernel and all the stuff that drags in.
- Rebooted and discovered to my relief that it boots the xen kernel.
- Installed rsync for backups.
- Installed munin-node and munin-plugins-extra.
- Installed vim and removed vim-tiny.
- xm list isn’t working. Tried to manually start xend and got a screen full of errors. Tried to start it with /etc/init.d/xend start and nothing happened.
- Discovered it’s not starting xend because Grub is booting the xen kernel without the Hypervisor. If I choose the correct entry off the grub list, I get it. Now to figure out how to change the boot order in this new version of Grub.
- Took my backup disk and added it to the third drive sled so I’ll have SATA speeds when I restore from it.
- Edited /etc/default/grub and changed the GRUB_DEFAULT value to 4 (remember they’re numbered from 0) and then ran
- Copy ssh configuration in /etc/ssh/ and ~root/.ssh from backup.
- Copy munin-node configuration in /etc/munin/
- Uninstalled exim4 and installed postfix because I know how to configure postfix.
- Copy postfix configuration from backup.
- Oops. Need the hostname configuration to match the hostname in postfix.
- Create LVM volumes with lvcreate -L 150G -n xen1-disk xen-disk.
- Create file systems on them with
- Create swap with
- Installed ntp
- Copy backups with rsync -aSurvx --numeric-ids --delete /mnt/sdc1/mp3s/ /mnt/mp3s/.
And at this point, while restoring the data from the backup to the disks, it started throwing SMART errors. Which at least vindicates our purchase of new hardware to replace this box. I was starting to worry that the problems we’d seen on this hardware were entirely due to the same disk problems we were seeing on the new hardware.
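The boot-order change from earlier in the checklist can be sketched as a small shell function. This is only a sketch under assumptions: set_grub_default is a hypothetical helper name, and it is demonstrated on a scratch file rather than the real /etc/default/grub. On Debian the GRUB 2 menu is normally regenerated afterwards with update-grub; that this is the command the checklist step ran is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: set GRUB_DEFAULT in a GRUB 2 defaults file.
# Usage: set_grub_default FILE INDEX
set_grub_default() {
    file=$1
    index=$2
    # Replace the whole GRUB_DEFAULT= line; menu entries are numbered from 0.
    sed "s/^GRUB_DEFAULT=.*/GRUB_DEFAULT=$index/" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Demonstrate on a scratch copy, never directly on the live file.
scratch=$(mktemp)
printf 'GRUB_DEFAULT=0\nGRUB_TIMEOUT=5\n' > "$scratch"
set_grub_default "$scratch" 4
grep '^GRUB_DEFAULT=' "$scratch"
```

Counting menu entries by hand is fragile, because a kernel update can reorder them, so the index is worth rechecking after upgrades.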
- Reformat the partitions with
- Still get the error on restoring the backup.
- Deleted the lv that was causing the problems, and tried creating a bunch of smaller ones.
- Make file systems on the smaller (50G) lvs and rsynced about 45GB of data onto each one. Didn’t get any errors, so wondered if the errors were coming from the source disk.
- Did a tar cvfz /dev/null . of the backup that was throwing the errors. That didn’t give any errors either.
- Removed the “junk” lvs and created the big one again. Did a mkfs.ext3 -c on it.
- rsyncing the data over got the error on the same file again. And this time I’m almost sure it’s the backup disk, not the destination.
- Tried to copy the offending file to /tmp, and got the same error. So yes, it’s the backup disk.
- At this point, I have enough of the system restored that it’s painless to do the rest of the rsyncs from last night’s backups on my home server. So that’s what I’m doing. I’ve done rsync -aSurvx --numeric-ids --delete xen1/Sun 192.168.1.119:/mnt/xen1 and it transferred about 10 files and deleted a couple of postgresql log files.
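One way to narrow down which disk is at fault, as the steps above did by hand, is to read every file on the suspect disk and note which ones fail. The function below is a sketch of that idea: find_unreadable is a hypothetical name, and the demo uses a scratch directory standing in for the backup mount point. The GNU tar manual notes that tar tries to minimize input and output when the archive is /dev/null, so the earlier tar check may not have read the file contents at all; reading each file with cat avoids that question.

```shell
#!/bin/sh
# Hypothetical helper: print every regular file under DIR that cannot be
# read from start to end (for example because of a disk I/O error).
find_unreadable() {
    dir=$1
    # -xdev keeps find on one file system, like rsync's -x option.
    find "$dir" -xdev -type f -exec sh -c '
        for f in "$@"; do
            cat "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Demo on a scratch directory standing in for the backup mount point.
demo=$(mktemp -d)
printf 'hello\n' > "$demo/good.txt"
bad=$(find_unreadable "$demo")
echo "unreadable files: ${bad:-none}"
```

A clean pass only says the files were readable this time; the SMART log is still the better witness for a failing drive.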
With that all done, it was time to get serious about setting up xen and running the domUs.
- Copied the domU configuration files from backup to /etc/xen.
- Modified them for the new kernel version (hey, is this the version with no global locks? That could be a huge win). Copied the appropriate /lib/modules/ into each of the domU directories.
- Tried to start a domU. It complained about being unable to start the network. Copied a line out of the backup of /etc/xen/xend-config.sxp to the new one.
- Tried to start a domU. Ran out of memory.
- Remembered that the live site has 8GB but this only has 4GB, so reduced the size of the memory allocated to each domU.
- Tried to start a domU. It gave a bunch of errors about being unable to start the raid and the lvm. Thought about it for a while, and realized that since I’m specifying an initrd in the config file, and that initrd is the one I use to start the host OS, it thinks it needs to start a raid and lvm in order to mount any disks. Oh oh.
- In desperation, installed xen-tools to see what it did when it created a configuration file. It used the same kernel and initrd as I had, but instead of calling the virtual disks “hda1” etc, it called them “xvda1”.
- Modified all my xen configuration files and fstabs and was able to bring up all three domUs.
- When I attempted to reboot, the computer threw a bunch of errors and locked up. It appears that it was trying to save the xen configuration in /var/lib/xen/save. I’ve seen that before. So I modified /etc/default/xendomains to change the XENDOMAINS_SAVE variable to prevent it from saving. Now it’s shutting down correctly.
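The device renaming described above (hda1 and friends becoming xvda1) is mechanical enough to script. The sketch below assumes a domU config whose disk line uses the usual phy: form. The sample file contents, the xen1-swap volume name, and the rename_hd_to_xvd function name are made up for illustration and are not the actual configuration from this server.

```shell
#!/bin/sh
# Hypothetical helper: rewrite hdXN device names to xvdXN in a file,
# for example a domU config or that domU's /etc/fstab.
rename_hd_to_xvd() {
    file=$1
    # Match "hd" + letter + digits only at line start or after a
    # non-alphanumeric character, so other words containing "hd" survive.
    sed -E 's/(^|[^[:alnum:]])hd([a-z][0-9]+)/\1xvd\2/g' "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Sample config with assumed contents, written to a scratch file.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
disk = ['phy:/dev/xen-disk/xen1-disk,hda1,w', 'phy:/dev/xen-disk/xen1-swap,hda2,w']
root = '/dev/hda1 ro'
EOF
rename_hd_to_xvd "$cfg"
cat "$cfg"
```

The same function would be pointed at each domU's fstab as well, since the config and the fstab have to agree on the device names.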
Since putting in the new colo box, we (myself and the two “tenants” on the Xen user domains (domU)) have noticed it being very slow. At times it seems like the first time you try something it will be very slow, but if you try again immediately it will run quickly. For instance, sometimes a page load will time out, but you hit refresh and it will load quite quickly. I’ve started to suspect the problem is the disks, because the CPU is pretty fast. In order to pinpoint the problem, I’ve decided to benchmark the colo box against my home computer. Both computers have SATA 3Gb/s disks in a software RAID-1 (mirror). Both computers have dual core CPUs (although the home one is a Core2 Duo at 1.86GHz and the colo is a Xeon at 3.0GHz). However, the colo is also running tons of other stuff and it’s running in a Xen domU, so that might slow things down a bit.
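The slow-first-try, fast-second-try pattern described above is consistent with the second attempt being served from memory rather than from the disk. A rough way to look for that effect is to time the same read twice. This is a sketch, not the benchmark used in the post: time_read_ms is a hypothetical helper, it relies on GNU date's %N format, and the scratch file in the demo is far too small and too freshly written to show a real difference.

```shell
#!/bin/sh
# Hypothetical helper: print how many milliseconds it takes to read FILE.
# Relies on GNU date for nanosecond timestamps (%N).
time_read_ms() {
    start=$(date +%s%N)
    cat "$1" > /dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Demo on a 1 MiB scratch file. On a real system, pick a large file that
# has not been read recently so the first pass has to go to the disk.
sample=$(mktemp)
dd if=/dev/zero of="$sample" bs=1024 count=1024 2>/dev/null
first=$(time_read_ms "$sample")
second=$(time_read_ms "$sample")
echo "first read: ${first} ms, second read: ${second} ms"
```

If the first read of a cold file is dramatically slower than the second, the disks are the more likely suspect than the CPU.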
Continue reading “Houston, we have a problem”
Today, Vicki and I went to Beers of the World to spend her Groupon. I got a variety of different beers, and I’m going to be mentioning them here just so I can look them up for next time.
Today’s beer was “Uerige”. I bought it because I was trying to remember a beer a German co-worker brought me from Germany back in the early 1980s when I was working at the Ministry of Transportation and Communications. It was a really nice beer, and I would probably buy it again. But it turns out that what I was remembering was Lohrer Urtyp 1878 beer. And it’s not listed on the Beers of the World web site. Oh well, maybe they’ll have that some day.
The other good thing was the movie “Battle: Los Angeles”. A few months ago, we went to see “Skyline”, and actually to tell you the truth, I’d originally thought “Skyline” was the movie I’d seen the really cool trailer for, but it turns out that “Battle: Los Angeles” (BLA) was that movie. “Skyline” suuuuuuuucked. It was just awful. I kept thinking “these people are horrible and unlikable, why don’t you show us what the Marines are doing?” BLA was the movie I was hoping for in Skyline. It was all about the Marines. And it wasn’t just better than Skyline, it was good. Yeah, it was escapist, yeah it was cliched in places, yeah you spend the first 20 minutes trying to play “match the noble death to the flawed but fundamentally good character”, yeah the premise that aliens would invade for our water was laughable, but it was still engaging and fun.
Bought two new hard drives to add to my Linux box. Could only find one of the two SATA cables that I thought I had, so I went to FrozenCPU.com today to pick up some new ones. Got home, opened up the computer and found the missing SATA cable, and also discovered that there is only one power connector free. So tomorrow I’ll have to stop by FrozenCPU.com again to buy an adaptor. Fortunately they’re in East Rochester, and so is my physiotherapist, so it won’t be a wasted trip. But it does mean another day of failed backup jobs because I don’t have the extra disk space.