- Discovered that one of my hard disks was flakey and returned it. That’s probably why all my previous attempts to set this up failed.
- Removed the daughter card RAID controller. The built-in RAID controller still sees the disks, but reports them as a JBOD (Just a Bunch Of Disks).
- Started a new Debian installation.
- Set up both whole disks as a software RAID1 (instead of just a partition on each disk like I did last time).
- Made the whole RAID (md0) into a physical volume (xen-space) for the LVM.
- Created a 4GB root partition and a 1GB swap partition as logical volumes on the physical volume.
- Did a base install. Noted that because I used software RAID on the whole thing, it uses LILO instead of Grub. Oh well, you can’t have everything.
- Rebooted and the BIOS only saw one of the two disks.
- Fiddled with the disk sled, rebooted, and this time it saw both.
- Evidently the first boot without the second disk caused the RAID to degrade, so re-added the disk with
mdadm /dev/md0 --add /dev/sdb1
and now it appears to be rebuilding.
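To keep an eye on a rebuild like this one, the kernel's /proc/mdstat shows a progress line while an array is recovering. A minimal sketch, assuming the usual mdstat layout; the function name md_rebuild_progress is made up for illustration:

```shell
#!/bin/sh
# Print the rebuild progress lines for md arrays, if there are any.
# The optional argument is the status file to read; it defaults to
# /proc/mdstat, which the kernel md driver provides.
md_rebuild_progress() {
    mdstat=${1:-/proc/mdstat}
    if [ ! -r "$mdstat" ]; then
        echo "cannot read $mdstat" >&2
        return 1
    fi
    # Lines mentioning "recovery" or "resync" carry the percentage done.
    grep -E 'recovery|resync' "$mdstat" || echo "no rebuild in progress"
}

# Usage on the real machine (run the mdadm call as root):
#   md_rebuild_progress
#   mdadm --detail /dev/md0
```

Running mdadm --detail /dev/md0 should give the same information in more detail, including which member is being rebuilt.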
Day 2:
- Installed smartmontools, and enabled it in /etc/default/smartmontools. Slightly concerned that smartctl exits with status 64 for /dev/sda because of some error in the log, probably due to the late unpleasantness. Will have to figure out how to clear that.
- Installed munin-node and munin-plugins-extra, and copied the configuration from my backup from the last time.
- Installed openssh-server (unselect xauth which gets added automatically because it drags in a ton of X11 libraries). Copied /etc/ssh/sshd_config and /root/.ssh directories from backup.
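On the exit status of 64 from the smartmontools step: according to the smartctl man page, the exit status is a bitmask, and 64 is bit 6, meaning the device error log contains records of errors. A small sketch for decoding any such status; the function name decode_smartctl_status is made up for illustration, and the messages are paraphrased from the man page:

```shell
#!/bin/sh
# Decode a smartctl exit status (a bitmask) into one line per set bit.
decode_smartctl_status() {
    status=$1
    bit=0
    while [ "$bit" -le 7 ]; do
        if [ $(( $status & (1 << $bit) )) -ne 0 ]; then
            case $bit in
                0) msg="command line did not parse" ;;
                1) msg="device open failed or device in low-power mode" ;;
                2) msg="a SMART or ATA command failed, or checksum error" ;;
                3) msg="SMART status check returned DISK FAILING" ;;
                4) msg="prefail attributes at or below threshold" ;;
                5) msg="attributes were at or below threshold in the past" ;;
                6) msg="device error log contains records of errors" ;;
                7) msg="device self-test log contains records of errors" ;;
            esac
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
}

# On the real machine this would follow a smartctl run, for example:
#   smartctl -q silent -a /dev/sda; decode_smartctl_status $?
decode_smartctl_status 64
# prints: bit 6: device error log contains records of errors
```

If that reading is right, the status reflects old entries in the drive's own error log rather than a current failure.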
Day 3:
- Installed xen-utils. Holy shit that dragged in a lot of dependencies, and it said it had to “reinstall” 200+ packages for some damn reason. But then it gave an error, and when it came back it didn’t have to reinstall them after all. Very odd.
- Didn’t see any xen in /etc/lilo.conf, so installed linux-image-2.6-xen-amd64. (Had originally thought that installing xen-utils would do that; I thought it did last time.)
- Lilo complains that /vmlinuz is too big. According to the docs, lilo and xen don’t play together well, and grub has trouble with /dev/md0 software raid. I think I may have to go back to the drawing board, either re-installing the raid card, or going back to the primary boot partition and putting the software raid on the rest of the disk. Or maybe I can figure out how to get grub working. Once again I’m reminded of “Three Dead Trolls In a Baggie” singing “yeah, but I’ve got a girl friend and things to get done”.
Day 4:
- Reinstalled the Adaptec RAID card, and set up a hardware RAID-1.
- Partitioned the “drive” with three partitions: one 4G ext3 for /, one 1G swap, and the rest as a physical volume for LVM.
- Installed on /, and when it went to reboot it got to “shutting down md0” and then hung. Will have to check that again. But at least it installed Grub instead of LILO.
- After it booted, tried the “reboot” command and it worked! Yay!
- Installed smartmontools, but discovered (once again) that it doesn’t work with the raid controller, so uninstalled it. I need to find out if there is some other way to monitor the raid controller. I think I tried the dpt_i2o thing before and it didn’t work.
Day 5:
- Installed sshd, copied the configuration from the backup to only allow public key logins. (Bite it, password guessers)
- Installed munin-node
- Installed linux-image-2.6-xen-amd64 and xen-hypervisor-3.2-1-amd64
- Rebooted and the damn thing spewed tons of errors and hung. Tried to reboot with the old kernel (that worked before) and I got the same errors. I guess it’s time to give up on that hardware RAID again.
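For the sshd step, a quick way to confirm that a restored sshd_config really does turn off password logins is to look at the first PasswordAuthentication line, since sshd uses the first value it finds for each keyword. A rough sketch; the function name password_logins_disabled is made up, and it deliberately ignores Match blocks and included files:

```shell
#!/bin/sh
# Succeed if the given sshd_config disables password logins.
# The optional argument is the config file; it defaults to the
# standard /etc/ssh/sshd_config location.
password_logins_disabled() {
    conf=${1:-/etc/ssh/sshd_config}
    # Keywords are case-insensitive; commented-out lines start with "#"
    # and so never match. Stop at the first match.
    value=$(awk 'tolower($1) == "passwordauthentication" { print tolower($2); exit }' "$conf")
    [ "$value" = "no" ]
}

# Usage on the real machine:
#   password_logins_disabled && echo "password logins are off"
```

Running sshd -T as root prints the effective configuration and is the more thorough check.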
Day 6:
- Ran the disk “verify” tool in the raid card, and it didn’t find any errors.
- Anything I tried to boot the system (the original kernel that worked before, single user mode) still failed in aacraid.
- Ripped out the raid card again, and installed with /, /boot, /var and swap as primary partitions, and the rest of the space on both drives as a software RAID-1 used as a physical volume for LVM.
- Install openssh-server (and unselect xauth). Copy /etc/ssh/sshd_config and /root/.ssh from backup.
- Install smartmontools and enable it in /etc/default/smartmontools.
- Install munin-node.
- Rebooted to make sure everything starts correctly.
- Installed linux-image-2.6-xen-amd64 and xen-hypervisor-3.2-1-amd64
- Reboot again.
- Ok, it booted, but “xm list” isn’t up.
- Manually start xend and “xm list” is working.
- Rebooted, and this time “xm list” is working.
- Started to create the lvm logical volumes for the domUs
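Creating the domU logical volumes might look roughly like the sketch below. The volume group name (xen-space, borrowed from the first install), the sizes, and the volume naming are all assumptions, not what was actually used here. It only prints the commands unless DRY_RUN is set to something other than 1.

```shell
#!/bin/sh
# Sketch: one disk and one swap logical volume per domU.
# VG name and sizes are assumptions, not values from this log.
VG=${VG:-xen-space}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands

# Echo a command, and run it only when not in dry-run mode.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

make_domu_volumes() {
    for domu in xen1 xen2 xen3; do
        run lvcreate -L 10G -n "${domu}-disk" "$VG"
        run lvcreate -L 1G -n "${domu}-swap" "$VG"
    done
}

make_domu_volumes
```

To actually create the volumes, run it as root with DRY_RUN=0 after checking the printed commands.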
Day 7:
- Discovered that when I backed up the last nearly successful domU, I forgot to back up the boot partition, so I’m on my own for the grub configuration.
- Untarred my backups of the “xen2” and “xen3” domUs. Got a bunch of kernel messages about kjournald being blocked for more than X number of seconds while that was going on. I assume that’s because I was running up load averages in the 7 and 8 range in the dom0, which is probably not a normal thing. I hope that just because things weren’t written to the journal immediately that doesn’t mean they were written wrong, only that I might have been in danger if things had died in the middle.
- Installed rsync so I can restore my backup of the “xen1” domU.
- Installed vim and removed vim-tiny
- Restored backup with
rsync --delete -aSurvx --numeric-ids /mnt/usb0/xen1/Sun/ /mnt/xen1/
- Copy the amd64 kernel modules to the domU’s /lib/modules.
cp -rp /lib/modules/2.6.26-2-xen-amd64 /mnt/xen1/lib/modules
Must remember to exclude /lib/modules when I do any final rsyncing from the live domUs.
- DAMMIT! It appears that I made /var too small again. Once it saves /var/lib/xen/save in it, the file system is full. Need to move things around again.
- Booted into rescue mode, and moved things around. Everything seems to work now.
- Try to rsync some newer backups.
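For the final rsync from the live domUs, excluding /lib/modules could be done with rsync's --exclude option, which should also protect the copied amd64 modules from --delete. A sketch using the same flags as the restore above; the root@host source and the function name sync_domu are assumptions for illustration. It only prints the command unless DRY_RUN is set to something other than 1.

```shell
#!/bin/sh
# Sketch: pull a live domU into its mounted volume, leaving the
# amd64 kernel modules under /lib/modules untouched.
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the command

# Echo a command, and run it only when not in dry-run mode.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

sync_domu() {
    domu=$1
    # The leading slash anchors the pattern at the root of the transfer.
    run rsync --delete -aSurvx --numeric-ids \
        --exclude=/lib/modules/ \
        "root@${domu}:/" "/mnt/${domu}/"
}

sync_domu xen1
```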
Further updates as things progress.