Another try at setting up the new server

  • Discovered that one of my hard disks was flakey and returned it. That’s probably why all my previous attempts to set this up failed.
  • Removed the daughter-card RAID controller. The built-in RAID controller still sees the disks, but reports them as JBOD (Just a Bunch Of Disks).
  • Started a new Debian installation.
  • Set up both whole disks as a software RAID1 (instead of just a partition on each disk like I did last time).
  • Made the whole RAID (md0) into a physical volume for LVM, in a volume group called xen-space.
  • Created a 4GB root partition and a 1GB swap partition as logical volumes on the physical volume.
  • Did a base install. Noted that because I used software RAID on the whole thing, it uses LILO instead of Grub. Oh well, you can’t have everything.
  • Rebooted and the BIOS only saw one of the two disks.
  • Fiddled with the disk sled, rebooted, and this time it saw both.
  • Evidently the first boot without the second disk caused the RAID to degrade, so I re-added the disk with “mdadm /dev/md0 --add /dev/sdb1”, and now it appears to be rebuilding.
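
For the record, the re-add and a way to watch the rebuild look roughly like this (assuming the array is /dev/md0 and the dropped component is /dev/sdb1, as above):

```shell
# See which component the kernel thinks is missing from the array
mdadm --detail /dev/md0

# Re-add the component that was dropped during the degraded boot
mdadm /dev/md0 --add /dev/sdb1

# Watch the rebuild progress (Ctrl-C to stop)
watch cat /proc/mdstat
```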

Day 2:

  • Installed smartmontools and enabled it in /etc/default/smartmontools. Expressed slight concern that /dev/sda gives an exit status of 64 because of some error in the log, probably due to the late unpleasantness. Will have to figure out how to clear that.
  • Installed munin-node and munin-plugins-extra, and copied the configuration from my backup from last time.
  • Installed openssh-server (unselecting xauth, which gets added automatically and drags in a ton of X11 libraries). Copied /etc/ssh/sshd_config and the /root/.ssh directory from backup.
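
For reference, a sketch of how I'd poke at that exit status (smartctl's exit status is a bitmask, and bit 6, value 64, means the drive's error log contains entries; as far as I can tell the on-drive log can't be cleared — old entries just age out as new ones are logged):

```shell
# Show the drive's SMART error log entries
smartctl -l error /dev/sda

# Run the overall health check and look at the exit status;
# bit 6 (value 64) set means the error log is non-empty
smartctl -H /dev/sda
echo $?
```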

Day 3:

  • Installed xen-utils. Holy shit that dragged in a lot of dependencies, and it said it had to “reinstall” 200+ packages for some damn reason. But then it gave an error, and when it came back it didn’t have to reinstall them after all. Very odd.
  • Didn’t see any xen entry in /etc/lilo.conf, so installed linux-image-2.6-xen-amd64. (I had originally thought that installing xen-utils would do that; I thought it did last time.)
  • LILO complains that /vmlinuz is too big. According to the docs, LILO and Xen don’t play together well, and grub has trouble with /dev/md0 software RAID. I think I may have to go back to the drawing board: either re-installing the RAID card, or going back to a primary boot partition and putting the software RAID on the rest of the disk. Or maybe I can figure out how to get grub working. Once again I’m reminded of Three Dead Trolls in a Baggie singing “yeah, but I’ve got a girlfriend and things to get done”.

Day 4:

  • Reinstalled the Adaptec RAID card, and set up a hardware RAID-1
  • Partitioned the “drive” with three partitions: one 4GB ext3 for /, one 1GB swap, and the rest as a physical volume for LVM.
  • Installed on /, and when it went to reboot it got to “shutting down md0” and then hung. Will have to check that again. But at least it installed Grub instead of LILO.
  • After it booted, tried the “reboot” command and it worked! Yay!
  • Installed smartmontools, but discovered (once again) that it doesn’t work with the RAID controller, so uninstalled it. I need to find out if there is some other way to monitor the RAID controller. I think I tried the dpt_i2o thing before and it didn’t work.
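
The LVM piece of that partitioning was done in the installer, but by hand it would be roughly this (the device name /dev/sda3 and the volume-group name are my guesses, not recorded above):

```shell
# Turn the third partition into an LVM physical volume
pvcreate /dev/sda3

# Create a volume group on it (name is an assumption)
vgcreate xen-space /dev/sda3

# Sanity-check what LVM now sees
pvdisplay
vgdisplay
```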

Day 5:

  • Installed sshd and copied the configuration from the backup to allow only public-key logins. (Bite it, password guessers.)
  • Installed munin-node
  • Installed linux-image-2.6-xen-amd64 and xen-hypervisor-3.2-1-amd64
  • Rebooted and the damn thing spewed tons of errors and hung. Tried to reboot with the old kernel (that worked before) and I got the same errors. I guess it’s time to give up on that hardware RAID again.
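
The public-key-only part of that sshd configuration amounts to something like the fragment below (the actual backed-up config isn't reproduced here, so these exact lines are an assumption about what it contains):

```
# /etc/ssh/sshd_config — the lines that matter for key-only logins
PasswordAuthentication no
ChallengeResponseAuthentication no
PubkeyAuthentication yes
PermitRootLogin without-password
```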

Day 6:

  • Ran the disk “verify” tool in the RAID card, and it didn’t find any errors.
  • Everything I tried to boot the system with (the original kernel that worked before, single-user mode) still failed in aacraid.
  • Ripped out the raid card again, and installed with /, /boot, /var and swap as primary partitions, and the rest of the space on both drives as a software RAID-1 used as a physical volume for LVM.
  • Installed openssh-server (and unselected xauth). Copied /etc/ssh/sshd_config and /root/.ssh from backup.
  • Installed smartmontools and enabled it in /etc/default/smartmontools.
  • Installed munin-node.
  • Rebooted to make sure everything starts correctly.
  • Installed linux-image-2.6-xen-amd64 and xen-hypervisor-3.2-1-amd64
  • Rebooted again.
  • Ok, it booted, but “xm list” doesn’t work.
  • Manually started xend, and “xm list” works.
  • Rebooted, and this time “xm list” is working.
  • Started creating the LVM logical volumes for the domUs.
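
Creating those logical volumes would be along these lines (the volume-group name, domU names, and sizes are all assumptions — the post doesn't record them):

```shell
# One root LV and one swap LV per domU; names and sizes are guesses
lvcreate -L 10G -n xen1-disk xen-space
lvcreate -L 1G  -n xen1-swap xen-space

# Check the result
lvs xen-space
```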

Day 7:

  • Discovered that when I backed up the last nearly successful domU, I forgot to back up the boot partition, so I’m on my own for the grub configuration.
  • Untarred my backups of the “xen2” and “xen3” domUs. Got a bunch of kernel messages about kjournald being blocked for more than X number of seconds while that was going on – I assume that’s because I was running load averages in the 7 and 8 range in the dom0, which is probably not a normal thing. I hope that things not being written to the journal immediately doesn’t mean they were written wrong, only that I would have been in danger if things had died in the middle.
  • Installed rsync so I can restore my backup of the “xen1” domU.
  • Installed vim and removed vim-tiny
  • Restored the backup with “rsync --delete -aSurvx --numeric-ids /mnt/usb0/xen1/Sun/ /mnt/xen1/”.
  • Copied the amd64 kernel modules to the domU’s /lib/modules with “cp -rp /lib/modules/2.6.26-2-xen-amd64 /mnt/xen1/lib/modules”. Must remember to exclude /lib/modules when I do any final rsyncing from the live domUs.
  • DAMMIT! It appears that I made /var too small again. Once it saves /var/lib/xen/save in it, the file system is full. Need to move things around again.
  • Booted into rescue mode, and moved things around. Everything seems to work now.
  • Try to rsync some newer backups.
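
That final sync from a live domU, with the /lib/modules exclusion noted above, would look roughly like this (the source host and path are hypothetical — the post doesn't name the live machine; the flags follow the restore command used earlier):

```shell
# Pull the live domU's filesystem onto the new LV, but leave
# /lib/modules alone so the dom0-supplied amd64 modules aren't
# clobbered by the old ones
rsync --delete -aSurvx --numeric-ids --exclude=/lib/modules \
    root@oldserver:/ /mnt/xen1/
```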

Further updates as things progress.