What have I been up to recently?

Since the Tupper Lake race, I’ve only paddled the ski. I need to get used to paddling in waves, with the Rochester Open Water Challenge less than two weeks away. Tonight, for instance, Paul D and I did some surfing, but we also spent some time paddling up and down the shore – our theory was that we would experience waves from the side, which are the hardest to handle, but we were in shallow water so if we dumped (and I dumped a few times) it was no trouble to get back in the boat. I haven’t been paddling with my GPS much, so I don’t know what has happened to my training volume other than the feeling that it’s way down. Paddling out in the surf requires different muscles and it’s not particularly fast, so an hour or an hour and a half is about all I can stand, and I probably make less than 3 or 4 miles in that time.

I’ve also settled on a name for the ski. In the past, I named my Skerray “Mary Ellen Carter”, after the song by Stan Rogers, because it enabled me to “rise again”, and the Looksha was “Gideon Brown” after the song by Great Big Sea, because she can “punch ahead in any gale”. I called the Thunderbolt “Anne-Marie” after the boat in Stan Rogers’ song “Acadian Saturday Night” because it has “wings on the water”. And so now I’m naming the ski Old Polina because I “fly along like a song” in it. Or at least I hope to.

I had a great visit with my dad, stepmother and kids this weekend. It’s great to see them, especially my daughters. They’re both maturing so well. I still worry about them, but I suppose that’s my job.

In other news, I’m still trying to finish setting up the replacement hardware. I’m experimenting with using LVM snapshots to back up the domU partitions while they’re active – I think what I’ll do is make snapshots, rsync them over to the new server’s partitions, then delete them, and then shut down the domUs and rsync them again while they’re shut down, and then start up the new guys. Rsyncing once from the snapshots should make the time between shutting down the old domUs and bringing up the new ones much shorter, because the second rsync only has to copy what changed in between. I’m also going to look into replacing my current rsync backup scripts with ones that use snapshots as well, because that way I never have to worry about inconsistencies in the file system, especially in the database engines.
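
Here is a minimal sketch of what that first snapshot pass might look like for one domU. It is not a tested script. The volume group and logical volume names (xen-space, xen1-space) match the ones that show up elsewhere in these notes, but the 1G snapshot size, the /mnt mount point, the destination newhost:/mnt/xen1/ and the function names are assumptions made up for illustration. It only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: snapshot a live domU volume, rsync it away, drop the snapshot.
# Dry-run by default, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# snapshot_sync VG LV DEST
#   VG, LV : volume group and logical volume backing the domU
#   DEST   : rsync destination (hypothetical, e.g. newhost:/mnt/xen1/)
snapshot_sync() {
    vg="$1"; lv="$2"; dest="$3"
    snap="${lv}-snap"
    mnt="/mnt/${snap}"

    # 1G of copy-on-write space is a guess; it has to hold every block
    # the domU changes while the rsync is running.
    run lvcreate --snapshot --size 1G --name "$snap" "/dev/${vg}/${lv}" || return 1
    run mkdir -p "$mnt" || return 1
    # Note: if the mount fails, the snapshot is left behind for inspection.
    run mount -o ro "/dev/${vg}/${snap}" "$mnt" || return 1

    run rsync -aSx --numeric-ids --delete "${mnt}/" "$dest"
    status=$?

    # Clean up even if the rsync failed, then report the rsync result.
    run umount "$mnt"
    run lvremove -f "/dev/${vg}/${snap}"
    return "$status"
}
```

Calling snapshot_sync xen-space xen1-space newhost:/mnt/xen1/ prints the six commands in order. The second pass, after the domU is shut down, would be the same rsync run straight from the real volume, with no snapshot needed.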

Another try at setting up the new server

  • Discovered that one of my hard disks was flakey and returned it. That’s probably why all my previous attempts to set this up failed.
  • Removed the daughter card RAID controller. The built-in RAID controller still sees the disks, but reports them as JBOD (Just a Bunch Of Disks).
  • Started a new Debian installation.
  • Set up both whole disks as a software RAID1 (instead of just a partition on each disk like I did last time).
  • Made the whole RAID (md0) into an LVM physical volume, in a volume group called xen-space.
  • Created a 4GB root partition and a 1GB swap partition as logical volumes in the xen-space volume group.
  • Did a base install. Noted that because I used software RAID on the whole thing, it uses LILO instead of Grub. Oh well, you can’t have everything.
  • Rebooted and the BIOS only saw one of the two disks.
  • Fiddled with the disk sled, rebooted, and this time it saw both.
  • Evidently the first boot without the second disk caused the RAID to degrade, so I re-added the disk with mdadm /dev/md0 --add /dev/sdb1, and now it appears to be rebuilding.
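
A degraded array like this is easy to miss until something else goes wrong. As a rough sketch (the function name degraded_arrays is made up, and it assumes the usual /proc/mdstat layout where each array's status line ends in something like [UU] or [U_]), a few lines of awk can list any md device that has a missing member:

```shell
#!/bin/sh
# Sketch only: read /proc/mdstat-style text on stdin and print the name of
# every md array whose member map (e.g. [U_]) shows a missing disk.
degraded_arrays() {
    awk '
        # Remember which array the following status line belongs to.
        /^md[0-9]+ :/ { name = $1 }
        # The member map looks like [UU] when healthy, [U_] when degraded.
        match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_") > 0) print name
        }
    '
}
```

Run it as degraded_arrays < /proc/mdstat. It prints nothing when every array has all its members, so it could be dropped into a cron job that only sends mail when there is output.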

Day 2:

  • Installed smartmontools, and enabled it in /etc/default/smartmontools. Slightly concerned that /dev/sda has an exit status of 64 because of some error in the log, probably due to the late unpleasantness. Will have to figure out how to clear that.
  • Installed munin-node and munin-plugins-extra, and copied the configuration from my backup from the last time.
  • Installed openssh-server (unselect xauth which gets added automatically because it drags in a ton of X11 libraries). Copied /etc/ssh/sshd_config and /root/.ssh directories from backup.

Day 3:

  • Installed xen-utils. Holy shit that dragged in a lot of dependencies, and it said it had to “reinstall” 200+ packages for some damn reason. But then it gave an error, and when it came back it didn’t have to reinstall them after all. Very odd.
  • Didn’t see any xen in /etc/lilo.conf, so installed linux-image-2.6-xen-amd64. (Had originally thought that installing xen-utils would do that, I thought it did last time.)
  • Lilo complains that /vmlinuz is too big. According to the docs, lilo and xen don’t play together well, and grub has trouble with /dev/md0 software raid. I think I may have to go back to the drawing board, either re-installing the raid card, or going back to the primary boot partition and putting the software raid on the rest of the disk. Or maybe I can figure out how to get grub working. Once again I’m reminded of “Three Dead Trolls In a Baggie” singing “yeah, but I’ve got a girl friend and things to get done”.

Day 4:

  • Reinstalled the Adaptec RAID card, and set up a hardware RAID-1
  • Partitioned the “drive” with three partitions: one 4G ext3 for /, one 1G swap, and the rest as a physical volume for LVM.
  • Installed on /, and when it went to reboot it got to “shutting down md0” and then hung. Will have to check that again. But at least it installed Grub instead of LILO.
  • After it booted, tried the “reboot” command and it worked! Yay!
  • Installed smartmontools, but discovered (once again) that it doesn’t work with the raid controller, so uninstalled it. I need to find out if there is some other way to monitor the raid controller. I think I tried the dpt_i2o thing before and it didn’t work.

Day 5:

  • Installed sshd, copied the configuration from the backup to only allow public key logins. (Bite it, password guessers)
  • Installed munin-node
  • Installed linux-image-2.6-xen-amd64 and xen-hypervisor-3.2-1-amd64
  • Rebooted and the damn thing spewed tons of errors and hung. Tried to reboot with the old kernel (that worked before) and I got the same errors. I guess it’s time to give up on that hardware RAID again.

Day 6:

  • Ran the disk “verify” tool in the raid card, and it didn’t find any errors.
  • Anything I tried to boot the system (the original kernel that worked before, single user mode) still failed in aacraid.
  • Ripped out the raid card again, and installed with /, /boot, /var and swap as primary partitions, and the rest of the space on both drives as a software RAID-1 used as a physical volume for LVM.
  • Install openssh-server (and unselect xauth). Copy /etc/ssh/sshd_config and /root/.ssh from backup.
  • Install smartmontools and enable it in /etc/default/smartmontools.
  • Install munin-node.
  • Rebooted to make sure everything starts correctly.
  • Installed linux-image-2.6-xen-amd64 and xen-hypervisor-3.2-1-amd64
  • Reboot again.
  • Ok, it booted, but “xm list” isn’t up.
  • Manually start xend and “xm list” is working.
  • Rebooted, and this time “xm list” is working.
  • Started to create the lvm logical volumes for the domUs

Day 7:

  • Discovered that when I backed up the last nearly successful domU, I forgot to back up the boot partition, so I’m on my own for the grub configuration.
  • Untarred my backups of the “xen2” and “xen3” domUs. Got a bunch of kernel messages about kjournald being blocked for more than X number of seconds while that was going on – I assume that’s because I was running up load averages in 7 and 8 range in the dom0, which is probably not a normal thing. I hope that just because things weren’t written to the journal immediately that doesn’t mean they were written wrong, only that I might have been in danger if things had died in the middle.
  • Installed rsync so I can restore my backup of the “xen1” domU.
  • Installed vim and removed vim-tiny
  • Restored backup with rsync --delete -aSurvx --numeric-ids /mnt/usb0/xen1/Sun/ /mnt/xen1/
  • Copied the amd64 kernel modules to the domU’s /lib/modules with cp -rp /lib/modules/2.6.26-2-xen-amd64 /mnt/xen1/lib/modules – must remember to exclude /lib/modules when I do any final rsyncing from the live domUs.
  • DAMMIT! It appears that I made /var too small again. Once it saves /var/lib/xen/save in it, the file system is full. Need to move things around again.
  • Booted into rescue mode, and moved things around. Everything seems to work now.
  • Try to rsync some newer backups.
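
The “exclude /lib/modules” reminder above could look something like the following when it comes time for the final sync. This is a sketch rather than the command actually used: the function name final_sync_cmd and the source xen1:/ (the live domU reached over ssh) are assumptions. It only prints the command so it can be looked over before running it.

```shell
#!/bin/sh
# Sketch only: print an rsync command for the final sync from a live domU
# that leaves the freshly copied kernel modules on the destination alone.
# final_sync_cmd SRC DEST
final_sync_cmd() {
    src="$1"; dest="$2"
    # The leading slash anchors the pattern at the top of the transfer, and
    # excluded paths are also protected from --delete on the receiving side.
    printf '%s\n' "rsync -aSx --numeric-ids --delete --exclude=/lib/modules/ ${src} ${dest}"
}
```

For example, final_sync_cmd xen1:/ /mnt/xen1/ prints the command for the first domU. Because the exclude is anchored, it only applies when the source is the root of the domU's file system.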

Further updates as things progress.

Programming Tests: Useless waste of time, or massive insult?

I just had to take another one of those stupid “BrainBench” programming tests for a job I’m applying for. (I’m not planning on leaving the place I’m working now, but as a contractor you always have to be ready to jump ship.) These tests are a complete and utter crock of shit. They don’t test if you can program, all they test is if you’ve memorized every obscure and complicated part of the literally millions of lines of Java API documentation out there. A typical question will show you four snippets of code, and ask you which is the correct one. And they’re not localized down into one part of the API that might be relevant to a particular job, either.

One of the questions was about how to set up a cookie handler on a persistent URLConnection. Another was on how to set the line width in a Graphics2D line. And another was on setting up a pipe between a sender and receiver. You know, if I’d needed to do any of those things in my 12+ years of programming in Java, I might have bothered to memorize that page of the API documentation. But I haven’t, so I haven’t. Instead, I’ve wasted my time learning useless things like how to write code so that when you or somebody else comes to add some functionality a few years from now, they can figure out why you did it the way you did it, and can add their stuff without breaking what you’ve got there. And how to debug an obscure exception that only happens after the program has been running for 45 days. And how to structure a program so that it’s fast, reliable, does what it’s supposed to do and doesn’t crash. You know, frivolous stuff like that.

Evidently I’m not alone in my hatred and disgust for these stupid tests. According to The Register, Ken Thompson, co-inventor of Unix and of the C programming language, isn’t allowed to check code in at Google because he refuses to take their stupid C language test. How stupid do you have to be to insult the man who invented the language by asking him to take some test that was probably written by some 24 year old “language lawyer” straight out of school who has memorized every obscure part of the language without being able to use any of it well?

So I’m guessing that I probably did horribly on that test, and that I’m probably not going to get an interview there, and that’s perfectly fine with me because I don’t want to work with people who were chosen because they’re good at memorizing language documents. They probably write horrible code, but think they’re great because they don’t have to stop and think and look things up.

Maybe VPS or cloud *is* the way to go, after all.

More problems setting up the new colo box:

  • I shut down with all three domUs running, and when it came back up, xend wasn’t running so they wouldn’t start up again. Further investigation showed that I hadn’t made my root partition big enough to handle when it saved the current xen state to /var/lib/xend. So I made a new lvm partition for /var/lib and mounted that instead. So far, so good.
  • While I was investigating this, I noticed my software raid was running in degraded state, because it had lost /dev/sda3. I re-added it and it started to rebuild it.
  • While it was rebuilding, I noticed that xen1, my first domU, wasn’t running any more. When I tried to recreate it, it told me I couldn’t recreate it because its main disk, /dev/xen-space/xen1-space, was mounted in a guest domain. Oh oh.
  • Then I got a kernel panic. Double oh oh.
  • I rebooted, and tried to rebuild the raid without xend running, but I got another kernel panic.

This is supposed to be easy and fun, right?

I’m currently booted with the rescue disk, and I’m trying to rebuild the raid again. If that dies, it might be time to cut my losses.

What to do, what to do….?

Back around the beginning of March, the box this blog (and lots of other things) is hosted on failed, and I got it back up by removing one of the CPUs. Since that time, I bought some replacement hardware, and have had it 90% set up for about a month here. But I haven’t quite figured out how to make the transition to the state where everything is running on the new box without another week of downtime. Ideally I’d like to have both boxes on a rack somewhere so I can shut down the domUs (guest domains) on the first box, rsync everything over, and bring them up on the new host, and then change the DNS entries.

One of the reasons I’m holding off on doing this is that my hosting site charged me $105 just for the privilege of taking my box off the rack and putting it back, and both operations took them *hours* to accomplish, mostly because their business office is the other side of town from where the rack is. And they don’t let you visit the rack yourself.

There is a second hosting company in town, and they advertise lower prices than I’m paying at my current host, and they say “If you want a site tour, let us know”. They also seem to have their rack space in the same building as their business office, so I have hopes that they would be able to rack the box in less than 12 hours. So I’ve let them know that I want a tour. Twice. The first time, they ignored me. The second time, somebody contacted me to say he was out of town the next week, but he’d have somebody else contact me, and that person never did. So I’m pretty much ready to give up on them. Which is too bad, because that would be ideal – I’d rack the box, do the rsyncs, move the DNS entries over, and when it appeared everything was working, cancel my contract with the old place.

So now I guess my option is to ask the old guys how much they’d charge me for a couple of weeks of having two machines on the rack with 4 more IPs. I’m betting it’s more than the $105 they charged me to unrack and rack my box last time.