Another damn failure on my colo box

Yesterday I had a panic attack: a certain repository of binary files was suddenly empty where it had been nearly 300GB a few hours previously. I knew that Vicki was uploading some stuff to it that day using an “ncftpput” command that I’d shown her, though I knew she didn’t understand what each command line argument meant. So I’m sorry to say my first reaction was “I bet she somehow wiped it”. But I looked in her ~/.ncftp/spool/log file and couldn’t see anything unusual. I guess I owe her an apology for that thought.

I looked on the domU, and “df” showed the partition still mounted, and still 91% full. But nothing showed up in “ls”. I unmounted it and shut off the nightly backups so that they wouldn’t delete the backed-up copy. “fsck.ext3 /dev/hdb” gave an error about a zero length partition. Then I thought I should probably be doing this on the dom0, so I logged into it and got the same error with “fsck.ext3 /dev/hdb1”. “fdisk -l /dev/hdb” on that entire drive showed that it didn’t think the drive was there at all. Uh oh. Moment of panic time: one of the other domUs has some of its disk space on that drive as well, thanks to LVM. I wondered what was screwing up on his domU if it couldn’t get to some of its disk space. Time to shut them all down and reboot.

I did an “xm console xen1” to connect to my domU and that’s where I saw the oh-so-familiar ext3 errors. But everything shut down relatively cleanly and rebooted. I saw one message in the log files about resetting the ide0 controller, and I’m not sure whether that was a cause or a symptom of the problem. And after the reboot, all 300GB of files were back. Thank goodness, because the upload bandwidth I’ve got at home these days means it would have taken months to get that partition restored from my backup.

The partition that screwed up this time is a normal primary disk partition, not an LVM logical volume, and it’s on a different physical drive than the one involved in the other failures, so at least I’ve eliminated LVM and the disk hardware as a cause. But that leaves the IDE controller and Xen.

I can’t wait for my new 1U server to come. Still not sure whether I should try Xen again or VMware. VMware probably isn’t as fast and it’s a lot more difficult to manage without getting the for-pay version, but at least it’s “ready for prime time”.

What a pain in the ass

This morning while perusing my logwatch mails I saw a strange result from the script that is supposed to email me the day’s changes from my DAFIF Replacement wiki. It was complaining about a missing Perl module in the twiki/bin directory. So I looked, and the twiki/bin directory was totally empty.

Some low life found a vulnerability in TWiki, and used it to remove everything in twiki/bin. I guess I should count myself lucky that he didn’t find any way to remove or corrupt other files that were writable by the web server, since he managed to do it *before* the nightly backup ran, and the backup would have faithfully copied the damage.

I was running a pretty ancient version of TWiki, so it was probably long past time to upgrade. The upgrade to 4.0.5 seems to have been pretty painless. But it’s not what I wanted to be doing this morning.

Dammit!

What the hell is wrong with my colo box? For the second time in 10 days, it has gotten all weird on me and needed a reboot. This time, my “tail -F” on the various log files on my main domU was showing all sorts of ext3 errors. An attempt to log into the dom0 to reboot it got the now dreaded

ssh_exchange_identification: Connection closed by remote host

I had to call Annexa to power cycle it.

This is ridiculous. Is it the machine? The disk? The combination of Xen and LVM? I’m not finding any clues in the logs.

Today’s interesting discovery

My navaid.com web site uses a tiny bit of Ajax to refresh a portion of a page showing how many waypoints have been generated so far while you’re generating a database. A couple of people reported that it wasn’t working right with IE 7. I discovered that IE 7 now attempts to implement XMLHttpRequest natively, the same way standards-compliant browsers (Firefox, Opera, Safari) do, and my first thought was that this was the culprit. I upgraded IE on my Windows box to IE 7 and tested it, and sure enough it didn’t work right, and turning off the option that says “Enable native XMLHttpRequest support” did make it work right.

But I can’t expect every user of my site to turn off this option, so I went searching for a better answer. And I discovered something else: IE is fanatical about caching pages, no matter what the web server tells it about the age of the page. So I added the following line to my page’s JavaScript:

this.req.setRequestHeader('If-Modified-Since',
    'Sat, 1 Jan 2000 00:00:00 GMT');

and that seems to have fixed it. Unfortunately, because IE is so fanatical about caching stuff, I’m betting that a bunch of my users won’t see the changed net.js until they’ve already decided it doesn’t work.
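For anybody who wants to try the same trick, here is a rough sketch of where that line sits in a request. This is not the actual net.js code: the function name sendUncached, the callback and the URL in the usage note are all made up for illustration. The one thing I’m sure about is the ordering: setRequestHeader has to be called after open() and before send().

```javascript
// Hypothetical helper, not the real net.js code.
// "req" is an XMLHttpRequest (or anything with the same methods).
function sendUncached(req, url, onDone) {
  // open() must come first; headers can't be set on an unopened request.
  req.open('GET', url, true);

  // Ask for anything newer than a date far in the past, so the browser
  // has to go back to the server instead of reusing its cached copy.
  req.setRequestHeader('If-Modified-Since', 'Sat, 1 Jan 2000 00:00:00 GMT');

  req.onreadystatechange = function () {
    // readyState 4 means the request is complete.
    if (req.readyState === 4 && req.status === 200) {
      onDone(req.responseText);
    }
  };

  req.send(null);
}
```

In a page it would be called with something like sendUncached(new XMLHttpRequest(), '/progress', showCount), where '/progress' and showCount are placeholders for whatever URL and display function your page uses.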