Another damn failure on my colo box

Yesterday I had a panic attack: a certain repository of binary files was suddenly empty where it had held nearly 300Gb a few hours previously. I knew that Vicki was uploading some stuff to it that day using an “ncftpput” command that I’d shown her, and I knew she didn’t understand what each command line argument meant. So I’m sorry to say my first reaction was “I bet she somehow wiped it”. But I looked in her ~/.ncftp/spool/log file and couldn’t see anything unusual. I guess I owe her an apology for that thought.

I looked on the domU, and “df” showed the partition still mounted, and still 91% full. But nothing showed up in “ls”. I unmounted it and shut off the nightly backups so that they wouldn’t delete the existing backup. “fsck.ext3 /dev/hdb” gave an error about a zero length partition. Then I thought I should probably be doing this on the dom0, so I logged into it and got the same error with “fsck.ext3 /dev/hdb1”. “fsck -l /dev/hdb” on the entire drive showed that it didn’t think the drive was there at all. Uh oh. Moment of panic time: one of the other domUs has some of its disk space on that drive as well, thanks to LVM. I wondered what was screwing up on his domU if it couldn’t get to some of its disk space. Time to shut them all down and reboot.

I did an “xm console xen1” to connect to my domU and that’s where I saw the oh-so-familiar ext3 errors. But everything shut down relatively cleanly and rebooted. I saw one message in the log files about resetting the ide0 controller, and I’m not sure whether that was caused by the problem or was the cause of it. And after the reboot, all 300Gb of files were back. Thank goodness, because the upload bandwidth I’ve got at home these days means it would have taken months to get that partition restored from my backup.

The partition that screwed up this time is a normal primary disk partition, not an LVM logical volume, and it’s on a different physical drive than the other failures, so at least I’ve eliminated LVM and the disk hardware as a cause. But that leaves the IDE controller and Xen.

I can’t wait for my new 1U server to come. Still not sure whether I should try Xen again or switch to VMware. VMware probably isn’t as fast and it’s a lot more difficult to manage without getting the for-pay version, but at least it’s “ready for prime time”.

What a pain in the ass

This morning while perusing my logwatch mails I see a strange result from the script that is supposed to email me with the day’s changes from my DAFIF Replacement wiki. It was complaining about a missing perl module in the twiki/bin directory. So I look, and the twiki/bin directory is totally empty.

Some low life found a vulnerability in TWiki, and used it to remove everything in twiki/bin. I guess I should count myself lucky that he didn’t find any way to remove or corrupt other files that were writable by the web server, since he managed to do it *before* the nightly backup ran.

I was running a pretty ancient version of TWiki, so it was probably long past time to upgrade. The upgrade to 4.0.5 seems to have been pretty painless. But it’s not what I wanted to be doing this morning.

1992/2006

In 1992, I worked for a company called GeoVision. I’d worked there for 6 years, but they were having financial problems. In each of the previous two quarters, the end of the quarter had been the time when they announced layoffs. And just like at the previous two quarter ends, the bean counters from both the Ottawa and Denver offices were huddled together the day before, and this time they came around with a list and told everybody whether they had to go to the 2pm meeting or the 3pm meeting. I was invited to the 2pm meeting. It turned out that everybody invited to the 2pm meeting was laid off, and the 3pm meeting was to announce that they’d had to do this to ensure the continued health of the company (it didn’t work: 6 months later they were out of business).

Now flash forward to 2006. I’m on a contract at $EMPLOYER. I’ve been there for 4.5 years on this contract, and I was on a previous contract in the same office for 3 years. $EMPLOYER, as everybody knows, has been shrinking for decades. And they announced that our group (Entertainment Imaging) has to shrink by 10%. They’ve offered the voluntary retirement package (called “getting tapped”) to certain eligible job categories, and next year, if they haven’t met their targets, they’ll fire some people. The group is also becoming part of the Film Products Group (which really inspires confidence that our digital project is going to be a high priority). And then today, just to make my heart rate soar, they announced that there are problems extending our contracts, and the boss set up a series of meetings to “talk with each of you on Friday regarding our decision to extend your contract or not for 2007”. And I got one of the early ones.

Can you tell I’m not going to sleep well tonight?

Dammit!

What the hell is wrong with my colo box? For the second time in 10 days, it has gotten all weird on me and needed a reboot. This time, my “tail -F” on the various log files on my main domU was showing all sorts of ext3 errors. An attempt to log into the dom0 to reboot it got the now dreaded

ssh_exchange_identification: Connection closed by remote host

I had to call Annexa to power cycle it.

This is ridiculous. Is it the machine? The disk? The combination of Xen and lvm? I’m not finding any clues in the logs.

Archives finally satisfactory

When I last checked in I was having a little problem rebuilding my mailman archives.

After fixing up the corrupted mailing list (by finding a backup of the config.pck file), I decided I needed to blow away and retry building the archives again.

I should mention that the main reason this was such a hassle is that the mbox file was too big to edit in vi. Every time I tried, the machine would slow to a crawl as vi consumed all the memory and most of the swap.

The first thing I discovered is that my modified script wasn’t doing the right thing in a lot of cases. But I also discovered that the Mailman distribution includes a user-contributed script called "bin/cleanarch", which uses "mailbox.UnixMailbox._fromlinepattern" from another package to recognize proper From lines and only proper From lines. It even looks to see if the next line is a mail header, in case somebody decided to include the From line from a different message.
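For the curious, here is a rough sketch of that kind of check. It is not the actual bin/cleanarch code: the regular expression is my own approximation of a strict mbox separator line, and the function name is made up for illustration.

```python
import re

# Approximate pattern for a real mbox separator (an assumption, not the
# exact pattern bin/cleanarch uses):
#   "From addr Dow Mon DD HH:MM[:SS] [TZ] YYYY"
FROM_LINE = re.compile(
    r"^From \S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s+"
    r"\d{1,2}:\d{2}(:\d{2})?(\s+\S+)?\s+\d{4}\s*$"
)
# A mail header looks like "Name: value".
HEADER_LINE = re.compile(r"^[A-Za-z0-9-]+:")


def escape_bogus_from_lines(lines):
    """Prefix '>' to any 'From ' line that is not a real message separator.

    A line counts as a separator only if it matches FROM_LINE and the
    line after it looks like a mail header.
    """
    out = []
    for i, line in enumerate(lines):
        if line.startswith("From "):
            next_line = lines[i + 1] if i + 1 < len(lines) else ""
            is_separator = bool(FROM_LINE.match(line)) and bool(
                HEADER_LINE.match(next_line)
            )
            if not is_separator:
                line = ">" + line
        out.append(line)
    return out
```

Because it works line by line, something like this never needs the whole file open in an editor, which was the original problem.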

I ran my mbox through bin/cleanarch. Then I ran the mailbox splitter awk script to split it into 500 message chunks. Then I blew away the archives, and ran bin/arch on each chunk in turn. This took over an hour to finish, but at least it didn’t use up all the memory on the system. But I discovered that bin/arch was getting confused about 8 or so messages from early 2000 where a few people were using a non-Y2K compliant MUA that was filling in the date with a year of “100”.
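The splitting step is simple enough to sketch. I used an awk script; this is a Python equivalent of the idea, not that script, and the function name is made up. It assumes the mbox has already been cleaned, so that every line starting with "From " really is a message separator.

```python
def split_mbox(lines, chunk_size=500):
    """Group mbox lines into chunks of at most chunk_size messages.

    Assumes every line starting with 'From ' begins a new message,
    which is only true after stray From lines have been escaped.
    """
    chunks = []
    current = []
    count = 0
    for line in lines:
        if line.startswith("From "):
            if count == chunk_size:
                # Current chunk is full: start a new one at this message.
                chunks.append(current)
                current, count = [], 0
            count += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own file and handed to bin/arch in turn, so no single run has to hold the whole archive in memory.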

So I fixed those dates using sed, and repeated the process. An hour and a half later, I discovered a couple of cases that bin/cleanarch didn’t handle, where somebody had quoted full mail or usenet news headers from an article.
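I did this with sed, but the same fix looks roughly like this in Python. The exact shape of the broken Date headers is an assumption here (something like "Date: Mon, 3 Jan 100 12:00:00 -0500"), and the function name is made up.

```python
import re

# Assumed shape of the broken header: a month abbreviation followed by a
# year of "100", e.g. "Date: Mon, 3 Jan 100 12:00:00 -0500".
BAD_YEAR = re.compile(r"^(Date:.*\b[A-Z][a-z]{2}) 100 ")


def fix_year_100(line):
    """Rewrite a Date header whose year is '100' to use '2000' instead."""
    return BAD_YEAR.sub(r"\1 2000 ", line)
```

Note that this touches any line beginning with "Date:", including one quoted in a message body, so with only 8 or so affected messages it is worth eyeballing the matches before applying it.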

So I fixed those cases individually using sed, and repeated the process. An hour and a half later, I discovered that there was one From line I missed. At this point, I said “to hell with it” and declared myself done.

I’m starting to think it would be really nice if Postfix were to escape From lines in the middle of a message. It knows the boundary of a message already because it deals with the envelope. I wonder if that’s an existing Postfix option? Or maybe it could be done by whatever it is in Mailman that writes to the mbox file?
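What I have in mind is the usual mbox “From quoting”, done at the point where the message boundary is still known. A minimal sketch, assuming the writer has the whole message as one string with a blank line between headers and body (the function name is made up, and this is not Postfix or Mailman code):

```python
def escape_body_from_lines(message_text):
    """Prefix '>' to body lines that start with 'From '.

    Headers are left alone; only lines after the first blank line are
    touched. This is the simple "mboxo"-style quoting.
    """
    headers, sep, body = message_text.partition("\n\n")
    escaped = "\n".join(
        ">" + line if line.startswith("From ") else line
        for line in body.split("\n")
    )
    return headers + sep + escaped
```

Done at write time, this would mean any unescaped "From " line left in the mbox is a real separator, and none of the after-the-fact guessing would be needed.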