Dammit!

What the hell is wrong with my colo box? For the second time in 10 days, it has gotten all weird on me and needed a reboot. This time, my “tail -F” on the various log files on my main domU was showing all sorts of ext3 errors. An attempt to log into the dom0 to reboot it got the now dreaded

ssh_exchange_identification: Connection closed by remote host

I had to call Annexa to power cycle it.

This is ridiculous. Is it the machine? The disk? The combination of Xen and lvm? I’m not finding any clues in the logs.

Today’s interesting discovery

My navaid.com web site uses a tiny bit of Ajax to refresh a portion of a page showing how many waypoints have been generated so far while you’re generating a database. A couple of people reported that it wasn’t working right in IE 7. My first thought was that IE 7 has attempted to implement XMLHttpRequest the same way as standards-compliant browsers (Firefox, Opera, Safari). So I upgraded IE on my Windows box to IE 7 and tested it, and sure enough it didn’t work right, and turning off the option that says “Enable native XMLHttpRequest support” did make it work right.

But I can’t expect every user of my site to turn off this option, so I went searching for a better answer. And I discovered something else – IE is fanatical about caching pages, no matter what the web server tells it about the age of the page. So I added the following line to my page’s JavaScript:

this.req.setRequestHeader('If-Modified-Since',
    'Sat, 1 Jan 2000 00:00:00 GMT');

and that seems to have fixed it. Unfortunately, because IE is so fanatical about caching stuff, I’m betting that a bunch of my users won’t see the changed net.js until they’ve already decided it doesn’t work.

Finally, something worked right!

The last thing I’m going to be moving from my linode VPS to my colo box is my Navaid.com waypoint generators. I’ve started doing some work on that – originally I was going to export the MySQL database from the linode, massage it, and import it into PostgreSQL on the colo box. But when I first started doing that, I found no end of trouble – the version of MySQL in Debian Sarge doesn’t have the “compatibility” mode in the dump command, plus I discovered that when I’d originally moved from PostgreSQL to MySQL I’d converted all the boolean fields to tinyint(1) or something, and I’d like to change that. Plus there were fields that were set to “not null default 0” which should really have allowed nulls, and the like. Basic clean-up stuff, but time consuming. So I decided to bring the database over as MySQL, bring the site up on MySQL, and write a conversion script so I could convert to PostgreSQL later.

Thursday evening, while experimenting with some of the navaid scripts on the colo site, I discovered some bad values. Investigation proved that the FAA had changed the data format of the “APT” airport record in the last load, and I hadn’t adapted my load script.

So Friday evening I went back to the “real” navaid site, fixed the load script, and ran the two scripts to reload the data. I started the run about 8pm. But it was going *really* slowly. So while I was waiting, I copied the changes over to the same scripts on the colo site and ran them there. The two scripts took less than an hour to run, which pleased me immensely, and I was able to verify on that site that the damage was fixed. But the scripts on the real site were still running. I waited, and finally went to bed. When I got up this morning, the first script was still running. As a matter of fact, it finally finished and went on to the second script at about 2pm. It’s still slowly grinding through the second one.

But I didn’t want my “real” site to be down for this length of time. So I hit on an idea – I enabled remote connections to the MySQL database on my colo box, and made a test version of the generation script on the real site with a slightly different “DBI->connect” line as the only change. And it worked, and it worked amazingly quickly. So I changed the whole site over, and restarted Apache, and so now the web services are running on the “real” site, but the database they are hitting is on the colo box. This will make the ultimate migration easier, and it means that navaid.com’s users are already getting a bit of a speed advantage from this move.
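For the curious, the change really is that small: the DBI DSN just gains a host parameter pointing at the colo box. Here’s a sketch (in Python, for illustration) of how the equivalent DBI-style DSN strings differ – the hostname and database name are made up:

```python
# Sketch of the one-line change: a DBI-style MySQL DSN with and without
# a remote host.  "navaid" and "colo.example.com" are placeholder names.
def dsn(database, host=None):
    """Build a DBI-style MySQL DSN string."""
    parts = ["DBI:mysql:database=" + database]
    if host is not None:
        parts.append("host=" + host)
    return ";".join(parts)

local_dsn = dsn("navaid")                       # old: local MySQL server
remote_dsn = dsn("navaid", "colo.example.com")  # new: MySQL on the colo box
```

Everything else in the scripts stays the same; only the connect line changes.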

The only hitch, and it’s a small one, is that there is a script that runs once a night to expire old saved sessions. It uses subqueries, which the MySQL on my colo box doesn’t support. I’m going to have to re-write it as a script that uses a left join to find the rows to delete, and then deletes them one at a time. The reason this worked on my VPS and not on my colo box is that on the VPS, I’m connecting to a MySQL server provided by the ISP, not my own. So while Debian Sarge installs MySQL 4.0.24, which predates subquery support, the server I was connecting to is newer than that.
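The rewrite is the classic left-join-with-IS-NULL trick. Assuming a schema along these lines (table and column names are made up – the real expiry script’s schema differs), the subquery delete becomes a left join that finds the unmatched rows, followed by one delete per row; sqlite3 stands in for MySQL here since the SQL shape is the same:

```python
# Hypothetical schema: delete sessions that have no saved waypoints.
# MySQL 4.0 has no subqueries, so instead of
#   DELETE FROM sessions WHERE id NOT IN (SELECT session_id FROM saved);
# find the orphans with a LEFT JOIN and delete them one at a time.
import sqlite3  # stand-in for MySQL; the SQL works the same way

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sessions (id INTEGER PRIMARY KEY);
    CREATE TABLE saved (session_id INTEGER);
    INSERT INTO sessions VALUES (1), (2), (3);
    INSERT INTO saved VALUES (2);
""")

# LEFT JOIN keeps every session; saved.session_id is NULL where no match.
orphans = [row[0] for row in db.execute("""
    SELECT sessions.id
      FROM sessions LEFT JOIN saved
        ON sessions.id = saved.session_id
     WHERE saved.session_id IS NULL
""")]

for sid in orphans:  # delete them one at a time, as the old MySQL requires
    db.execute("DELETE FROM sessions WHERE id = ?", (sid,))
db.commit()
```

Sessions 1 and 3 have no saved rows, so they get deleted; session 2 survives.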

Archives finally satisfactory

When I last checked in I was having a little problem rebuilding my mailman archives.

After fixing up the corrupted mailing list (by finding a backup of the config.pck file), I decided I needed to blow away and retry building the archives again.

I should mention that the main reason this was such a hassle is that the file was too big to edit in vi. Every time I tried, the machine would slow to a crawl as vi consumed all the memory and most of the swap.

The first thing I discovered was that my modified script wasn’t doing the right thing in a lot of cases. But I also discovered that the mailman distribution includes a user-contributed script called "bin/cleanarch", which uses "mailbox.UnixMailbox._fromlinepattern" from the Python mailbox module to recognize proper From lines and only proper From lines. It even looks to see if the next line is a mail header, in case somebody included the From line from a different message.
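The idea behind that pattern is to accept only lines that look like a real mbox envelope separator: "From ", a sender, and a ctime-style date. A rough reconstruction of it (this is my own approximation, not the exact regex from the old mailbox module):

```python
import re

# A true mbox "From " separator line is "From <sender> <ctime-style date>",
# e.g.  From pquinn@example.com Sat Jan  1 00:00:00 2000
# This regex is a reconstruction of the shape _fromlinepattern enforces.
FROM_LINE = re.compile(
    r"From \S+\s+"                 # "From " plus the envelope sender
    r"\w{3}\s+\w{3}\s+\d{1,2}\s+"  # weekday, month, day of month
    r"\d{1,2}:\d{2}(:\d{2})?"      # time, seconds optional
    r"(\s+\S+)?\s+\d{4}\s*$"       # optional timezone, then the year
)

def is_separator(line):
    """True only for lines that look like a real mbox envelope separator."""
    return FROM_LINE.match(line) is not None
```

A body line like "From my point of view…" fails the date check, which is exactly what my home-grown script was getting wrong.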

I ran my mbox through bin/cleanarch. Then I ran the mailbox splitter awk script to split it into 500-message chunks. Then I blew away the archives and ran bin/arch on each chunk in turn. This took over an hour to finish, but at least it didn’t use up all the memory on the system. But I discovered that bin/arch was getting confused by 8 or so messages from early 2000, where a few people were using a non-Y2K-compliant MUA that was filling in the date with a year of “100”.
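The splitter’s job (the original is an awk script) is just to start a new output chunk every N separator lines. A minimal Python equivalent, assuming cleanarch has already escaped the impostor From lines:

```python
def split_mbox(lines, chunk_size=500):
    """Split mbox lines into chunks of at most chunk_size messages.

    A new message starts at every line beginning with "From " --
    safe here because cleanarch has already escaped the impostors.
    """
    chunks, current, count = [], [], 0
    for line in lines:
        if line.startswith("From "):
            if count and count % chunk_size == 0:
                chunks.append(current)   # close off the full chunk
                current = []
            count += 1
        current.append(line)
    if current:
        chunks.append(current)           # the final, partial chunk
    return chunks
```

Each chunk can then be written to its own file and fed to bin/arch in turn, which keeps bin/arch’s memory use bounded.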

So I fixed those dates using sed, and repeated the process. An hour and a half later, I discovered a couple of cases that bin/cleanarch didn’t handle, where somebody had quoted full mail or usenet news headers from an article.
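The date fix itself was simple once I knew what to look for: the broken MUA wrote tm_year (years since 1900) straight into the date, so early-2000 messages carried a year of “100”. My sed one-liner was the moral equivalent of this sketch (the exact layout of the broken headers is a guess):

```python
import re

# Rewrite the bogus Y2K year "100" to "2000", but only on lines that
# look like date-bearing headers or separators, to avoid touching body
# text.  The header shapes here are assumptions for illustration.
def fix_y2k_year(line):
    if line.startswith("Date:") or line.startswith("From "):
        return re.sub(r"\b100\b", "2000", line)
    return line
```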

So I fixed those cases individually using sed, and repeated the process. An hour and a half later, I discovered that there was one From line I’d missed. At that point, I said “to hell with it” and declared myself done.

I’m starting to think it would be really nice if Postfix were to escape From lines in the middle of a message. It already knows the boundaries of a message, because it deals with the envelope. I wonder if that’s an existing Postfix option? Or maybe it could be done by whatever it is in Mailman that writes to the mbox file?
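For what it’s worth, this is the traditional “From munging” that mbox-writing delivery agents do: prefix any body line that starts with "From " with a ">" so that only real envelope separators begin with "From ". A sketch of the idea (my own illustration, not Postfix or Mailman code):

```python
def escape_from_lines(body_lines):
    """Apply traditional ">From " mbox quoting to a message body.

    Prefix ">" to any body line starting with "From ", so that only
    real envelope separators in the mbox file begin with "From ".
    """
    return [">" + line if line.startswith("From ") else line
            for line in body_lines]
```

If the thing appending to the mbox did this, cleanarch and all the sed surgery above would have been unnecessary.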