NOOOOOOOOOOOOO!!!!

Yesterday, while trying to fix the problems with my mailman mailing list, I decided to rebuild the archives on the list that was giving me trouble. But I got the syntax of the “for i in *; do ... done” command wrong, and instead of running mailman’s arch command on the carefully snipped-out parts, it ran it on the whole archive. arch quickly chewed up all the available swap space. I became unable to kill it, and soon lost my connections to both the domU in question and the dom0. I couldn’t even ssh back into the dom0.
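One way to make a loop like this harder to get wrong is to dry-run it first. The following is only a sketch, not the command that was actually run: the function name, the ARCH variable and the chunk file names are made up for illustration, and it assumes Mailman 2.1's arch is invoked as "arch LISTNAME MBOXFILE".

```shell
# run_arch_on_chunks LIST FILE...: run Mailman's arch once per chunk file.
# ARCH is a hypothetical override; it defaults to "echo arch", so by default
# the loop only prints the commands it would run (a dry run).
run_arch_on_chunks() {
    list=$1; shift
    for f in "$@"; do
        # Skip anything that is not a regular file, e.g. an unmatched glob.
        [ -f "$f" ] || { echo "skipping $f: not a file" >&2; continue; }
        # Intentionally unquoted so "echo arch" splits into two words.
        ${ARCH:-echo arch} "$list" "$f" || return 1
    done
}

# Dry run first, then for real (the arch path below is an assumption):
# run_arch_on_chunks mylist chunk.*
# ARCH=/usr/lib/mailman/bin/arch run_arch_on_chunks mylist chunk.*
```

Checking the dry-run output shows exactly which files arch would be given before anything expensive starts.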

Not being clear of thought, I emailed the colo company asking if they could power cycle my box. 5 minutes later I realized that all my out-going mail goes through the colo box so it wasn’t going anywhere, and so I phoned them the request instead. They power cycled, I got control of my colo box again, and I got the list fixed up and the archives rebuilt.

But I noticed that this email to the colo company was still sitting in the outbound queue on the colo box, hours later. I didn’t think anything of it until, about 10 minutes ago, I got a response to it, dated 10 minutes ago, saying “ok, I’ll power cycle it now”. I immediately fired back a “NOOOOOOOOOOO!!” email, but of course it was too late: the box went down, and now it’s back up.

And I notice that my email to the colo company is still just sitting there, with

88AB94F0AFB 1094 Tue Dec 5 10:54:49 ptomblin@xcski.com
(connect to mail.annexa.net[66.162.186.199]: Connection timed out)
annexa@annexa.net

Something tells me that email isn’t the best way to talk to these guys.

Oh oh. That’s not good.

It appears that something I did to my mailing lists yesterday may have totally fucked them up.

While I was at lunch, I got two emails, one from a member of list [PH] saying that all the email was coming tagged with the subject line tag for list [TH], and one from a member of list [TH] saying that all the list mail was coming tagged with the subject line tag for list [PH]. And so I go to the listinfo page for list [PH], and see the listinfo page for list [TH] instead. But the listinfo page for [TH] seems ok.

I looked at the archives page for [PH], and it’s totally corrupt – not just not archived correctly, I’m talking corrupt HTML here.

And the /var/log/mailman/errors file has these errors:

Dec 04 12:14:38 2006 (23894) Uncaught runner exception: could not find MARK
Dec 04 12:14:38 2006 (23894) Traceback (most recent call last):
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 111, in _oneloop
    self._onefile(msg, msgdata)
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 167, in _onefile
    keepqueued = self._dispose(mlist, msg, msgdata)
  File "/usr/lib/mailman/Mailman/Queue/ArchRunner.py", line 73, in _dispose
    mlist.ArchiveMail(msg)
  File "/usr/lib/mailman/Mailman/Archiver/Archiver.py", line 214, in ArchiveMail
    h = HyperArch.HyperArchive(self)
  File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 599, in __init__
    self.__super_init(dir, reload=1, database=db)
  File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 289, in __init__
    d = pickle.load(f)
UnpicklingError: could not find MARK

This can’t be good.

Time well wasted

This weekend, I’ve accomplished two major things of the four or so that I wanted to get done.

Yesterday, I moved my picture gallery from my home machine at http://xcski.com/gallery/ to my colo machine at http://gallery.xcski.com/. That was surprisingly easy once I found the part of the Gallery FAQ that showed what I was doing wrong. The biggest glitch is that when I first brought it up, I had the “Square thumbnails” option turned on, and while I turned it off and told Gallery to regenerate all the thumbnails, most of them are still square. Nothing wrong with square thumbnails, but it means that it had to trim the pictures, so for example airplane pictures tend to show the middle part of the fuselage instead of the whole thing, and some full length portraits of people cut off the heads and feet. I may have to try regenerating thumbnails again.

This morning, I woke up to find my newish external USB drive that I use for backing up my colo box is dead, and so last night’s backups didn’t work. When this has happened with my other external USB drive, usually powering it down and up again fixes it, but this time it didn’t. So I did a “mkfs.ext3” on it, and started the nightly backups again. And mid-way through, the logs started filling up with errors saying that the drive was in the process of being unplugged(?!). I rebooted the server for the first time in 66 days, and that seems to have fixed it. Hopefully tonight’s backup (which will end up being huge because the old Sunday night backup is gone) will work.

Today, I’ve mostly been concentrating on trying to restore the archives to my mailman mailing lists. When I moved my mailing lists from my home server to my Virtual Private Server at linode.com, disk space was an issue so I trimmed the archives down to just the last two years. I kept the full archives on my home server just in case. And now that the lists are on my colo box, disk space isn’t an issue any more and so I tried to restore them.

My first attempt used vim with a split screen showing the old archive (which goes up to mid 2005) and the current archive (which starts in January 2005); I tried to cut the pre-2005 stuff out of the old one and paste it into the new one. That didn’t work very well: I quickly bogged the machine down in extensive swapping.

So then I cut the old archive down to stop at the beginning of 2005 using

sed -n '0,/^From.*2005/p' < old.mbox > old.mbox_to_2005
head -n -1 old.mbox_to_2005 > old.mbox

and then catted the old and the current one together.

That’s when I discovered that mailman’s arch program, which regenerates archives from the mbox files, is a huge memory hog and also has a couple of bugs. The first couple of times I tried to run it, it processed a few thousand messages and then died with a message about an empty module name. When I realized it was dying on the same message both times, I discovered that back in 2000 one of the mailing list users had a weird-ass bug in their mailer that was sending email with the header

Content-Type: TEXT/PLAIN; charset=".chrsc"

Evidently some sort of misconfigured character set. I used sed to change that to “us-ascii” and arch seemed a lot happier. At least until it happily consumed all the ram and most of the swap on the system. Everything dragged down to slower than a very slow thing.
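A substitution along these lines would do that rewrite. This is a sketch rather than the exact command used: the function name is made up, and it assumes GNU sed (for -i without a suffix argument).

```shell
# fix_charset FILE: replace the bogus charset value in place.
# Only lines starting with "Content-Type:" are touched, so message bodies
# that happen to mention .chrsc are left alone. Requires GNU sed for -i.
fix_charset() {
    sed -i '/^Content-Type:/ s/charset="\.chrsc"/charset="us-ascii"/' "$1"
}

# Example (file name is a placeholder):
# fix_charset combined.mbox
```

Restricting the substitution to header lines is a design choice, not a requirement; it will miss a charset parameter that sits on a folded continuation line.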

I found some awk code to split up mbox files into smaller chunks, and set it to run on this huge unified archive, and then ran arch on the chunks. That mostly worked, and only slowed the server down to a not-very-slow thing, except that arch did the wrong thing on any line that started with “From ” that wasn’t the start of a mail message. I didn’t discover this until it had been running for quite some time, so I had to start again.
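The awk code itself isn't reproduced here, but a splitter of this kind can be quite small. The sketch below is an assumption about its general shape, with a made-up function name and output naming scheme; it uses the same naive rule that any line starting with “From ” begins a new message.

```shell
# split_mbox FILE PREFIX N: write FILE out as PREFIX.0001, PREFIX.0002, ...
# starting a new chunk every N messages. Any line beginning with "From "
# counts as a message boundary, which misfires on body lines like that.
split_mbox() {
    awk -v prefix="$2" -v per="$3" '
        /^From / {
            if (count % per == 0) {          # time to start a new chunk
                if (out != "") close(out)
                out = sprintf("%s.%04d", prefix, ++chunk)
            }
            count++
        }
        out != "" { print > out }            # ignore anything before the first message
    ' "$1"
}

# Example (file names and chunk size are placeholders):
# split_mbox combined.mbox chunk 5000
```

Because awk streams the input line by line, memory use stays flat no matter how large the mbox file is.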

I used my extensive knowledge of awk (in other words, I cargo culted something) to make the mbox splitter also change any of these bogus “From ” lines into “>From ”. After another hour or so of running I discovered a small bug in my splitter that meant it worked for early archives and not for later archives, probably due to a mailman upgrade or when I switched from sendmail to postfix.
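The escaping step on its own might look something like the following. It is a sketch with a made-up function name and a heuristic of my choosing, not the actual splitter: a “From ” line is kept as a separator only if it is the first line or follows a blank line, and looks like “From address Weekday ...”.

```shell
# escape_from FILE: print FILE with bogus "From " lines changed to ">From ".
escape_from() {
    awk '
        /^From / {
            # Heuristic: a genuine separator is the first line or follows a
            # blank line, and has an address followed by a weekday name.
            genuine = (NR == 1 || prev == "") &&
                      ($0 ~ /^From [^ ]+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) /)
            if (!genuine) $0 = ">" $0
        }
        { print; prev = $0 }
    ' "$1"
}

# Example (file names are placeholders):
# escape_from combined.mbox > combined.escaped.mbox
```

The check deliberately ignores the year, so separators carrying an odd year such as “100” still pass. A complete mailbox header quoted in a message body right after a blank line would still look genuine to it.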

So I fixed that bug, and started again. It’s been running for over an hour now, and seems to be working fine. Well, except for 5 or 6 messages that came in January 2000 that had the year set to “100”, and one place where somebody actually quoted a mailbox header without putting a “>” at the beginning. Minor inconveniences.

I guess tomorrow morning I’ll have to check that the archive is correct, and the nightly backups worked. But for now, I’m going to bed.

Today’s interesting discovery

If you want to cut the first 9854797 lines of one file, and put them at the beginning of a 3023304 line file, opening them both up in vim and cutting and pasting from one to the other is probably not the most efficient way to do it. It’s been pasting for a while, and my load average is up over 8 and I’m using 1.4GB of swap.

Maybe I should kill this and figure out how to do it with awk or something.
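It doesn't even need awk: head and cat stream the data, so neither file has to be loaded into memory. A minimal sketch, with a made-up function name and placeholder file names:

```shell
# prepend_head N SRC DST: put the first N lines of SRC at the start of DST.
# The result is built in a temp file next to DST and then moved into place,
# so DST is never left half-written. Note that mktemp creates the file with
# restrictive permissions, which the final DST inherits.
prepend_head() {
    n=$1 src=$2 dst=$3
    tmp=$(mktemp "${dst}.XXXXXX") || return 1
    head -n "$n" "$src" > "$tmp" &&
    cat "$dst" >> "$tmp" &&
    mv "$tmp" "$dst"
}

# Example with the line count from above (file names are placeholders):
# prepend_head 9854797 old.mbox current.mbox
```

This needs enough free disk space for a second copy of the combined file while it runs.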