Archives finally satisfactory

When I last checked in I was having a little problem rebuilding my mailman archives.

After fixing up the corrupted mailing list (by finding a backup of the config.pck file), I decided I needed to blow away and retry building the archives again.

I should mention that the main reason this was such a hassle is that the file was too big to edit in vi. Everytime I tried, the machine would slow to a crawl as vi consumed all the memory and most of the swap.

First thing I discovered is that my modified script wasn’t doing the right thing in a lot of cases. But I also discovered that in the mailman distribution is a user contributed script called "bin/cleanarch", which uses "mailbox.UnixMailbox._fromlinepattern" from another package to recognize proper From lines and only proper From lines. It even looks to see if the next line is a mail header, in case somebody decided to include the From line from a different message.

I ran my mbox through bin/cleanarch. Then I ran the mailbox splitter awk script to split it into 500 message chunks. Then I blew away the archives, and run bin/arch on each chunk in turn. This took over an hour to finish, but at least it didn’t use up all the memory on the system. But I discovered that bin/arch was getting confused about 8 or so messages from early 2000 where a few people were using a non-Y2K compliant MUA that was filling in the date with a year of “100”.

So I fixed those dates using sed, and repeated the process. An hour and a half later, I discovered a couple of cases that bin/cleanarch didn’t handle, where somebody had quoted full mail or usenet news headers from an article.

So I fixed those cases individually using sed, and repeated the process. An hour and a half later, I discovered that there was one From line I missed. At this point, I said “to hell with it” and declared myself done.

I’m starting to thing it would be really nice if Postfix were to escape From lines in the middle of a message. It knows the boundary of a message already because it deals with the envelope. I wonder if that’s an existing Postfix option? Or maybe it could be done by whatever it is in Mailman that writes to the mbox file?