Archives finally satisfactory

When I last checked in I was having a little problem rebuilding my mailman archives.

After fixing up the corrupted mailing list (by finding a backup of the config.pck file), I decided I needed to blow away and retry building the archives again.

I should mention that the main reason this was such a hassle is that the mbox file was too big to edit in vi. Every time I tried, the machine would slow to a crawl as vi consumed all the memory and most of the swap.

The first thing I discovered is that my modified script wasn’t doing the right thing in a lot of cases. But I also discovered that the Mailman distribution includes a user-contributed script called "bin/cleanarch", which uses "mailbox.UnixMailbox._fromlinepattern" from another package to recognize proper From lines and only proper From lines. It even looks to see if the next line is a mail header, in case somebody decided to include the From line from a different message.
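To make the idea concrete, here is a rough sketch of that kind of check. It is not the actual bin/cleanarch code: the two regular expressions and the function name are my own approximations, and a real separator pattern may be stricter or looser than this one.

```python
import re

# Assumed approximation of an mbox separator line, e.g.
# "From alice@example.com Mon Jan  3 10:00:00 2000"
FROM_LINE = re.compile(
    r"^From \S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d{1,2}:\d{2}(:\d{2})?\s+\d{4}\s*$"
)
# Assumed approximation of a mail header line: "Name: value"
HEADER_LINE = re.compile(r"^[!-9;-~]+:\s")


def escape_bogus_from_lines(lines):
    """Prefix '>' to 'From ' lines that do not look like real separators.

    A line counts as a real separator only if it matches FROM_LINE and
    the line after it looks like a mail header.
    """
    out = []
    for i, line in enumerate(lines):
        if line.startswith("From "):
            next_line = lines[i + 1] if i + 1 < len(lines) else ""
            if not (FROM_LINE.match(line) and HEADER_LINE.match(next_line)):
                line = ">" + line
        out.append(line)
    return out
```

A "From what I can tell" sentence in a message body gets escaped, and so does a well-formed From line that is followed by ordinary text rather than a header.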

I ran my mbox through bin/cleanarch. Then I ran the mailbox splitter awk script to split it into 500-message chunks. Then I blew away the archives, and ran bin/arch on each chunk in turn. This took over an hour to finish, but at least it didn’t use up all the memory on the system. But I discovered that bin/arch was getting confused about 8 or so messages from early 2000 where a few people were using a non-Y2K compliant MUA that was filling in the date with a year of “100”.
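I used an awk script for the splitting, but the same step can be sketched in Python. This is only an illustration: the function name is made up, and it assumes the mbox has already been cleaned so that every line starting with "From " really is a message separator.

```python
def chunk_mbox(lines, chunk_size=500):
    """Yield lists of lines, each holding at most chunk_size messages.

    Assumes every line starting with 'From ' begins a new message.
    """
    chunk, count = [], 0
    for line in lines:
        if line.startswith("From "):
            # Start a new chunk once the current one is full.
            if count == chunk_size:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk
```

Because it only looks at one line at a time, it can be fed an open file object and never needs the whole mbox in memory, which was the whole point of splitting.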

So I fixed those dates using sed, and repeated the process. An hour and a half later, I discovered a couple of cases that bin/cleanarch didn’t handle, where somebody had quoted full mail or usenet news headers from an article.
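I did the date fix with sed, but for illustration, here is roughly the same substitution in Python. The header layout is an assumption (something like "Date: Mon, 3 Jan 100 10:00:00 -0500"); the pattern and function name are mine and would need adjusting to whatever the broken MUA actually produced.

```python
import re

# Assumed shape of the broken header: a month name, the bogus year "100",
# then a time. Only lines starting with "Date:" are touched.
BAD_YEAR = re.compile(r"^(Date:\s.*\b[A-Z][a-z]{2} )100( \d{1,2}:\d{2})")


def fix_year(line):
    """Rewrite a year of '100' in a Date: header to '2000'."""
    return BAD_YEAR.sub(r"\g<1>2000\g<2>", line)
```

Lines that are not Date: headers, or that already carry a four-digit year, pass through unchanged.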

So I fixed those cases individually using sed, and repeated the process. An hour and a half later, I discovered that there was one From line I missed. At this point, I said “to hell with it” and declared myself done.

I’m starting to think it would be really nice if Postfix were to escape From lines in the middle of a message. It knows the boundary of a message already because it deals with the envelope. I wonder if that’s an existing Postfix option? Or maybe it could be done by whatever it is in Mailman that writes to the mbox file?
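What I have in mind is something like the sketch below. It is not Postfix or Mailman code, and the function name and arguments are hypothetical. The point is only that when the writer already knows where one message starts and ends, it can escape every "From " line inside the message without any guessing.

```python
def mbox_entry(separator_line, message_lines):
    """Build one mbox entry from a single, already delimited message.

    separator_line is the 'From sender date' line built from the envelope;
    message_lines are the headers and body of exactly one message.
    """
    escaped = [
        ">" + line if line.startswith("From ") else line
        for line in message_lines
    ]
    # A blank line after the message separates it from the next entry.
    return [separator_line] + escaped + [""]
```

With entries written this way, the only unescaped "From " lines in the file are the separators themselves, so a later cleanup pass would have nothing to fix.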

One thought on “Archives finally satisfactory”

  1. You can give --start and --end to arch to process the archive in chunks without having to split the mbox up – see arch --help for full details.
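As a rough sketch of how that suggestion might be scripted, the function below only builds the command lines; it does not run anything. The list name, the mbox path, the argument order, and the numbering convention (zero-based start, end not included) are all assumptions here, so check arch --help before relying on any of them.

```python
def arch_commands(list_name, mbox_path, total_messages, chunk_size=500):
    """Return one bin/arch argument list per chunk of messages.

    list_name and mbox_path are placeholders; whether --end is inclusive
    is an assumption to verify against 'arch --help'.
    """
    commands = []
    for start in range(0, total_messages, chunk_size):
        end = min(start + chunk_size, total_messages)
        commands.append(
            ["bin/arch", "--start=%d" % start, "--end=%d" % end,
             list_name, mbox_path]
        )
    return commands
```

Each returned list could then be handed to something like subprocess.run, one chunk at a time, so memory use stays bounded without cutting the mbox into separate files.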
