1992/2006

In 1992, I worked for a company called GeoVision. I’d worked there for 6 years, but they were having financial problems. For the previous two quarters, the end of the quarter had been when they announced layoffs. And just like those two quarter-ends, the bean counters from both the Ottawa and Denver offices were huddled together the day before. This time they came around with a list and told everybody whether they had to go to the 2pm meeting or the 3pm meeting. I was invited to the 2pm meeting. It turned out that everybody invited to the 2pm meeting was laid off, and the 3pm meeting was to announce that they’d had to do this to ensure the continued health of the company (it didn’t work – 6 months later they were out of business).

Now flash forward to 2006. I’m on a contract at $EMPLOYER. I’ve been there for 4.5 years on this contract, and I was on a previous contract in the same office for 3 years. $EMPLOYER, as everybody knows, has been shrinking for decades. And they announced that our group (Entertainment Imaging) has to shrink by 10%: they’ve offered the voluntary retirement package (called “getting tapped”) to certain eligible job categories, and next year, if they haven’t met their targets, they’ll fire some people. The group is also becoming part of the Film Products Group (which really inspires confidence that our digital project is going to be a high priority). And then today, just to make my heart rate soar, they announced that there are problems extending our contracts, and the boss set up a series of meetings to “talk with each of you on Friday regarding our decision to extend your contract or not for 2007”. And I got one of the early ones.

Can you tell I’m not going to sleep well tonight?

Dammit!

What the hell is wrong with my colo box? For the second time in 10 days, it has gotten all weird on me and needed a reboot. This time, my “tail -F” on the various log files on my main domU was showing all sorts of ext3 errors. An attempt to log into the dom0 to reboot it got the now dreaded

ssh_exchange_identification: Connection closed by remote host

I had to call Annexa to power cycle it.

This is ridiculous. Is it the machine? The disk? The combination of Xen and lvm? I’m not finding any clues in the logs.

Archives finally satisfactory

When I last checked in I was having a little problem rebuilding my mailman archives.

After fixing up the corrupted mailing list (by finding a backup of the config.pck file), I decided I needed to blow away and retry building the archives again.

I should mention that the main reason this was such a hassle is that the file was too big to edit in vi. Every time I tried, the machine would slow to a crawl as vi consumed all the memory and most of the swap.

The first thing I discovered is that my modified script wasn’t doing the right thing in a lot of cases. But I also discovered that the mailman distribution includes a user-contributed script called "bin/cleanarch", which uses "mailbox.UnixMailbox._fromlinepattern" from another package to recognize proper From lines and only proper From lines. It even looks to see if the next line is a mail header, in case somebody decided to include the From line from a different message.

I ran my mbox through bin/cleanarch. Then I ran the mailbox splitter awk script to split it into 500 message chunks. Then I blew away the archives, and ran bin/arch on each chunk in turn. This took over an hour to finish, but at least it didn’t use up all the memory on the system. But I discovered that bin/arch was getting confused about 8 or so messages from early 2000 where a few people were using a non-Y2K compliant MUA that was filling in the date with a year of “100”.

So I fixed those dates using sed, and repeated the process. An hour and a half later, I discovered a couple of cases that bin/cleanarch didn’t handle, where somebody had quoted full mail or usenet news headers from an article.

So I fixed those cases individually using sed, and repeated the process. An hour and a half later, I discovered that there was one From line I missed. At this point, I said “to hell with it” and declared myself done.

I’m starting to think it would be really nice if Postfix were to escape From lines in the middle of a message. It knows the boundary of a message already because it deals with the envelope. I wonder if that’s an existing Postfix option? Or maybe it could be done by whatever it is in Mailman that writes to the mbox file?
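
For what it’s worth, here’s a minimal sketch of what that kind of escaping could look like. It follows the common “mboxrd” convention of putting a “>” in front of any body line that starts with “From ” (or already starts with “>From ”), so the quoting stays reversible. The class and method names are made up for illustration; this isn’t anything Postfix or Mailman actually ships.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: quotes From-like lines in a message body
// before the message is appended to an mbox file.
public class FromEscaper {
    // Matches "From " at the start of a line, optionally preceded by
    // '>' characters left behind by earlier rounds of quoting.
    private static final Pattern FROM_LIKE = Pattern.compile(">*From ");

    // Returns a copy of the body with every From-like line prefixed by '>'.
    // Only body lines go in here; the real separator line is written elsewhere.
    static List<String> escapeBody(List<String> bodyLines) {
        List<String> out = new ArrayList<>();
        for (String line : bodyLines) {
            // lookingAt() only matches at the beginning of the line.
            if (FROM_LIKE.matcher(line).lookingAt()) {
                out.add(">" + line);
            } else {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList(
                "Somebody quoted a separator line:",
                "From someone@example.com Tue Dec  5 10:54:49 2006",
                ">From an older quote",
                "Greetings From nowhere");
        for (String line : escapeBody(body)) {
            System.out.println(line);
        }
    }
}
```

With something like this in the write path, only the real separator lines would start with “From ”, and a splitter wouldn’t have to guess.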

I thought I was the mighty debugging king…

…but I just handed my debugging crown to him.

Kris and I have been banging our heads on our desks because of problems we’re having with our JTreeTables. A JTreeTable is a class that we found on a Sun Java forum that combines the attributes of the JTable with a JTree – basically giving you JTree behaviour with columnar (table) data. It’s really handy. But frequently, and often in cases I could easily reproduce, the damn thing wasn’t updating correctly. Kris and I both made sure that our updates were properly protected by synchronization locks, and the events were being fired in the event loop, lessons we’ve both learned by hard experience. But it was still acting strangely. Kris spent a lot of time reading forum posts, running the debugger down deep in Java library code, and basically working this problem from all angles for days upon days.

Yesterday he found the problem. And the problem was in my code. When you fire an event, you need to give it an array of Objects that starts at the root node of the tree, and follows down through the tree to the node that actually changed. But of course, when you actually change a node, you’re already at the node that changed, and it’s pretty easy to trace up from node.parent() to node.parent() until you reach the top, so that’s what I do. And then I attempt to reverse the order of what I’ve got to make the required array. But it appears that I fundamentally misunderstood the Stack class, because pushing objects on the Stack and then doing a “toArray” on it doesn’t reverse the order, as I’d thought. So the view was getting a totally messed up event, and that was messing everything else up.

Kris changed my Stack.push into ArrayList.add(0, node), and everything works now. And I never thought about doing it that way because I thought List.add(0, Object) would replace the object at position 0, not push them all up.
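
Here’s a small sketch that shows the difference. The Node class is a hypothetical stand-in with just a name and a parent link, not the real JTreeTable code, and the method names are invented for the example. java.util.Stack extends Vector, so toArray() hands back the elements in the order they were pushed (bottom of the stack first), not in pop order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Stack;

public class PathOrderDemo {
    // Hypothetical tree node: only the parent link matters here.
    static final class Node {
        final String name;
        final Node parent;
        Node(String name, Node parent) { this.name = name; this.parent = parent; }
        @Override public String toString() { return name; }
    }

    // The mistaken approach: push while walking up, then toArray().
    // The array comes out in push order: changed node first, root last.
    static Object[] pathViaStack(Node changed) {
        Stack<Node> stack = new Stack<>();
        for (Node n = changed; n != null; n = n.parent) {
            stack.push(n);
        }
        return stack.toArray();
    }

    // The fix: insert each ancestor at index 0, which shifts the existing
    // elements up by one, so the root ends up first.
    static Object[] pathViaInsertAtFront(Node changed) {
        List<Node> path = new ArrayList<>();
        for (Node n = changed; n != null; n = n.parent) {
            path.add(0, n);
        }
        return path.toArray();
    }

    public static void main(String[] args) {
        Node root = new Node("root", null);
        Node branch = new Node("branch", root);
        Node leaf = new Node("leaf", branch);
        System.out.println(Arrays.toString(pathViaStack(leaf)));         // [leaf, branch, root]
        System.out.println(Arrays.toString(pathViaInsertAtFront(leaf))); // [root, branch, leaf]
    }
}
```

List.add(int, E) inserts and shifts; it’s List.set(int, E) that replaces the element at a position.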

And Kris’s small change (after big effort) closes three bug reports assigned to me, and a bunch that were assigned to him.

In my defence, I’d actually come up with the Stack thing while Vicki was driving us to Pittsburgh. So perhaps it wasn’t my best work.

NOOOOOOOOOOOOO!!!!

Yesterday, while trying to fix the problems with my mailman mailing list, I decided to rebuild the archives on the mailing list that was giving me problems. But I got the syntax of the “for i in *; do ... done” command wrong, and instead of running mailman’s arch command with carefully snipped out parts, it ran it with the whole archive. And that meant that arch quickly chewed up all the swap space available. I became unable to kill it, and quickly lost connections to both the domU in question, and the dom0. I couldn’t even ssh back into the dom0.

Not being clear of thought, I emailed the colo company asking if they could power cycle my box. 5 minutes later I realized that all my out-going mail goes through the colo box so it wasn’t going anywhere, and so I phoned them the request instead. They power cycled, I got control of my colo box again, and I got the list fixed up and the archives rebuilt.

But I noticed that this email to the colo company was still sitting in the outbound queue on the colo box, hours later. I didn’t think anything about it, until about 10 minutes ago I got a response to it, dated 10 minutes ago, saying “ok, I’ll power cycle it now”. I immediately fired back a “NOOOOOOOOOOO!!” email, but of course it was too late – the box went down, and now it’s back up.

And I notice that my email to the colo company is still just sitting there, with

88AB94F0AFB 1094 Tue Dec 5 10:54:49 ptomblin@xcski.com
(connect to mail.annexa.net[66.162.186.199]: Connection timed out)
annexa@annexa.net

Something tells me that email isn’t the best way to talk to these guys.