Archive for December, 2006

What a pain in the ass

Wednesday, December 27th, 2006

This morning while perusing my logwatch mails I see a strange result from the script that is supposed to email me with the day’s changes from my DAFIF Replacement wiki. It was complaining about a missing perl module in the twiki/bin directory. So I look, and the twiki/bin directory is totally empty.

Some low life found a vulnerability in TWiki, and used it to remove everything in twiki/bin. I guess I should count myself lucky that he didn’t find any way to remove or corrupt other files that were writable by the web server, since he managed to do it *before* the nightly backup ran.

I was running a pretty ancient version of TWiki, so it was probably long past time to upgrade. The upgrade to 4.0.5 seems to have been pretty painless. But it’s not what I wanted to be doing this morning.

Woo hoo!

Friday, December 22nd, 2006

My contract has been extended for the year. I can start breathing again.

1992/2006

Friday, December 22nd, 2006

In 1992, I worked for a company called GeoVision. I’d worked there for 6 years, but they were having financial problems. The previous two quarters, the end of the quarter had been the time when they announced layoffs. And just like the previous two end of quarters, the bean counters from both the Ottawa and Denver offices were huddled together the day before, and this time they came around with a list and told everybody whether they had to go to the 2pm meeting or the 3pm meeting. I was invited to the 2pm meeting. It turned out that everybody invited to the 2pm meeting was laid off, and the 3pm meeting was to announce that they’d had to do this to ensure the continued health of the company (it didn’t work - 6 months later they were out of business).

Now flash forward to 2006. I’m on a contract at Kodak. I’ve been there for 4.5 years on this contract, and I was in a previous contract in the same office for 3 years. Kodak, as everybody knows, has been shrinking for decades. And they announced that our group (Entertainment Imaging) has to shrink by 10%: they’ve offered the voluntary retirement package (called “getting tapped”) to certain eligible job categories, and next year if they haven’t met their targets they’ll fire some people. The group is also becoming part of the Film Products Group (which really inspires confidence that our digital project is going to be a high priority). And then today, just to make my heart rate soar, they announced that there are problems extending our contracts, and the boss set up a series of meetings to “talk with each of you on Friday regarding our decision to extend your contract or not for 2007”. And I got one of the early ones.

Can you tell I’m not going to sleep well tonight?

Dammit!

Thursday, December 14th, 2006

What the hell is wrong with my colo box? For the second time in 10 days, it has gotten all weird on me and needed a reboot. This time, my “tail -F” on the various log files on my main domU was showing all sorts of ext3 errors. An attempt to log into the dom0 to reboot it got the now dreaded

ssh_exchange_identification: Connection closed by remote host

I had to call Annexa to power cycle it.

This is ridiculous. Is it the machine? The disk? The combination of Xen and lvm? I’m not finding any clues in the logs.

Today’s interesting discovery

Thursday, December 14th, 2006

My navaid.com web site uses a tiny bit of Ajax in order to refresh a portion of a page showing how many waypoints have been generated so far, when you’re generating a database. A couple of people reported that it wasn’t working right with IE 7. I discovered that IE 7 has attempted to implement XMLHttpRequest natively, the same way standards-compliant browsers (Firefox, Opera, Safari) do, and my first thought was that this was the cause. I upgraded IE on my Windows box to IE 7 and tested it, and sure enough it didn’t work right, and turning off the option that says “Enable native XMLHttpRequest support” did make it work right.

But I can’t expect every user of my site to turn off this option, so I went searching for a better answer. And I discovered something else - IE is fanatical about caching pages, no matter what the web server tells it about the age of the page. So I added the following line to my page’s javascript:

this.req.setRequestHeader('If-Modified-Since',
    'Sat, 1 Jan 2000 00:00:00 GMT');

and that seems to have fixed it. Unfortunately, because IE is so fanatical about caching stuff, I’m betting that a bunch of my users won’t see the changed net.js until they’ve already decided it doesn’t work.

Thought of the day

Monday, December 11th, 2006

Somebody needs to make a “ball in a cup” game for the Wii.

That is all.

Finally, something worked right!

Saturday, December 9th, 2006

The last thing I’m going to be moving from my linode VPS to my colo box is my Navaid.com waypoint generators. I’ve started doing some work on that - originally I was going to export the MySQL database from the linode, massage the dump, and import it into PostgreSQL on the colo box. But when I first started doing that, I found no end of trouble - the version of MySQL in Debian Sarge doesn’t have the “compatibility” mode in the dump command, plus I discovered that when I’d originally moved from PostgreSQL to MySQL I’d converted all the boolean fields to tinyint(1) or something, and I’d like to change that. Plus there were fields that were set to “not null default 0” which should really have allowed nulls, and the like. Basic clean-up stuff, but time consuming. So I’d decided to bring the database over as MySQL, bring up the site on MySQL, and write a conversion script later on so I could convert to PostgreSQL then.

Thursday evening while experimenting with some of the navaid scripts on the colo site, I discovered some bad values. Investigation proved that the FAA had changed the data format for the “APT” airport record in the last load, and I hadn’t adapted my load script.

So Friday evening I went back to the “real” navaid site and fixed the load script, and ran the two scripts to reload the data. I started the run about 8pm. But it was going *really* slowly. So while I was waiting, I copied the changes over to the same scripts on the colo site and ran them there. The two scripts took less than an hour to run, which pleased me immensely, and I was able to verify on that site that the damage was fixed. But the scripts on the real site were still running. I waited and finally went to bed. When I got up this morning, the first script was still running. As a matter of fact, it finally finished and went on to the second script at about 2pm. It’s slowly grinding through the second one.

But I didn’t want my “real” site to be down for this length of time. So I hit on an idea - I enabled remote connections to the MySQL database on my colo box, and made a test version of the generation script on the real site with a slightly different “DBI->connect” line as the only change. And it worked, and it worked amazingly quickly. So I changed the whole site over, and restarted Apache, and so now the web services are running on the “real” site, but the database they are hitting is on the colo box. This will make the ultimate migration easier, and it means that navaid.com’s users are already getting a bit of a speed advantage from this move.

The only hitch, and it’s a small one, is that there is a script that runs once a night to expire old saved sessions. It uses subqueries, which the MySQL on my colo box doesn’t support. I’m going to have to re-write that as a script that uses a left outer join to find the ones to delete, and then deletes them one at a time. The reason why this worked on my VPS box and not on my colo box is that on the VPS, I’m connecting to a MySQL server provided by the ISP, not my own. So while Debian Sarge installs MySQL 4.0.24, the server I’m connecting to is newer than that.
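Here is a minimal sketch of that kind of rewrite, in Python with the standard sqlite3 module standing in for MySQL 4.0 so that it runs anywhere. The real schema isn’t shown in this post, so the table and column names (sessions, session_data, session_id) are invented for illustration, and the “expired” test is simplified to “the parent session row is gone”.

```python
import sqlite3


def expire_orphans(conn):
    """Delete session_data rows whose parent session no longer exists.

    Replaces a subquery such as:
        DELETE FROM session_data
        WHERE session_id NOT IN (SELECT id FROM sessions)
    Table and column names here are hypothetical.
    """
    cur = conn.cursor()
    # The LEFT JOIN keeps every session_data row; rows with no matching
    # session come back with s.id set to NULL.
    cur.execute(
        "SELECT d.id FROM session_data d "
        "LEFT JOIN sessions s ON d.session_id = s.id "
        "WHERE s.id IS NULL"
    )
    doomed = [row[0] for row in cur.fetchall()]
    # Delete the matches one at a time, no subquery needed.
    for row_id in doomed:
        cur.execute("DELETE FROM session_data WHERE id = ?", (row_id,))
    conn.commit()
    return doomed
```

The select-then-delete loop is slower than a single statement, but for a once-a-night cleanup that probably doesn’t matter.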

Reason #9729 why Paul should not be allowed to go to work without his iPod:

Friday, December 8th, 2006

He starts free-associating song lyrics

“There’s a Hindu Kush
All over the world, tonight…”

Archives finally satisfactory

Thursday, December 7th, 2006

When I last checked in I was having a little problem rebuilding my mailman archives.

After fixing up the corrupted mailing list (by finding a backup of the config.pck file), I decided I needed to blow away and retry building the archives again.

I should mention that the main reason this was such a hassle is that the file was too big to edit in vi. Every time I tried, the machine would slow to a crawl as vi consumed all the memory and most of the swap.

First thing I discovered is that my modified script wasn’t doing the right thing in a lot of cases. But I also discovered that in the mailman distribution is a user contributed script called "bin/cleanarch", which uses "mailbox.UnixMailbox._fromlinepattern" from another package to recognize proper From lines and only proper From lines. It even looks to see if the next line is a mail header, in case somebody decided to include the From line from a different message.
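The idea behind that check can be sketched in a few lines of Python. This is not the cleanarch source: the two regular expressions below are rough approximations written for this example, and the function name is made up. It streams with one line of lookahead, so it never holds the whole mbox in memory.

```python
import re

# Rough approximation of a real mbox separator:
# "From addr Day Mon DD HH:MM:SS YYYY". Written for this sketch only.
FROM_LINE = re.compile(
    r"^From \S+\s+\w{3} \w{3} [ \d]\d \d\d:\d\d(:\d\d)? \d{4}\s*$"
)
# Anything that looks like "Header-Name:" at the start of a line.
HEADER_LINE = re.compile(r"^[A-Za-z0-9-]+:")


def clean_lines(lines):
    """Yield lines, prefixing '>' to any 'From ' line that is not a
    well-formed separator immediately followed by a mail header."""
    it = iter(lines)
    current = next(it, None)
    while current is not None:
        following = next(it, None)
        if current.startswith("From "):
            real = (
                FROM_LINE.match(current)
                and following is not None
                and HEADER_LINE.match(following)
            )
            if not real:
                current = ">" + current
        yield current
        current = following
```

A check like this still lets through a pasted From line that is followed by pasted headers, since that looks exactly like a real message start.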

I ran my mbox through bin/cleanarch. Then I ran the mailbox splitter awk script to split it into 500-message chunks. Then I blew away the archives, and ran bin/arch on each chunk in turn. This took over an hour to finish, but at least it didn’t use up all the memory on the system. But I discovered that bin/arch was getting confused about 8 or so messages from early 2000 where a few people were using a non-Y2K compliant MUA that was filling in the date with a year of “100”.

So I fixed those dates using sed, and repeated the process. An hour and a half later, I discovered a couple of cases that bin/cleanarch didn’t handle, where somebody had quoted full mail or usenet news headers from an article.

So I fixed those cases individually using sed, and repeated the process. An hour and a half later, I discovered that there was one From line I missed. At this point, I said “to hell with it” and declared myself done.

I’m starting to think it would be really nice if Postfix were to escape From lines in the middle of a message. It knows the boundary of a message already because it deals with the envelope. I wonder if that’s an existing Postfix option? Or maybe it could be done by whatever it is in Mailman that writes to the mbox file?

I thought I was the mighty debugging king…

Wednesday, December 6th, 2006

…but I just handed my debugging crown to him.

Kris and I have been banging our heads on our desks because of problems we’re having with our JTreeTables. A JTreeTable is a class that we found on a Sun Java forum that combines the attributes of the JTable with a JTree - basically giving you JTree behaviour with columnar (table) data. It’s really handy. But frequently, and often in cases I could easily reproduce, the damn thing wasn’t updating correctly. Kris and I both made sure that our updates were properly protected by synchronization locks, and the events were being fired in the event loop, lessons we’ve both learned by hard experience. But it was still acting strangely. Kris spent a lot of time reading forum posts, running the debugger down deep in Java library code, and basically working this problem from all angles for days upon days.

Yesterday he found the problem. And the problem was in my code. When you fire an event, you need to give it an array of Objects that starts at the root node of the tree, and follows down through the tree to the node that actually changed. But of course, when you actually change a node, you’re already at the node that changed, and it’s pretty easy to trace up from node.parent() to node.parent() until you reach the top, so that’s what I do. And then I attempt to reverse the order of what I’ve got to make the required array. But it appears that I fundamentally misunderstood the Stack class, because pushing objects on the Stack and then doing a “toArray” on it doesn’t reverse the order, as I’d thought. So the view was getting a totally messed up event, and that was messing everything else up.

Kris changed my Stack.push into ArrayList.add(0, node), and everything works now. And I never thought about doing it that way because I thought List.add(0, Object) would replace the object at position 0, not push them all up.
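The same trap is easy to show outside Java. Below is a small Python sketch (the Node class and function name are invented for the example, not taken from our code): walking up the parent links and appending each node gives leaf-to-root order, which is what the pushed Stack held, while inserting at position 0 shifts everything along and gives the root-to-node order the event needs.

```python
class Node:
    """Hypothetical tree node with a link to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def path_from_root(node):
    """Return [root, ..., node] by walking parent links upward."""
    path = []
    while node is not None:
        # insert(0, x) shifts the existing items right; it does not
        # overwrite whatever was at position 0.
        path.insert(0, node)
        node = node.parent
    return path
```

For deep trees, appending and then reversing once at the end would avoid the repeated shifting, but for a handful of tree levels the difference is negligible.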

And Kris’s small change (after big effort) closes three bug reports assigned to me, and a bunch that were assigned to him.

In my defence, I’d actually come up with the Stack thing while Vicki was driving us to Pittsburgh. So perhaps it wasn’t my best work.

NOOOOOOOOOOOOO!!!!

Tuesday, December 5th, 2006

Yesterday, while trying to fix the problems with my mailman mailing list, I decided to rebuild the archives on the mailing list that was giving me problems. But I got the syntax of the “for i in *; do ... done” command wrong, and instead of running mailman’s arch command on the carefully snipped out parts, it ran it on the whole archive. And that meant that arch quickly chewed up all the swap space available. I became unable to kill it, and quickly lost connections to both the domU in question, and the dom0. I couldn’t even ssh back into the dom0.

Not being clear of thought, I emailed the colo company asking if they could power cycle my box. 5 minutes later I realized that all my out-going mail goes through the colo box so it wasn’t going anywhere, and so I phoned them the request instead. They power cycled, I got control of my colo box again, and I got the list fixed up and the archives rebuilt.

But I noticed that this email to the colo company was still sitting in the outbound queue on the colo box, hours later. I didn’t think anything about it, until about 10 minutes ago I got a response to it, dated 10 minutes ago, saying “ok, I’ll power cycle it now”. I immediately fired back a “NOOOOOOOOOOO!!” email, but of course it was too late - the box went down, and now it’s back up.

And I notice that my email to the colo company is still just sitting there, with

88AB94F0AFB 1094 Tue Dec 5 10:54:49 ptomblin@xcski.com
(connect to mail.annexa.net[66.162.186.199]: Connection timed out)
annexa@annexa.net

Something tells me that email isn’t the best way to talk to these guys.

Oh oh. That’s not good.

Monday, December 4th, 2006

It appears that something I did to my mailing lists yesterday may have totally fucked them up.

While I was at lunch, I got two emails, one from a member of list [PH] saying that all the email was coming tagged with the subject line tag for list [TH], and one from a member of list [TH] saying that all the list mail was coming tagged with the subject line tag for list [PH]. And so I go to the listinfo page for list [PH], and see the listinfo page for list [TH] instead. But the listinfo page for [TH] seems ok.

I looked at the archives page for [PH], and it’s totally corrupt - not just not archived correctly, I’m talking corrupt HTML here.

And the /var/log/mailman/errors file has these errors:

Dec 04 12:14:38 2006 (23894) Uncaught runner exception: could not find MARK
Dec 04 12:14:38 2006 (23894) Traceback (most recent call last):
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 111, in _oneloop
    self._onefile(msg, msgdata)
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 167, in _onefile
    keepqueued = self._dispose(mlist, msg, msgdata)
  File "/usr/lib/mailman/Mailman/Queue/ArchRunner.py", line 73, in _dispose
    mlist.ArchiveMail(msg)
  File "/usr/lib/mailman/Mailman/Archiver/Archiver.py", line 214, in ArchiveMail
    h = HyperArch.HyperArchive(self)
  File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 599, in __init__
    self.__super_init(dir, reload=1, database=db)
  File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 289, in __init__
    d = pickle.load(f)
UnpicklingError: could not find MARK

This can’t be good.

My life in a nutshell

Monday, December 4th, 2006

I wonder how long this image will continue to be there:

Time well wasted

Monday, December 4th, 2006

This weekend, I’ve accomplished two major things of the four or so that I wanted to get done.

Yesterday, I moved my picture gallery from my home machine http://xcski.com/gallery/ to my colo machine at http://gallery.xcski.com/. That was surprisingly easy once I found the part of the Gallery FAQ that showed what I was doing wrong. The biggest glitch is that when I first brought it up, I had the “Square thumbnails” option turned on, and while I turned it off and told Gallery to regenerate all the thumbnails, most of them are still square. Nothing wrong with square thumbnails, but it means that it had to trim the pictures, so for example airplane pictures tend to show the middle part of the fuselage instead of the whole thing, and some full length portraits of people cut off the heads and feet. I may have to try regenerating thumbnails again.

This morning, I woke up to find my newish external USB drive that I use for backing up my colo box is dead, and so last night’s backups didn’t work. When this has happened with my other external USB drive, usually powering it down and up again fixes it, but this time it didn’t. So I did a “mkfs.ext3” on it, and started the nightly backups again. And mid-way through, the logs started filling up with errors saying that the drive was in the process of being unplugged(?!). I rebooted the server for the first time in 66 days, and that seems to have fixed it. Hopefully tonight’s backup (which will end up being huge because the old Sunday night backup is gone) will work.

Today, I’ve mostly been concentrating on trying to restore the archives to my mailman mailing lists. When I moved my mailing lists from my home server to my Virtual Private Server at linode.com, disk space was an issue so I trimmed the archives down to just the last two years. I kept the full archives on my home server just in case. And now that the lists are on my colo box, disk space isn’t an issue any more and so I tried to restore them.

My first attempt, using vim and a split screen with the old archive (which goes up to mid 2005) and the current archive (which starts in January 2005) and attempting to cut the pre-2005 stuff out of the old one into the new one didn’t work very well. I quickly bogged the machine down in extensive swapping.

So then I cut the old archive down to stop at the beginning of 2005 using

sed -n '0,/^From.*2005/p' < old.mbox > old.mbox_to_2005
head -n -1 old.mbox_to_2005 > old.mbox

and then catted the old and the current one together.
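For anyone without GNU sed (the 0,/re/ address is a GNU extension), the same cut can be sketched in Python. The function names are made up for this example. It reads the file in binary, line by line, so odd character sets in old mail don’t matter and memory use stays flat. One difference from the sed and head pair: if nothing matches, this keeps every line instead of dropping the last one.

```python
import re


def lines_before_first_match(lines, pattern=rb"^From.*2005"):
    """Yield lines up to, but not including, the first one that matches."""
    stop = re.compile(pattern)
    for line in lines:
        if stop.match(line):
            return
        yield line


def trim_mbox(src_path, dst_path):
    """Copy src_path to dst_path, stopping before the first match."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.writelines(lines_before_first_match(src))
```

Like the sed pattern, this stops at the first “From” line with “2005” anywhere on it, so an address containing 2005 would end the copy early.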

That’s when I discovered that mailman’s arch program, that regenerates archives from the mbox files, is a huge memory hog and also has a couple of bugs. First couple of times I tried to run it, it processed a few thousand messages and then died with a message about an empty module name. When I realized it was dying on the same message both times, I discovered that back in 2000 one of the mailing list users had a weird-ass bug in their mailer that was sending email with the header

Content-Type: TEXT/PLAIN; charset=".chrsc"

Evidently some sort of misconfigured character set. I used sed to change that to “us-ascii” and arch seemed a lot happier. At least until it happily consumed all the ram and most of the swap on the system. Everything dragged down to slower than a very slow thing.

I found some awk code to split up mbox files into smaller chunks, and set it to run on this huge unified archive, and then ran arch on the chunks. That mostly worked, and only slowed the server down to a not-very-slow thing, except that arch did the wrong thing on any line that started with “From ” that wasn’t the start of a mail message. I didn’t discover this until it had been running for quite some time, so I had to start again.
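The awk splitter isn’t reproduced here, but the general shape of such a thing, sketched in Python, looks like the following. The function name and the default chunk size are assumptions for the example. Note that it treats every line starting with “From ” as a message boundary, which is exactly the naive assumption that caused the trouble described above.

```python
def split_mbox(lines, per_chunk=500):
    """Yield lists of lines, each holding up to per_chunk messages.

    A message is assumed to start at any line beginning with b"From ".
    """
    chunk, count = [], 0
    for line in lines:
        if line.startswith(b"From "):
            if count == per_chunk:
                # Current chunk is full: hand it over and start a new one.
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk
```

Each yielded chunk could then be written to its own file and fed to arch in turn, so only one chunk is ever in memory at a time.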

I used my extensive knowledge of awk (in other words, I cargo culted something) to make the mbox splitter also change any of these bogus “From ” lines into “>From ”. After another hour or so of running I discovered a small bug in my splitter that meant it worked for early archives and not for later archives, probably due to a mailman upgrade or when I switched from sendmail to postfix.

So I fixed that bug, and started again. It’s been running for over an hour now, and seems to be working fine. Well, except for 5 or 6 messages that came in January 2000 that had the year set to “100”, and one place where somebody actually quoted a mailbox header without putting a “>” at the beginning. Minor inconveniences.

I guess tomorrow morning I’ll have to check that the archive is correct, and the nightly backups worked. But for now, I’m going to bed.

Today’s interesting discovery

Sunday, December 3rd, 2006

If you want to cut the first 9854797 lines of one file, and put them at the beginning of a 3023304 line file, opening them both up in vim and cutting and pasting from one to the other is probably not the most efficient way to do it. It’s been pasting for a while, and my load average is up over 8 and I’m using 1.4GB of swap.

Maybe I should kill this and figure out how to do it with awk or something.
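A streaming approach avoids loading either file into an editor. Here is a minimal sketch in Python rather than awk; the function name and arguments are made up for the example. It writes the first N lines of one file followed by the whole of the other into a new file, reading in binary so nothing in the mail gets re-encoded.

```python
import itertools
import shutil


def prepend_head(head_path, n_lines, tail_path, out_path):
    """Write the first n_lines of head_path, then all of tail_path,
    to out_path, without holding either file in memory."""
    with open(out_path, "wb") as out:
        with open(head_path, "rb") as head:
            # islice stops after n_lines lines have been read.
            out.writelines(itertools.islice(head, n_lines))
        with open(tail_path, "rb") as tail:
            # copyfileobj copies in fixed-size blocks.
            shutil.copyfileobj(tail, out)
```

Writing to a third file rather than editing in place means the originals are untouched if the line count turns out to be wrong.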