Rant – Page 96 – Rants and Revelations

The ultimate Heisenbug?

We’ve got a problem that happens apparently at random times at a few customer sites, but which we’ve been unable to reproduce in the lab. I’m not sure if that means it’s a Heisenbug or just a really nasty Bohr-bug.

The affected part of the system consists of three programs:

One that generates events, called “tixd”
One that is responsible for collecting events from all the programs in the system (not just these three) and delivering them to subscribers, called the “EventBroker” or “eb”
One that subscribes to the events that the “tixd” generates, which we call the “scheduled”

What has been happening on these customer sites is that after days or weeks of proper operation, for no apparent reason, the “tixd” would say that it was generating events, but the “scheduled” wasn’t getting them any more. The customer would notice the problem, sometimes a day or two later, and complain that things that were supposed to happen weren’t happening. Our service people would restart the whole system, and everything would start working again.

This bug has been happening for ages now, and every time I get called in to look at the logs because I wrote the “scheduled” and all the fingers point to me. But I could never find any reason why “scheduled” would stop responding to events, or would unsubscribe from them. A few builds ago, Tom put some debug logging into his “eb” to record every event that came in and which subscribers it was delivered to. He also logged subscribes and unsubscribes. And so we waited.

Today, it finally happened again. And this time, I’ve got the logs that show:

At 6am, an event is generated by the tixd, and the eb delivers it to the subscriber scheduled
Between 10am and 11am, there is a flurry of event subscribes and unsubscribes, all unrelated to scheduled. But some of these unsubscribes are caused when events are being delivered to subscribers that have exited without unsubscribing.
At about 1am, there is another event generated by the tixd, and the eb receives it but says there are no subscribers found.
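To make the moving parts concrete, here is a minimal sketch of a broker with the behaviour described above: explicit subscribes and unsubscribes, an implicit unsubscribe when delivery hits a subscriber that has exited, and a log line for every one of those. This is not Tom's code. The class, the method names, the `DeadSubscriber` exception and the log wording are all made up for illustration. It only shows the property the logs seem to contradict: a subscriber should vanish from the table only via a logged unsubscribe.

```python
import logging
from collections import defaultdict

log = logging.getLogger("eb")  # hypothetical logger name


class DeadSubscriber(Exception):
    """Hypothetical: raised by a callback whose owning process has exited."""


class EventBroker:
    """Toy stand-in for the real eb; every name here is an assumption."""

    def __init__(self):
        # event name -> {subscriber name: delivery callback}
        self._subs = defaultdict(dict)

    def subscribe(self, event, name, callback):
        log.info("subscribe %s -> %s", event, name)
        self._subs[event][name] = callback

    def unsubscribe(self, event, name, reason="explicit"):
        # Remove exactly one (event, name) entry and always log it.
        if self._subs[event].pop(name, None) is not None:
            log.info("unsubscribe (%s) %s -> %s", reason, event, name)

    def publish(self, event, payload):
        # Copy the table so pruning during delivery cannot skip anyone.
        targets = list(self._subs[event].items())
        if not targets:
            log.warning("event %s: no subscribers found", event)
            return []
        delivered = []
        for name, callback in targets:
            try:
                callback(payload)
            except DeadSubscriber:
                # Subscriber exited without unsubscribing: prune only it.
                self.unsubscribe(event, name, reason="implicit")
            else:
                log.info("delivered %s -> %s", event, name)
                delivered.append(name)
        return delivered


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def ghost(payload):
        raise DeadSubscriber()

    eb = EventBroker()
    eb.subscribe("tix", "scheduled", lambda payload: None)
    eb.subscribe("tix", "ghost", ghost)
    eb.publish("tix", "first")   # prunes ghost, still reaches scheduled
    eb.publish("tix", "second")  # scheduled is still subscribed
```

In a broker shaped like this, “no subscribers found” with no preceding unsubscribe line for scheduled would mean the table was changed by some path that doesn't log, which is where I'd start looking.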

At this point, because the eb log shows no unsubscribe coming from scheduled, I’d say it’s not my bug and pass it off to Tom, the author of the eb. But unfortunately, my employer declined to renew Tom’s contract at the end of last year, so he no longer works here. He dodged this bullet by only 5 days. And so I’ve got to figure out why this is happening. Lucky me.

Another damn failure on my colo box

Yesterday I had a panic attack – suddenly a certain repository of binary files was empty where it had been nearly 300GB a few hours previously. I knew that Vicki was uploading some stuff to it that day using an “ncftpput” command that I’d shown her, and I knew she didn’t understand what each command line argument meant. So I’m sorry to say my first reaction was “I bet she somehow wiped it”. But I looked in her ~/.ncftp/spool/log file and couldn’t see anything unusual. I guess I owe her an apology for that thought.

I looked on the domU, and “df” showed the partition still mounted, and still 91% full. But “ls” showed nothing. I unmounted it and shut off the nightly backups so that the next run wouldn’t delete the files from the backup. “fsck.ext3 /dev/hdb” gave an error about a zero length partition. Then I thought I should probably be doing this on the dom0, and so I logged into it and got the same error with “fsck.ext3 /dev/hdb1”. “fdisk -l /dev/hdb” on that entire drive showed that it didn’t think the drive was there at all. Uh oh. Moment of panic time – one of the other domUs has some of its disk space on that drive as well, thanks to LVM. I wondered what was screwing up on his domU if it couldn’t get to some of its disk space. Time to shut them all down and reboot.

I did an “xm console xen1” to connect to my domU and that’s where I saw the oh-so-familiar ext3 errors. But everything shut down relatively cleanly and rebooted. I saw one message in the log files about resetting the ide0 controller, and I’m not sure whether that was a cause or a symptom of the problem. And after the reboot, all 300GB of files were back. Thank goodness, because the upload bandwidth I’ve got at home these days means it would have taken months to get that partition restored from my backup.

The partition that screwed up this time is a normal primary disk partition, not an LVM logical volume, and it’s on a different physical drive than the other failures, so at least I’ve eliminated LVM and the disk hardware as a cause. But that leaves the IDE controller and Xen.

I can’t wait for my new 1U server to come. Still not sure whether I should try Xen again or VMware. VMware probably isn’t as fast and it’s a lot more difficult to manage without getting the for-pay version, but at least it’s “ready for prime time”.

What a pain in the ass

This morning while perusing my logwatch mails I saw a strange result from the script that is supposed to email me the day’s changes from my DAFIF Replacement wiki. It was complaining about a missing Perl module in the twiki/bin directory. So I looked, and the twiki/bin directory was totally empty.

Some lowlife found a vulnerability in TWiki, and used it to remove everything in twiki/bin. I guess I should count myself lucky that he didn’t find any way to remove or corrupt other files that were writable by the web server, since he managed to do it *before* the nightly backup ran.

I was running a pretty ancient version of TWiki, so it was probably long past time to upgrade. The upgrade to 4.0.5 seems to have been pretty painless. But it’s not what I wanted to be doing this morning.

1992/2006

In 1992, I worked for a company called GeoVision. I’d worked there for 6 years, but they were having financial problems. In each of the previous two quarters, the end of the quarter had been the time when they announced layoffs. And just like those two times, the bean counters from both the Ottawa and Denver offices were huddled together the day before, and this time they came around with a list and told everybody whether they had to go to the 2pm meeting or the 3pm meeting. I was invited to the 2pm meeting. It turned out that everybody invited to the 2pm meeting was laid off, and the 3pm meeting was to announce that they’d had to do this to ensure the continued health of the company (it didn’t work – 6 months later they were out of business).

Now flash forward to 2006. I’m on a contract at $EMPLOYER. I’ve been there for 4.5 years on this contract, and I was on a previous contract in the same office for 3 years. $EMPLOYER, as everybody knows, has been shrinking for decades. And they announced that our group (Entertainment Imaging) has to shrink by 10%: they’ve offered the voluntary retirement package (called “getting tapped”) to certain eligible job categories, and next year, if they haven’t met their targets, they’ll fire some people. The group is also becoming part of the Film Products Group (which really inspires confidence that our digital project is going to be a high priority). And then today, just to make my heart rate soar, they announced that there are problems extending our contracts, and the boss set up a series of meetings to “talk with each of you on Friday regarding our decision to extend your contract or not for 2007”. And I got one of the early ones.

Can you tell I’m not going to sleep well tonight?

Dammit!

What the hell is wrong with my colo box? For the second time in 10 days, it has gotten all weird on me and needed a reboot. This time, my “tail -F” on the various log files on my main domU was showing all sorts of ext3 errors. An attempt to log into the dom0 to reboot it got the now-dreaded:

ssh_exchange_identification: Connection closed by remote host

I had to call Annexa to power cycle it.

This is ridiculous. Is it the machine? The disk? The combination of Xen and LVM? I’m not finding any clues in the logs.
