What is Hitachi thinking?

For years now, whenever I’ve had drive or controller problems, I’ve hauled out IBM’s DFT (Drive Fitness Test), even if the drive isn’t a Deathstar, er, Deskstar. Now IBM’s drive division belongs to Hitachi, but DFT lives on. I used it last week to make sure my new colo box could handle the sorts of loads I wanted to put on it. But now that I have my old colo box back, I want to test it to see if the problems I was having might be fixed with a new drive cable before I sell it on eBay.

But this box doesn’t have a floppy. No problem, I thought, the Hitachi site has a bootable CD version. So I downloaded it and burned it and booted with it. But the first thing it does is scan the IDE controllers, and when it’s scanning “Secondary Slave”, it suddenly starts spewing errors about being unable to read A:\COMMAND.COM. Evidently DFT needs to read its own disk just at the moment that the drive is disconnected for scanning. So when they made the CD ISO, they didn’t actually test it, or didn’t think about how it works, and instead of using the “Linux Live CD” model where they make a ramdisk and load themselves into it, they just made a DOS boot partition on the CD and expect it to be there all the time.

I guess it’s off to my junk shelf to see if I have a floppy drive and cable.

And the productivity hits just keep on coming

Evidently it’s company policy that working at home must be requested 24 hours in advance, in writing. So if I find myself unable to come into work for some reason, they want me to stay home and do something else rather than doing useful work on our project. Well, I’ll miss the money, but I think they’re going to miss the work more.

Ice day

When this morning’s alarm clock went off, the radio was saying that Vicki’s place of work was closed because of an overnight ice storm. I looked outside and there was a good half-inch of clear ice on the trees, roads, and my car. And the local news web sites said that the state police were telling people not to drive if they could avoid it.

So I thought about what I’d be doing if I went to work, and it was just working on design documents. I have most of the documents I needed at home, so I thought “screw it” and decided to stay home.

I wanted to email people to tell them that I was going to do that, but I only had a few of their addresses. So I emailed the ones I had, and one of them emailed my new official direct supervisor (she signs my time sheets, even though I really get my job assignments and direct supervision from somebody else).

She wrote me back. She’s evidently mad that I didn’t follow her new procedure and phone her for permission *before* I decided to stay home. In the past, I’ve always been trusted to work at home if I had work that could be done at home, so this seems like a real lack of trust on her part. But then again she’s new to the project and doesn’t know any of us that well – plus she has little to no day-to-day contact with us developers, so maybe she doesn’t know us well enough to know who to trust.

So instead of having a nice day at home where I could work productively but in a relaxed environment, I had to struggle to produce work while worrying if I’d just jeopardized my job.

Just for the record, I got more work done than I would have if I’d been at work.

The ultimate Heisenbug?

We’ve got a problem that happens apparently at random times at a few customer sites, but which we’ve been unable to reproduce in the lab. I’m not sure if that means it’s a Heisenbug or just a really nasty Bohr-bug.

The affected part of the system consists of three programs:

  • One that generates events, called “tixd”
  • One that is responsible for collecting events from all the programs in the system (not just these three) and delivering them to subscribers, called the “EventBroker” or “eb”
  • One that subscribes to the events that the “tixd” generates, which we call the “scheduled”
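
For anyone who wants the shape of that arrangement in code, here is a minimal sketch of a publish/subscribe broker. It is a toy stand-in, not the real “eb”: the class, the method names, and the "tix" event type are all made up for illustration, and everything runs in one process instead of three.

```python
from collections import defaultdict


class EventBroker:
    """Toy stand-in for the "eb" (hypothetical API, not the real one)."""

    def __init__(self):
        # event type -> {subscriber name: callback}
        self.subscribers = defaultdict(dict)

    def subscribe(self, event_type, name, callback):
        self.subscribers[event_type][name] = callback

    def unsubscribe(self, event_type, name):
        # Removing a subscriber that is not registered is a no-op.
        self.subscribers[event_type].pop(name, None)

    def publish(self, event_type, payload):
        """Deliver payload to every subscriber; return the names delivered to."""
        delivered = []
        # Iterate over a copy so callbacks may subscribe or unsubscribe safely.
        for name, callback in list(self.subscribers[event_type].items()):
            callback(payload)
            delivered.append(name)
        return delivered


# "scheduled" subscribes, then "tixd" publishes one event through the broker.
received = []
eb = EventBroker()
eb.subscribe("tix", "scheduled", received.append)
delivered = eb.publish("tix", {"id": 1})
```

An empty list back from `publish` is this sketch's version of the broker saying it found no subscribers.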

What has been happening on these customer sites is that after days or weeks of proper operation, for no apparent reason, the “tixd” would say that it’s generating an event, but the “scheduled” wasn’t getting them any more. The customer would notice the problem, sometimes a day or two later, complain that things weren’t happening that were supposed to happen, our service people would restart the whole system, and everything would start working again.

This bug has been happening for ages now, and every time I get called in to look at their logs because I wrote the “scheduled” and all the fingers point to me. But I couldn’t find any reason why “scheduled” would stop responding to events, or would unsubscribe from events. A few builds ago, Tom put some debug logging into his “eb” that would record every event that came in and which subscribers it was being delivered to. He also logged subscribes and unsubscribes. And so we waited.

Today, it finally happened again. And this time, I’ve got the logs that show:

  • At 6am, an event is generated by the tixd, and the eb delivers it to the subscriber scheduled
  • Between 10am and 11am, there is a flurry of event subscribes and unsubscribes, all unrelated to scheduled. But some of these unsubscribes are caused when events are being delivered to subscribers that have exited without unsubscribing.
  • At about 1am, there is another event generated by the tixd, and the eb receives it but says there are no subscribers found.
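
The reasoning behind reading the log that way can be sketched as a replay: rebuild the subscriber list from the logged subscribes and unsubscribes, then compare it with what the broker says it delivered to. This is only an illustration. The record layout, the function name, and the sample entries are made up, and the real eb log looks nothing like this.

```python
def find_lost_subscribers(log):
    """Replay broker log records and flag deliveries that skipped a subscriber.

    Each record is (timestamp, action, event_type, names), a hypothetical
    layout. For "subscribe" and "unsubscribe", names holds the subscriber
    concerned; for "deliver", it holds everyone the broker reported
    delivering the event to.
    """
    expected = {}    # event type -> subscribers according to the log so far
    mismatches = []
    for timestamp, action, event_type, names in log:
        subs = expected.setdefault(event_type, set())
        if action == "subscribe":
            subs.update(names)
        elif action == "unsubscribe":
            subs.difference_update(names)
        elif action == "deliver":
            # Anyone still subscribed per the log but not delivered to was
            # dropped without a logged unsubscribe.
            missing = subs - set(names)
            if missing:
                mismatches.append((timestamp, event_type, sorted(missing)))
    return mismatches


# Made-up records that mirror the sequence described above.
log = [
    ("day1 05:00", "subscribe", "tix", ("scheduled",)),
    ("day1 06:00", "deliver", "tix", ("scheduled",)),
    ("day1 10:30", "unsubscribe", "other", ("someclient",)),
    ("day2 01:00", "deliver", "tix", ()),
]
mismatches = find_lost_subscribers(log)
```

A subscriber that shows up in `mismatches` vanished from the broker's table without ever asking to leave, which is what points the finger at the broker and not at the subscriber.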

At this point, because the eb log shows no unsubscribe coming from scheduled, I’d say it’s not my bug and pass it off to Tom, the author of the eb. But unfortunately, my employer declined to renew Tom’s contract at the end of last year, so he no longer works here. He dodged this bullet by only 5 days. And so I’ve got to figure out why this is happening. Lucky me.