Still scratching my head.

I’m still working on the problem in Rants and Revelations » That’s a head scratcher.

I wrote the thread spawning test program, and it ran 18,000+ iterations overnight on a test machine without the slightest hesitation. I pored over the code to see if there was a “Dining Philosophers”-style lock contention issue. I examined the logs for other programs on the system. And I’m still no closer.

I have a horrible suspicion that the lock-up is actually in the database code somewhere. I also suspect that instead of using threads and locks to make sure I respond to events quickly but don’t process more than one event at a time, what I really need is a job queue, so that I can monitor whether a job is taking too long and, if it is, kill it and start the next one.

But of course, since I don’t know where the lock-up is actually happening and can’t reproduce it, I’m not sure how I’ll know whether my changes fix anything.
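For what it's worth, here is a minimal sketch of the job queue idea, not the real daemon code. The class name `TimedJobQueue` and the method `runJob` are made up for illustration. It assumes a single worker thread and a dispatcher that calls `runJob` for each queued event in turn. One big caveat: `Future.cancel(true)` only interrupts the worker, so a job stuck in a call that ignores interrupts (which some database drivers do) would still occupy the worker.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: run one job at a time and give up on any job that
// takes longer than the timeout.
class TimedJobQueue {
    // A single daemon worker thread, so jobs never overlap and a stuck job
    // cannot keep the JVM alive on its own.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "job-worker");
        t.setDaemon(true);
        return t;
    });
    private final long timeoutMillis;

    TimedJobQueue(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Runs the job on the worker and waits at most timeoutMillis for it.
    // Returns true only if the job finished normally within the timeout.
    boolean runJob(Runnable job) {
        Future<?> future = worker.submit(job);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Assumption: the job responds to interruption. If it does not,
            // the worker stays busy and later jobs will time out as well.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the job itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        }
    }

    void shutdown() {
        worker.shutdownNow();
    }
}
```

The useful property is that a timeout at least gets noticed and can be logged, instead of every later event silently waiting on a lock that never comes free.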

That’s a head scratcher

I’m working on the type of bug that might take me a day, it might take me a week, or it might cause me to give up entirely.

In our system, there is a process of mine (the schedule daemon) that gets events from another process (the event broker) and does some database manipulation. Because the events can come thick and fast, and because I don’t want them stepping on each other, each event causes a separate thread to be spawned, and the thread action is guarded by a global “synchronized” object (this is in Java, by the way). Most of the time, this works fine – if an event happens while another thread is still processing, the second one waits for the first one to relinquish the lock, and it does its thing. The event processing threads generally take 5-15 seconds to run.

But I have a log file from a customer site, where it appears that one of these event threads started at 04:17, never finished, and never relinquished the lock. So events that happened at 04:51, 05:52 and 06:01 never got processed. And I can’t for the life of me figure out why.
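The arrangement described above can be sketched roughly like this. It is not the actual schedule daemon: `EventSerializer`, `onEvent`, `handle` and `awaitAll` are hypothetical names, and the database work is replaced by appending to a list. It does show why one thread that never leaves the `synchronized` block stalls every later event, since they all queue up on the same monitor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "one thread per event, one global lock".
class EventSerializer {
    // The single global lock that every event thread must hold while working.
    private static final Object LOCK = new Object();

    // Stand-in for the database state; only touched while holding LOCK.
    private final List<String> processed = new ArrayList<>();

    // Called for each incoming event: spawn a thread and return immediately.
    Thread onEvent(String eventName) {
        Thread t = new Thread(() -> {
            synchronized (LOCK) {
                // If this call never returned, every later event thread
                // would wait here forever.
                handle(eventName);
            }
        }, "event-" + eventName);
        t.start();
        return t;
    }

    // Placeholder for the real work (assumed to take 5-15 seconds in the daemon).
    private void handle(String eventName) {
        processed.add(eventName);
    }

    // Snapshot of the events handled so far.
    List<String> processed() {
        synchronized (LOCK) {
            return new ArrayList<>(processed);
        }
    }

    // Helper for demos and tests: wait for the given event threads to finish.
    static void awaitAll(List<Thread> threads) {
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Note that a plain `synchronized` block has no timeout, so a waiting thread has no way to give up and report that it has been waiting since 04:51.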

I’ve looked extensively at the code between the last progress message from the 04:17 thread and the progress message it should have printed next. Nothing leaps out at me. And like I said, this code works all over the place, even at this customer site most of the time.

One possibility is that some other program is manipulating the database at the time. I do know that the Playlist that was being retrieved at the time of failure is not present in the next day’s backup, so something may have been deleting it at the time.

I wrote a program that calls the same database method that hung, over and over in a continuous loop, and ran it while doing other stuff to the database, including deleting the playlist in question. But while I’ve got my test program to fail with an exception numerous times, it never hangs. (I’m assuming that if a thread dies with an exception, it will release its locks. Something else to investigate, I guess.)
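That assumption about exceptions is easy to check in isolation. In Java, leaving a `synchronized` block by throwing releases the monitor just as leaving it normally does. The small sketch below (the class and method names are made up) has one thread throw while holding a lock, then checks that a second thread can still acquire it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical check: does an exception inside synchronized release the lock?
class LockReleaseDemo {
    private static final Object LOCK = new Object();

    static boolean lockIsFreeAfterException() {
        // First thread: dies with an exception while holding the lock.
        Thread failing = new Thread(() -> {
            synchronized (LOCK) {
                throw new IllegalStateException("simulated failure while holding the lock");
            }
        });
        // Swallow the uncaught exception so the demo output stays quiet.
        failing.setUncaughtExceptionHandler((thread, error) -> { });

        // Second thread: records whether it managed to take the same lock.
        AtomicBoolean acquired = new AtomicBoolean(false);
        Thread next = new Thread(() -> {
            synchronized (LOCK) {
                acquired.set(true);
            }
        });
        next.setDaemon(true);

        try {
            failing.start();
            failing.join();   // wait until the failing thread is dead
            next.start();
            next.join(2000);  // give up after two seconds if the lock is stuck
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return acquired.get();
    }

    public static void main(String[] args) {
        System.out.println("lock free after exception: " + lockIsFreeAfterException());
    }
}
```

So an exception on its own would not explain a lock that is never released. A thread that blocks forever inside the block, without throwing, would.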

I guess my next step is to step up the tree a bit, and instead of calling the low level query multiple times, try spawning the thread that hung multiple times. Other than that, I’m baffled.

Stupid TiVo, stupid Netgear

Our two TiVos are networked together over our wireless network. This allows us to transfer shows between the two, as well as doing the daily call over the internet. But the wireless dongle on the upstairs TiVo occasionally drops its connection. In the past, I’ve just unplugged the dongle, let it cool down a bit, plugged it back in, and everything was fine. But twice in the last two days, when I unplugged the dongle, the whole TiVo rebooted. That’s not good.

I wonder if it’s time to buy a new dongle. Or maybe I should just hold out for the TiVo Series III.