I’m working on the type of bug that might take me a day, it might take me a week, or it might cause me to give up entirely.
In our system, there is a process of mine (the schedule daemon) that gets events from another process (the event broker) and does some database manipulation. Because the events can come thick and fast, and because I don’t want them stepping on each other, each event causes a separate thread to be spawned, and the thread action is guarded by a global “synchronized” object (this is in Java, by the way). Most of the time, this works fine – if an event happens while another thread is still processing, the second one waits for the first one to relinquish the lock, and it does its thing. The event processing threads generally take 5-15 seconds to run.
But I have a log file I from a customer site, where it appears that one of these event threads started at 04:17, and never finished and never relinquished the lock. So events that happened at 04:51, 05:52 and 06:01 never got processed. And I can’t for the life of me figure out why.
I’ve looked extensively at the code between the last progress message from the 04:17 thread and the progress message it should have printed next. Nothing leaps out at me. And like I said, this code works all over the place, even at this customer site most of the time.
One possibility is that some other program is manipulating the database at the time. I do know that the Playlist that was being retrieved at the time of failure is not present in the next day’s backup, so something may have been deleting it at the time.
I wrote a program that calls the same database method as the one that hung over and over again, and ran that in a continuous loop while doing other stuff to the database including deleting the playlist in question. But while I’ve got my test program to fail with an exception numerous times, it never hangs. (I’m assuming that if a thread dies with an exception, it will release its locks. Something else to investigate, I guess.)
I guess my next step is to step up the tree a bit, and instead of calling the low level query multiple times, try spawning the thread that hung multiple times. Other than that, I’m baffled.
I feel your pain Paul, I’ve hit the “that’s it, I’m giving up programming for ever” point so many times… And threads were usually involved!
No idea how debugging Java compares to debugging Perl, but I hope for your sanity’s sake you suss it out swiftly, or at least more swiftly than I usually do!
Damn hot here by the way. Should have packed some English weather…
Barry, it’s hot in Thailand? Can I be the first to say “no shit, Sherlock”?
I could never live in such a climate.
Sorry, obviously I knew it would be hot, but hadn’t quite mentally prepared myself for the intensity. Probably the same short-sightedness that plagues me with debugging!
I asked a Canadian friend who’s been here 8 years how long it takes to acclimatise. When she’d finished laughing, she assured me that I will never, ever do so… Still, it’s due to rain sometime in June.
Drifting back on topic, you assumed that if a thread dies with an exception, it will release its locks. Usually it’s my assumptions that are at fault. Not suggesting yours is, but it should be simple enough to rule it out…
No, I checked the documentation and synchronized(Object) locks are supposed to be released as soon as control goes out of scope, especially with exceptions.
Right now I’m running a test with a program that continually spawns these threads, and another program that deletes the playlists that the thread is loading. So far, after 700+ iterations, no dice.