I’d love to help you, buddy, but I’d prefer to keep my license

I got a curious email just now.

Good Afternoon- I got your name through the Rochester Flying Club website. I live in Hemlock and work in Rochester. I am looking for a way to be in two places at once…..I have two children graduating from college on the same day in different parts of New York State. My daughter graduates from [college in upstate NY] at 10:30 in the morning and my son graduates from [another college, near here] at 4pm in the afternoon. There isn’t time to drive between the two, but there might be time to get to them both if we fly. I’m hoping that you or someone you might know would might be interested in flying 4 of us from [first location] to [second location] on the afternoon of May 21st. If you can lend any assistance, I’d love to hear from you.

I feel for the guy, and this sort of need to be in two places at once is a pretty compelling reason to become a pilot. But if I, or any other private pilot, were to take him up on the offer, the FAA would be all over the pilot for offering an illegal charter. Plus there is the little matter that for his four-person family, the pilot needs a six-seater, and at 4pm on the 21st I’m going to be flying the club’s six-seater home from the rec.aviation fly-in.

Still scratching my head.

I’m still working on the problem in Rants and Revelations » That’s a head scratcher.

I wrote the thread spawning test program, and it ran 18,000+ iterations overnight on a test machine without the slightest hesitation. I pored over the code to see if there was a “Dining Philosophers”-style lock contention issue. I examined the logs for other programs on the system. And I’m still no closer.

I have a horrible suspicion that the lock up is actually somewhere in the database code. I also suspect that, instead of using threads and locks to make sure I respond to events quickly but never process more than one at a time, what I really need is a job queue, so that I can monitor each job and, if one is taking too long, just kill it and start the next.
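To make the job queue idea concrete, here is a minimal sketch, not the daemon's actual code. It assumes a single worker thread from java.util.concurrent, and the names JobQueueSketch and runWithTimeout are made up for illustration. One caveat: Java has no safe way to kill a thread outright. Future.cancel(true) only interrupts it, so a job stuck in a call that ignores interrupts (which a database driver call may well do) would still block the jobs queued behind it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: class and method names are illustrative only.
class JobQueueSketch {
    // A single worker thread runs jobs one at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Submits the job and waits up to timeoutMillis for it to finish.
    // Returns true if it completed normally, false if it failed or was cancelled as overdue.
    boolean runWithTimeout(Runnable job, long timeoutMillis) {
        Future<?> pending = worker.submit(job);
        try {
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the job responds to interrupts.
            pending.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The job itself threw an exception.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pending.cancel(true);
            return false;
        }
    }

    void shutdown() {
        worker.shutdownNow();
    }

    public static void main(String[] args) {
        JobQueueSketch queue = new JobQueueSketch();
        // A stand-in for an event job that hangs; sleep() does respond to interrupts.
        boolean slow = queue.runWithTimeout(() -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 100);
        boolean fast = queue.runWithTimeout(() -> { }, 2000);
        System.out.println("slow job completed: " + slow + ", fast job completed: " + fast);
        queue.shutdown();
    }
}
```

In a real daemon the caller would not block on every job the way main does here. A separate dispatcher thread would pull events off a queue and apply the timeout, leaving the event receiver free to respond quickly.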

But of course since I don’t know where the lock up is actually happening nor can I reproduce it, I’m not sure how to know if my changes are going to fix anything.

That’s a head scratcher

I’m working on the type of bug that might take me a day, it might take me a week, or it might cause me to give up entirely.

In our system, there is a process of mine (the schedule daemon) that gets events from another process (the event broker) and does some database manipulation. Because the events can come thick and fast, and because I don’t want them stepping on each other, each event causes a separate thread to be spawned, and the thread action is guarded by a global “synchronized” object (this is in Java, by the way). Most of the time, this works fine – if an event happens while another thread is still processing, the second one waits for the first one to relinquish the lock, and it does its thing. The event processing threads generally take 5-15 seconds to run.
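For readers who want to picture that arrangement, here is a rough sketch under stated assumptions. The names ScheduleDaemonSketch, EVENT_LOCK and onEvent are invented for illustration, and the database work is replaced by appending to a list. It shows only the pattern: one thread per event, all serialized on one global lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the pattern described above; not the real daemon.
class ScheduleDaemonSketch {
    // Assumed global lock object; the real one is not shown in the post.
    private static final Object EVENT_LOCK = new Object();
    private static final List<String> processed = new ArrayList<>();

    // Spawns one thread per event; the synchronized block lets only one run at a time.
    static Thread onEvent(String eventName) {
        Thread t = new Thread(() -> {
            synchronized (EVENT_LOCK) {
                // Stand-in for the database manipulation.
                processed.add(eventName);
            }
        }, "event-" + eventName);
        t.start();
        return t;
    }

    // Fires a few events and returns the names that were processed.
    static List<String> runDemo() {
        synchronized (EVENT_LOCK) {
            processed.clear();
        }
        List<Thread> threads = new ArrayList<>();
        for (String name : new String[] {"a", "b", "c", "d"}) {
            threads.add(onEvent(name));
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for event threads", e);
        }
        synchronized (EVENT_LOCK) {
            return new ArrayList<>(processed);
        }
    }

    public static void main(String[] args) {
        System.out.println("Processed " + runDemo().size() + " events");
    }
}
```

The weakness of this design is visible in the sketch: if any one thread never leaves the synchronized block, every later event thread waits on EVENT_LOCK forever. Note also that intrinsic locks make no promise about the order in which waiting threads get their turn.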

But I have a log file from a customer site, where it appears that one of these event threads started at 04:17, and never finished and never relinquished the lock. So events that happened at 04:51, 05:52 and 06:01 never got processed. And I can’t for the life of me figure out why.

I’ve looked extensively at the code between the last progress message from the 04:17 thread and the progress message it should have printed next. Nothing leaps out at me. And like I said, this code works all over the place, even at this customer site most of the time.

One possibility is that some other program is manipulating the database at the time. I do know that the Playlist that was being retrieved at the time of failure is not present in the next day’s backup, so something may have been deleting it at the time.

I wrote a program that repeatedly calls the same database method as the one that hung, and ran it in a continuous loop while doing other stuff to the database, including deleting the playlist in question. But while I’ve gotten my test program to fail with an exception numerous times, it never hangs. (I’m assuming that if a thread dies with an exception, it will release its locks. Something else to investigate, I guess.)
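That assumption about exceptions and locks can be checked with a small experiment. For a Java "synchronized" block, the monitor is released when an exception propagates out of the block. The sketch below demonstrates it, with made-up names (MonitorReleaseDemo, lockFreeAfterException) and a simulated failure in place of the real database call.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical experiment; names and the simulated failure are illustrative only.
class MonitorReleaseDemo {
    private static final Object LOCK = new Object();

    // Returns true if a second thread can take LOCK after the first thread
    // died with an exception while holding it.
    static boolean lockFreeAfterException() {
        Thread failing = new Thread(() -> {
            synchronized (LOCK) {
                throw new IllegalStateException("simulated database failure");
            }
        });
        // Suppress the default stack trace for the expected failure.
        failing.setUncaughtExceptionHandler((thread, error) -> { });
        failing.start();

        AtomicBoolean acquired = new AtomicBoolean(false);
        Thread second = new Thread(() -> {
            synchronized (LOCK) {
                acquired.set(true);
            }
        });
        // Daemon thread, so a stuck lock could not keep the JVM alive.
        second.setDaemon(true);
        try {
            failing.join();
            second.start();
            // Bounded wait: if the lock were still held, we would give up after 2 seconds.
            second.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return acquired.get();
    }

    public static void main(String[] args) {
        System.out.println("lock free after exception: " + lockFreeAfterException());
    }
}
```

This holds for "synchronized" only. An explicit java.util.concurrent.locks.Lock is not released automatically, and stays locked unless unlock() is called in a finally block.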

I guess my next step is to step up the tree a bit, and instead of calling the low level query multiple times, try spawning the thread that hung multiple times. Other than that, I’m baffled.