I’m still working on the problem in Rants and Revelations » That’s a head scratcher.
I wrote the thread spawning test program, and it ran 18,000+ iterations overnight on a test machine without the slightest hesitation. I pored over the code to see if there was a “Dining Philosophers”-style lock contention issue. I examined the logs for other programs on the system. And I’m still no closer.
I have a horrible suspicion that the lock up is actually in the database code somewhere. And also, that instead of using threads and locks to make sure I respond to the events quickly but don’t do more than one event at a time, what I really need is an job queue, so I can monitor if a job is taking too long, just kill it and start the next.
But of course since I don’t know where the lock up is actually happening nor can I reproduce it, I’m not sure how to know if my changes are going to fix anything.
I vote for the job queue. If nothing else, create a thread pool, and feed it from the jobs. Everytime you get a job, check the threads. If one is idle, give them the job. If not, check to see if any are hung, kill it, then restart it and give it the job. And so on…
We’ve found the best way to find lock problems is a multi-processor machine. We’ve got one or two quads that get loaned out just to debug this sort of thing.
What are your database changes like? if you’ve more than one insert or update in a job can they occur in different sequences? e.g. insert from tableA followed by a delete from tableB and another job task requires an insert into tableB followed by an insert into tableA?
What are you using as a database? can you monitor its lock activity?
Richard, that’s a good point. The job that’s actually freezing is just doing queries, no updates, but there are other processes going on. But there are some other processes and one of them does some deletes from the same tables that this one is deleting. Also, there may be a “vacuum analyze;” going on at the same time.
We’re using PostgresSQL (as you might have guessed by the “vacuum analyze” thing). I’ll have to look more closely into what else might be going on in that table, and how to monitor locks. I’m pretty sure we turned off autocommit, but we don’t actually use transactions or explicit commits anywhere. I was thinking of doing so, but I haven’t figured out the best way to do this – whether to make one big transaction around major activities like “the nightly expire run” or “handle this event”, or small transactions around single operations like “delete this playlist and all the schedules that rely on it”.