A week or two ago, somebody from customer support came to me because certain customer sites were having some weird problems: they weren’t seeing new content even though it had been delivered, and they couldn’t update their schedules. She reported that restarting our services fixed it on most sites, but on one she’d had to reboot. Unfortunately there wasn’t anything obvious in the logs. The weird thing is that she said the problem started when our content provider moved to their new servers, but the content provider says they didn’t change what they were sending, just which servers they were sending it from.
After a few days, she managed to find one site that was still having the problem, and I poked around and still couldn’t find much, except for the fact that if I went into the postgres command line, psql, it would let me do anything on the database except query one particular table. Anything I tried against that table would just freeze up. Hmmm. Lacking any other ideas, I shut down the database server and restarted it, and that cleared up the problem. But shutting down the database server also kills off any processes that might be using the database, so the restart didn’t tell me much about who was really at fault. I was starting to think that two processes were in a deadlock over this one particular table. I filed away the information and asked her to call me if it happened on any other sites.
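In hindsight, a less painful way to poke at a suspect table from psql is to set a statement_timeout first, so the probe errors out instead of hanging forever. A rough sketch, with a made-up table name since the real schema is ours:

    -- give up after 5 seconds instead of blocking forever behind a lock
    SET statement_timeout = 5000;
    -- any trivial query against the stuck table works as a probe
    SELECT count(*) FROM content_schedule;
    -- back to normal for the rest of the session
    RESET statement_timeout;

If the count comes back right away the table is fine; if the query gets cancelled after five seconds, something is sitting on a lock, and at least you haven’t left yet another blocked session lying around.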
This morning, she comes and says it’s happened on several sites at once. She logged me into one of the sites, and sure enough psql would block if I tried to select from that same damn table. Time to dig a little deeper. “select * from pg_locks” showed a couple of exclusive locks. Hmmm. Doing a “ps auwwfx” showed that there was a vacuumdb going on. Uh oh. It’s the nightly backup scripts. A couple of years ago (21Sep2005), I threw a call to “vacuumdb --analyze --full” in there. A bit of googling showed that, duh, you’re not supposed to run “--full” while anything else is going on, because it takes an exclusive lock on each table it works on. And the JDBC driver has to use some sort of weird work-around for the fact that Postgres doesn’t give it an easy way to turn autocommit on, so it does the equivalent of a “commit;begin;” after every command. That causes some locking of its own, which clashes with the vacuumdb locking.
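For the record, the pg_locks output is a lot easier to read if you join it against pg_class so you get table names instead of bare OIDs. Roughly this, though the exact columns vary a bit between Postgres versions, so treat it as the general idea rather than gospel:

    -- every table-level lock: which table, which backend, what lock mode,
    -- and whether the lock has actually been granted yet
    SELECT c.relname, l.pid, l.mode, l.granted
      FROM pg_locks l
      JOIN pg_class c ON c.oid = l.relation
     WHERE l.relation IS NOT NULL
     ORDER BY c.relname, l.granted;

The vacuumdb backend shows up holding its exclusive lock with granted = true, and everything else that wants that table piles up behind it with granted = false.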
In all this time, it’s never caused any problems, or at least none with any regularity, but evidently the server relocation has caused some content ingestion to happen at the exact time this vacuumdb is going on, which is what’s giving these particular sites their semi-regular problems.
I told them to go around to all the customer sites, edit /etc/init.d/backup_cos_files, and remove the “--full” from the vacuumdb command line, and all should be well. I can’t believe I made such a dumb mistake, and that it didn’t cause any problems until now. Actually, I’m sort of hoping it will also solve my other mysterious database lock-up, the one that was only happening about once a year per site. But that’s probably too much to hope for.
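For reference, the change to the nightly script boils down to dropping one flag. I’m reconstructing the line from memory, and the real one has the database name and connection options on it, but it’s essentially:

    # old: --full takes an exclusive lock on each table for as long as it's working on it
    vacuumdb --analyze --full
    # new: a plain vacuum plus analyze takes a much lighter lock and can run
    # alongside normal queries
    vacuumdb --analyze

The plain vacuum still marks dead rows as reusable and updates the planner statistics, which is all the nightly job ever really needed.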