I made some changes to my FAA data loader script and ran it. Four days later it finished, I'd discovered a few bugs, and I was getting ready to run it again, but first I decided to see if I could improve the speed any. I'd already run the Perl profiler and discovered that 95% of the time was spent in MySQL. So I tried doing an "EXPLAIN" on all the queries. That's when I discovered that one very common query was doing the dreaded "ALL" (a full table scan) on a 12,000-row table as step 1. Hmmm. That table isn't even an important one; it was just a table joined to the main "waypoint" table to get one semi-useful field. The query has "waypoint.datasource_key = ?" in it, so why wasn't it using that first? "DESCRIBE waypoint;" showed me the error of my ways: I'd forgotten to put an index on "datasource_key".
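The actual query isn't reproduced here, but a minimal sketch of the situation looks something like this. Only the "waypoint" table name and the "datasource_key" column come from the post; the joined table, the other column names, and the literal value are made up for illustration:

    -- Hypothetical reconstruction of the problem query; table
    -- "datasource" and the selected columns are invented.
    EXPLAIN
    SELECT w.ident, w.latitude, w.longitude, d.source_name
    FROM waypoint AS w
    JOIN datasource AS d ON d.datasource_key = w.datasource_key
    WHERE w.datasource_key = 42;

    -- With no index on waypoint.datasource_key, one row of the plan
    -- shows type: ALL (a full table scan) instead of a cheap "ref"
    -- lookup on an index.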
So I created the index and started the script before going to bed. I was astonished to discover that the script was done this morning. According to my Munin graphs, it had only taken about 6 hours to run. 96 hours down to 6 hours. Yeah, that’s a worthwhile optimization.
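The fix itself amounts to a one-line DDL statement; the index name below is my own, since the post doesn't give one:

    -- Add the missing index (name is made up).
    CREATE INDEX datasource_key_idx ON waypoint (datasource_key);

    -- Rerunning EXPLAIN after this should show type: ref with the
    -- new index in the "key" column, rather than type: ALL.

That one B-tree turns a per-execution table scan into an indexed lookup, which is where a 96-hours-to-6-hours difference can come from on a query that runs constantly.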
Yup, databases are one area where algorithmic complexity is tangible and can make a difference of several orders of magnitude in performance.
I had a couple of 13-million-row tables I was working with a few years ago and found I had missed an index. Processing went from a solid 20-30 hours on a fairly beefy multi-processor machine to 10 minutes. Basically, it stopped doing a full table scan on every DML statement.