I may need to rethink this…

I am currently working on a new data source for the waypoint generator. Unfortunately, because of the way it’s licensed, it’s only going to be available for the iPhone version of CoPilot; I can’t make it available to GPX and other users. All of my data loaders have, up until now, been written in Perl, and I have a really good Perl module that handles many of the loading tasks, such as merging existing data with new data.

The new data comes in the form of a gigantic XML file with a somewhat weird schema. The provider supplies both the gigantic file and a smaller set of updates on the 28-day cycle favoured by the ICAO, so hopefully I’ll only have to parse the gigantic file once and then process the updates. I installed XML::SAX and Expat, and coded up a preliminary decoder to extract some (but not all) of the information I need, just to make sure I was doing it right. I ran it with a subset of the data and it seemed to be doing ok, and then, just for grins while I was working on improving the code, I fired it off on the whole file. That was 3 days (72 hours) ago. It’s still running. Unfortunately I didn’t put in any progress messages, so I don’t know where it is in the file, only that it’s past the airport section that I care about. I profiled the subset data, and verified that Perl is spending most of its time in Perl code, not in native code – some of it mine, some of it in XML::SAX, and some of it in Moose.
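
For anyone curious, the decoder is just a SAX handler. Here’s a stripped-down sketch of the shape of it – not the real code, and the element names “Airport” and “name” are placeholders since the actual schema is a lot hairier – plus the progress message I now wish I’d included:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of a SAX handler in the style of the real decoder.
    package EADHandler;
    use base qw(XML::SAX::Base);

    sub new {
        my ($class) = @_;
        my $self = $class->SUPER::new();
        $self->{count} = 0;     # elements seen, for progress reporting
        $self->{text}  = '';    # accumulated character data
        return $self;
    }

    sub start_element {
        my ($self, $el) = @_;
        # Progress message every 100k elements -- the one thing I skipped.
        warn "processed $self->{count} elements\n"
            if ++$self->{count} % 100_000 == 0;
        $self->{text} = '';
        $self->{current} = {} if $el->{LocalName} eq 'Airport';
    }

    sub characters {
        my ($self, $data) = @_;
        $self->{text} .= $data->{Data};
    }

    sub end_element {
        my ($self, $el) = @_;
        return unless $self->{current};
        if ($el->{LocalName} eq 'name') {
            $self->{current}{name} = $self->{text};
        }
        elsif ($el->{LocalName} eq 'Airport') {
            # hand the finished record off to the merge/validation module
            # merge_waypoint($self->{current});   # hypothetical
            delete $self->{current};
        }
    }

    package main;
    use XML::SAX::ParserFactory;

    my $file = shift @ARGV or die "usage: $0 ead_file.xml\n";
    my $parser = XML::SAX::ParserFactory->parser(Handler => EADHandler->new);
    $parser->parse_uri($file);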

So here’s the conundrum: Do I spend the time to re-write this loader code in another language and hope it’s faster? Or do I accept that this is going to take forever, but hopefully I’ll only have to do it once, and then the updates will be small enough that I can do them in Perl? Re-writing in another language means re-writing all the data merging and validation logic, which could be a huge project. And I won’t know until it’s all working whether it’s going to be faster.

Update: I profiled the Perl program with a semi-large dataset. Here are the results:

dprofpp
Total Elapsed Time = 56.86461 Seconds
User+System Time = 46.10461 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
20.5 9.494 23.288 397862 0.0000 0.0001 XML::SAX::Expat::_handle_start
15.4 7.136 12.820 131698 0.0000 0.0000 XML::SAX::Expat::_handle_char
14.7 6.787 55.922 1 6.7867 55.921 XML::Parser::Expat::ParseStream
13.6 6.311 12.977 397862 0.0000 0.0000 XML::SAX::Expat::_handle_end
7.07 3.258 3.258 472462 0.0000 0.0000 XML::NamespaceSupport::_get_ns_details
6.79 3.132 3.132 397862 0.0000 0.0000 XML::NamespaceSupport::push_context
6.48 2.986 5.685 131698 0.0000 0.0000 XML::SAX::Base::characters
4.24 1.953 1.953 131698 0.0000 0.0000 EADHandler::characters
3.87 1.786 4.411 397862 0.0000 0.0000 EADHandler::start_element
3.78 1.744 12.308 211270 0.0000 0.0000 XML::SAX::Base::__ANON__
3.69 1.702 1.838 4000 0.0004 0.0005 Data::Dumper::Dumpxs
2.55 1.174 5.870 397862 0.0000 0.0000 XML::SAX::Base::start_element
2.44 1.124 3.956 397862 0.0000 0.0000 XML::NamespaceSupport::process_element_name
1.93 0.892 0.892 397862 0.0000 0.0000 XML::NamespaceSupport::pop_context
1.85 0.854 5.768 397862 0.0000 0.0000 XML::SAX::Base::end_element

Note how it’s dominated by XML::SAX::Expat.
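
For the record, that table comes from Devel::DProf; the run looked something like this (the script and file names here are just stand-ins):

    perl -d:DProf load_ead.pl semi_large.xml   # writes tmon.out in the current directory
    dprofpp                                    # reads tmon.out and prints the table above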

4 thoughts on “I may need to rethink this…”

  1. Looks like a standard troubleshooting issue to me. First, do you have enough main memory? Then check the disks. I haven’t done much database work recently, but I find processors with multiple cores to be overrated, especially if your code can’t work with more than one logical processor at a time. I don’t have time to look into that, though – I have to do more theoretical compsci for the next year or so.

  2. It’s not using massive amounts of memory and it’s not swapping. It’s not even doing much disk I/O. It’s completely CPU-bound, and it only uses one core.

  3. Since a rewrite is going to be a massive undertaking, I would seek better information first. I would instrument the decoder to try to pinpoint the compute bottleneck, and re-run it on the large file. That may point to a specific area that needs optimization.

    Whenever I am forced to do this exercise, I am frequently surprised by the root cause of the bottleneck. Sometimes the fix is trivial, and sometimes it is massive…

  4. Unfortunately, the profiler output from a semi-large file isn’t all that conclusive. It looks like the bottlenecks (and there are a few) are in standard Perl code.
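
    If I do end up instrumenting it as suggested, the cheapest thing I can think of is tallying elements by name, so I can at least see what the file is mostly made of and how long the whole pass took. A rough sketch of what I’d bolt onto the existing handler (variable names are just illustrative):

        use Time::HiRes qw(time);

        my $started = time();
        my %seen;                      # element name => count

        # ...at the top of the existing start_element:
        $seen{ $el->{LocalName} }++;

        # ...and at the end of the run, dump the counts and overall time:
        END {
            my $total = 0;
            $total += $_ for values %seen;
            printf "%10d  %s\n", $seen{$_}, $_
                for sort { $seen{$b} <=> $seen{$a} } keys %seen;
            printf "%d elements in %.0f seconds\n", $total, time() - $started;
        }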
