I am currently working on a new data source for the waypoint generator. Unfortunately because of the way it’s licensed, it’s only going to be for the iPhone version of CoPilot, and I can’t make it available for GPX and other users. Now all of my data loaders have, up until now, been written in Perl, and I have a really good Perl module that performs many of the loading tasks, such as merging existing data with new data.
The new data comes in the form of a gigantic XML file with a kind of weird schema. The provider actually provides both the gigantic file, and also a smaller set of updates on the 28 day cycle favoured by the ICAO, so hopefully I’ll only have to parse the gigantic file once, and then process the updates. I installed XML::SAX and Expat, and coded up a preliminary decoder to extract some (but not all) of the information that I need, just to make sure I was doing it right. I ran it with a subset of the data, and it seemed to be doing ok, and then just for grins while I was working on improving the code, I fired it off on the whole file. That was 3 days (72 hours) ago. It’s still running. Unfortunately I didn’t put in any progress messages so I don’t know where it is in file, only that it’s past the airport section that I care about. I profiled the subset data, and verified that Perl is spending most of its time in Perl code, not in native code – some of it mine, some of it XML::SAX, and some of it in Moose.
So here’s the conundrum: Do I spend the time to re-write this loader code in another language and hope it’s faster? Or do I accept the fact that this is going to take forever, but hopefully I’ll only have to do it once and then the updates will be small enough that I can do them in perl? Because re-writing in another language means re-writing all the data merging and validation logic code, and could be a potentially huge project. And I won’t know until it’s all working whether it’s going to be faster.
Update: I profiled the perl program with a semi-large dataset. Here’s the results:
Total Elapsed Time = 56.86461 Seconds
User+System Time = 46.10461 Seconds
%Time ExclSec CumulS #Calls sec/call Csec/c Name
20.5 9.494 23.288 397862 0.0000 0.0001 XML::SAX::Expat::_handle_start
15.4 7.136 12.820 131698 0.0000 0.0000 XML::SAX::Expat::_handle_char
14.7 6.787 55.922 1 6.7867 55.921 XML::Parser::Expat::ParseStream
13.6 6.311 12.977 397862 0.0000 0.0000 XML::SAX::Expat::_handle_end
7.07 3.258 3.258 472462 0.0000 0.0000 XML::NamespaceSupport::_get_ns_det
6.79 3.132 3.132 397862 0.0000 0.0000 XML::NamespaceSupport::push_contex
6.48 2.986 5.685 131698 0.0000 0.0000 XML::SAX::Base::characters
4.24 1.953 1.953 131698 0.0000 0.0000 EADHandler::characters
3.87 1.786 4.411 397862 0.0000 0.0000 EADHandler::start_element
3.78 1.744 12.308 211270 0.0000 0.0000 XML::SAX::Base::__ANON__
3.69 1.702 1.838 4000 0.0004 0.0005 Data::Dumper::Dumpxs
2.55 1.174 5.870 397862 0.0000 0.0000 XML::SAX::Base::start_element
2.44 1.124 3.956 397862 0.0000 0.0000 XML::NamespaceSupport::process_ele
1.93 0.892 0.892 397862 0.0000 0.0000 XML::NamespaceSupport::pop_context
1.85 0.854 5.768 397862 0.0000 0.0000 XML::SAX::Base::end_element
Note how it’s dominated by XML::SAX::Expat.