More on this data loader program

Well, I profiled a smaller data set and found a place where I was wasting a significant amount of time processing nodes that I don’t care about. I stopped the Perl program (after 6098 minutes elapsed, 3491 minutes user, 2584 minutes system), modified the code and re-ran it, and it finished in 16 minutes 30 seconds elapsed, 16 minutes 10 seconds user, 10 seconds system. Meanwhile, I’ve written a Java program that does the same stuff that the Perl program does (like I said in my previous post, the Perl program doesn’t actually do any loading or anything useful, it just parses one of the types of nodes that I’m interested in and prints out what it’s found). It ran the whole file in 17 minutes 38 seconds elapsed, 6 minutes 31 seconds user and 10 minutes 9 seconds system.

So the upshot of this is that, since the Java version isn’t any faster in elapsed time, I guess I’m going to stick with Perl.
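A minimal sketch of the kind of fix described above, bailing out early on nodes that aren't of interest. It uses Python's standard `xml.sax` rather than Perl's XML::SAX purely to keep the example self-contained; the handler callbacks map across fairly directly. The element names (`Airport`, `Navaid`, `Ident`, `Name`), the sample data and the `AirportOnlyHandler` class are all made up for illustration and are not the real schema or the actual loader code.

```python
import io
import xml.sax


class AirportOnlyHandler(xml.sax.ContentHandler):
    """Collect the direct child fields of wanted elements; ignore the rest."""

    def __init__(self, wanted="Airport"):  # "Airport" is a hypothetical name
        super().__init__()
        self.wanted = wanted
        self.depth_inside = 0   # > 0 only while inside a wanted element
        self.current = None     # fields of the record being built
        self.field = None       # name of the child element being read
        self.records = []

    def startElement(self, name, attrs):
        if self.depth_inside == 0:
            if name != self.wanted:
                return          # cheap early exit for nodes we ignore
            self.current = {}
        elif self.depth_inside == 1:
            self.field = name
            self.current[name] = ""
        self.depth_inside += 1

    def characters(self, content):
        # Only accumulate text for direct children of a wanted element.
        if self.depth_inside == 2 and self.field is not None:
            self.current[self.field] += content

    def endElement(self, name):
        if self.depth_inside == 0:
            return              # closing tag of an ignored element
        self.depth_inside -= 1
        if self.depth_inside == 1:
            self.field = None
        elif self.depth_inside == 0:
            self.records.append(self.current)
            self.current = None


# Made-up sample data, not the provider's schema.
SAMPLE = b"""<Data>
  <Navaid><Ident>ROC</Ident></Navaid>
  <Airport><Ident>KROC</Ident><Name>Rochester</Name></Airport>
  <Navaid><Ident>GEE</Ident></Navaid>
</Data>"""


def parse_airports(stream):
    handler = AirportOnlyHandler()
    xml.sax.parse(stream, handler)
    return handler.records
```

Calling `parse_airports(io.BytesIO(SAMPLE))` returns only the one airport record. For the two `Navaid` subtrees each callback does a comparison or two and returns, which is the point: the parser still has to walk the whole file, but the handler does almost nothing for the parts that don't matter.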

I may need to rethink this…

I am currently working on a new data source for the waypoint generator. Unfortunately because of the way it’s licensed, it’s only going to be for the iPhone version of CoPilot, and I can’t make it available for GPX and other users. All of my data loaders have, up until now, been written in Perl, and I have a really good Perl module that performs many of the loading tasks, such as merging existing data with new data.

The new data comes in the form of a gigantic XML file with a kind of weird schema. The provider actually provides both the gigantic file, and also a smaller set of updates on the 28 day cycle favoured by the ICAO, so hopefully I’ll only have to parse the gigantic file once, and then process the updates. I installed XML::SAX and Expat, and coded up a preliminary decoder to extract some (but not all) of the information that I need, just to make sure I was doing it right. I ran it with a subset of the data, and it seemed to be doing ok, and then just for grins while I was working on improving the code, I fired it off on the whole file. That was 3 days (72 hours) ago. It’s still running. Unfortunately I didn’t put in any progress messages so I don’t know where it is in the file, only that it’s past the airport section that I care about. I profiled the subset data, and verified that Perl is spending most of its time in Perl code, not in native code – some of it mine, some of it in XML::SAX, and some of it in Moose.
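The missing progress messages could look something like the sketch below: count elements in the SAX handler and report every so often, along with the parser's current line number. This is written against Python's standard `xml.sax` as a stand-in for XML::SAX; `ProgressHandler`, `parse_with_progress` and the reporting interval are hypothetical names and values chosen for illustration, not part of the actual loader.

```python
import io
import sys
import xml.sax


class ProgressHandler(xml.sax.ContentHandler):
    """Print a progress line every `every` start tags."""

    def __init__(self, every=100_000, out=sys.stderr):
        super().__init__()
        self.every = every
        self.out = out
        self.count = 0
        self.locator = None

    def setDocumentLocator(self, locator):
        # The parser hands over a locator that can report its position.
        self.locator = locator

    def startElement(self, name, attrs):
        self.count += 1
        if self.count % self.every == 0:
            line = self.locator.getLineNumber() if self.locator else "?"
            print(f"{self.count} elements, line {line}, at <{name}>",
                  file=self.out)


def parse_with_progress(stream, every=100_000, out=sys.stderr):
    handler = ProgressHandler(every=every, out=out)
    xml.sax.parse(stream, handler)
    return handler.count
```

A real handler would do its normal work in `startElement` as well; the counter adds one increment and one modulo per element, which is small next to the cost of not knowing whether a three-day run is nearly done.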

So here’s the conundrum: Do I spend the time to re-write this loader code in another language and hope it’s faster? Or do I accept the fact that this is going to take forever, but hopefully I’ll only have to do it once and then the updates will be small enough that I can do them in Perl? Because re-writing in another language means re-writing all the data merging and validation logic, and could potentially be a huge project. And I won’t know until it’s all working whether it’s going to be faster.

Update: I profiled the Perl program with a semi-large dataset. Here are the results:

dprofpp
Total Elapsed Time = 56.86461 Seconds
  User+System Time = 46.10461 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 20.5   9.494 23.288 397862   0.0000 0.0001  XML::SAX::Expat::_handle_start
 15.4   7.136 12.820 131698   0.0000 0.0000  XML::SAX::Expat::_handle_char
 14.7   6.787 55.922      1   6.7867 55.921  XML::Parser::Expat::ParseStream
 13.6   6.311 12.977 397862   0.0000 0.0000  XML::SAX::Expat::_handle_end
 7.07   3.258  3.258 472462   0.0000 0.0000  XML::NamespaceSupport::_get_ns_details
 6.79   3.132  3.132 397862   0.0000 0.0000  XML::NamespaceSupport::push_context
 6.48   2.986  5.685 131698   0.0000 0.0000  XML::SAX::Base::characters
 4.24   1.953  1.953 131698   0.0000 0.0000  EADHandler::characters
 3.87   1.786  4.411 397862   0.0000 0.0000  EADHandler::start_element
 3.78   1.744 12.308 211270   0.0000 0.0000  XML::SAX::Base::__ANON__
 3.69   1.702  1.838   4000   0.0004 0.0005  Data::Dumper::Dumpxs
 2.55   1.174  5.870 397862   0.0000 0.0000  XML::SAX::Base::start_element
 2.44   1.124  3.956 397862   0.0000 0.0000  XML::NamespaceSupport::process_element_name
 1.93   0.892  0.892 397862   0.0000 0.0000  XML::NamespaceSupport::pop_context
 1.85   0.854  5.768 397862   0.0000 0.0000  XML::SAX::Base::end_element

Note how it’s dominated by XML::SAX::Expat.
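To put a rough number on "dominated", the exclusive seconds can be added up per package. The sketch below does that with the ExclSec and Name columns copied from the listing; the package is taken to be everything before the last `::` in the subroutine name, which is an assumption of this example rather than something dprofpp reports. The listed exclusive times sum to about 50.3 seconds, slightly more than the reported User+System time, so the shares are best read as relative to the listed rows only.

```python
from collections import defaultdict

# (exclusive seconds, subroutine name) copied from the dprofpp listing.
ROWS = [
    (9.494, "XML::SAX::Expat::_handle_start"),
    (7.136, "XML::SAX::Expat::_handle_char"),
    (6.787, "XML::Parser::Expat::ParseStream"),
    (6.311, "XML::SAX::Expat::_handle_end"),
    (3.258, "XML::NamespaceSupport::_get_ns_details"),
    (3.132, "XML::NamespaceSupport::push_context"),
    (2.986, "XML::SAX::Base::characters"),
    (1.953, "EADHandler::characters"),
    (1.786, "EADHandler::start_element"),
    (1.744, "XML::SAX::Base::__ANON__"),
    (1.702, "Data::Dumper::Dumpxs"),
    (1.174, "XML::SAX::Base::start_element"),
    (1.124, "XML::NamespaceSupport::process_element_name"),
    (0.892, "XML::NamespaceSupport::pop_context"),
    (0.854, "XML::SAX::Base::end_element"),
]


def seconds_by_package(rows):
    """Sum exclusive seconds per package (name minus the last component)."""
    totals = defaultdict(float)
    for seconds, name in rows:
        package = name.rsplit("::", 1)[0]
        totals[package] += seconds
    return dict(totals)


if __name__ == "__main__":
    totals = seconds_by_package(ROWS)
    listed = sum(totals.values())
    for package, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{package:25s} {seconds:7.3f}s {100 * seconds / listed:5.1f}%")
```

By this count XML::SAX::Expat accounts for roughly 22.9 of the 50.3 listed seconds, with XML::NamespaceSupport next at about 8.4, while the EADHandler callbacks come to about 3.7.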

Cool surfer d00d

Today Mike, Ken and I met at the Irondequoit Bay inlet for some paddling in the surf. The air was fairly cold, in the low 50s, and the water was in the low 60s, and I was wearing my farmer john and my Hydroskin shirt, so I was pretty well prepared for it. Unfortunately when we got there what we found were huge crashing waves, and howling winds. The bay side wasn’t so bad, but with the wind howling I knew that once we got a little bit off shore the waves would kick up there too. We decided to “play” a bit in the channel.

When we paddled into the channel, what we found were huge crashing waves at the top end of the channel, and beyond them what looked like a solid wall of water about 5 feet high. Honestly, you didn’t get any sense that there were waves out there, just that the lake level was 5 feet higher than the water level in the channel. Mike and I decided to just do runs up and down the relatively calmer part of the channel, getting into waves that were probably only a foot or so high, but Ken was into it and we could see him flying around in that maelstrom. I was sure his boat was continuing towards shore without him a few times, but as it got closer you’d see it get under control and realize he was still in it.

I should mention that the “relatively calm” part of the channel was highly variable – every now and then a set of white caps would come roaring down about half way down the length of the channel, especially on some shallower water on the east side, but much of the time you could get about two thirds of the way up without any difficulty.

I ended up dumping three times in the 30 minutes I was out there. The first time came when Ken asked to dock up with me so he could attach his paddle leash, and by the time he was done we’d drifted sideways to the waves and wind, so almost as soon as I was clear of him I dumped. The second came as I was getting back in after that dump. I normally don’t count the secondary dumping while getting in because it happens so often.

The third time I had drifted a little too far upwind as I was waiting for a lull in the waves, and when I went to turn the wind caught me hard and threw me over. After that time I decided to call it a night before I started to get cold. I was well dressed for it and I hadn’t felt cold while in the water, but I figure it’s better to quit while you’re ahead.

I intend to do some more of this kind of wave work. I think it’s really good for me, and eventually I’ll probably come to enjoy it. However, I need to make two small adjustments – I need to put my big rudder on the ski so I have a bit more control, and I need to remember to bring a towel and some dry clothes for after.