Geocoding is hard…

One of the problems I’m having with this data load is that instead of telling you what country each waypoint is in, they tell you the “responsible authority”. Ok, normally that’s not too hard to map to a country, and sometimes there are multiple authorities for a country (and the Czech Republic is super annoying because they designate every little flying club or airport owner as a “responsible authority”). That I can take care of with a simple lookup table – 305 entries, 90 of them in the Czech Republic. The problem occurs because sometimes the “responsible authority” covers multiple countries: “Serbia/Montenegro” in the Balkans, “Comoros/Madagascar/Reunion” in the Indian Ocean, “Aruba/Netherlands Antilles” in the Caribbean, and “Kiribati/Tuvalu”, “Kiribati/Line Islands” and “American Samoa/Western Samoa” in the Pacific. (Although didn’t I read somewhere that the Netherlands Antilles recently split up into a bunch of separate countries?) Anyway, I want to disambiguate these and determine which country each point in these merged authorities is in.
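To make the lookup-table idea concrete, here is a minimal sketch in Python rather than the Perl the real loader uses. The table contents, the "Canada" authority name and the function name are all made up for illustration; only the multi-country authority names come from the data described above. An authority that maps to exactly one country resolves directly, and a merged authority hands back its candidate list so some other step can disambiguate.

```python
# Hypothetical lookup table: "responsible authority" -> candidate ISO 3166-1 codes.
# The real table has 305 entries; these few are illustrative only.
AUTHORITY_COUNTRIES = {
    "Canada": ["CA"],                                   # assumed single-country authority
    "Serbia/Montenegro": ["RS", "ME"],
    "Comoros/Madagascar/Reunion": ["KM", "MG", "RE"],
    "American Samoa/Western Samoa": ["AS", "WS"],
}


def resolve_authority(authority):
    """Return (country, candidates).

    country is the ISO code when the authority covers exactly one country,
    or None when the authority is a merged one that still needs
    disambiguating (for example by reverse geocoding the waypoint).
    """
    candidates = AUTHORITY_COUNTRIES.get(authority)
    if candidates is None:
        raise KeyError(f"unknown responsible authority: {authority!r}")
    if len(candidates) == 1:
        return candidates[0], candidates
    return None, candidates


for name in ("Canada", "Serbia/Montenegro"):
    country, candidates = resolve_authority(name)
    status = country if country else f"ambiguous, one of {candidates}"
    print(f"{name}: {status}")
```

Keeping the candidate list around is handy because whatever answer the disambiguation step produces can be sanity-checked against it.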

First I thought I’d look for the closest point in my existing database. Turns out, some of the new points are near borders so I end up getting the wrong country. Aha, I thought, I’ll use “Reverse Geocoding”. A while back I used a service at geonames.org to reverse geocode some points to determine which Canadian province they were in. I tried it, and the service is really slow to respond. So I thought I’d try Google’s new reverse geocoding. That’s when I discovered a couple of flies in my oatmeal:

  1. There are locations in the world where Google returns no results. In one case I saw, that’s because the point is slightly offshore according to Google Maps (although if you switch to satellite view you can see the point is actually on land). In another case, the result is puzzling – yes, it’s in Kosovo so maybe it’s disputed territory, but it’s not too far from the village of Lluge which Google does recognize.
  2. Addresses in Kosovo show up in the “formatted_address” field as “Lluge, Kosovo”, but the country code that is returned is Serbia. The data I’ve used before comes from the US government, and since the US government officially recognizes Kosovo, it would be inconsistent to label the new stuff as from Serbia instead of Kosovo.

Oh, and geonames.org? It eventually seems to do the right thing for both of the above cases, although the country code it returns for Kosovo is “XK” (it appears that there isn’t an official ISO country code for Kosovo – I’d previously seen “KS”). I guess I’ll have to experiment more.
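One way to keep the Kosovo labelling consistent is to normalize whatever the geocoder returns before it goes into the database. This is only a sketch in Python: the function name, the choice of “XK” as the canonical code and the idea of trusting the address text over the country code are all assumptions, not anything either service documents. It also assumes no other country in the data uses “KS”.

```python
# Codes mentioned above for Kosovo: "XK" (from geonames.org) and "KS" (seen previously).
# Assumption: nothing else in the data uses either of these codes.
KOSOVO_ALIASES = {"XK", "KS"}
CANONICAL_KOSOVO = "XK"  # assumption: pick one code and use it everywhere


def normalize_country(country_code, formatted_address=""):
    """Return a consistent country code for a reverse-geocoding result.

    country_code and formatted_address are plain strings already pulled out
    of the service's response; how they are extracted is not shown here.
    """
    code = (country_code or "").strip().upper()
    if code in KOSOVO_ALIASES:
        return CANONICAL_KOSOVO
    # The case described above: the address text says Kosovo but the
    # country code that comes back is Serbia.
    if code == "RS" and formatted_address.strip().endswith("Kosovo"):
        return CANONICAL_KOSOVO
    return code or None


print(normalize_country("RS", "Lluge, Kosovo"))
print(normalize_country("RS", "Belgrade, Serbia"))
```

A point that gets no result at all still comes back as None, so those would have to fall through to a second service or a manual check.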

Consider those goals met

On October 13th last year, I posted about my goals for this year, and beyond. In that post, I expressed the goal of doing 650-700 miles of paddling this year. I just checked with Garmin Connect, and it shows that since January 1st I’ve paddled 759.25 miles, including 76.17 miles of races. That does not include a few workouts here and there where I forgot my GPS, or a short gap where my GPS stopped uploading to the computer and I had to buy another one. If I do the “Last 365 days” instead of “Since January 1”, that ups my total to 945.8 miles. I’d say that constitutes a pretty decent base.

I also said I’d like to join a pit crew to see what it’s like at the Adirondack Canoe Classic (aka “The 90 Miler”). That I did, and I helped out Sue and Liz as they took care of Doug and Mike at the 90. Granted, I didn’t go to every pit stop, mostly because I was trying to get a decent paddle in each day myself so I could see what it was like, but I was there at the finish to help tired paddlers out of their boats and take care of their boats for them. And in spite of seeing these guys staggeringly tired and bloody and nearly puking, I’m sure that I want to try it next year. I just hope my knees can stand up to portaging.

More on this data loader program

Well, I profiled a smaller data set and found a place where I was wasting a significant amount of time while processing nodes that I don’t care about. I’ve modified the code and I stopped the perl program (after 6098 minutes elapsed, 3491 minutes user, 2584 minutes system) and I’ve re-run it, and it finished in 16 minutes 30 seconds elapsed, 16 minutes 10 seconds user, 10 seconds system. Meanwhile, I’ve written a Java program that does the same stuff that the perl program does (like I said in my previous post, the perl program doesn’t actually do any loading or anything useful, it just parses one of the types of nodes that I’m interested in and prints out what it’s found) and it ran the whole file in 17 minutes 38 seconds elapsed, 6 minutes 31 seconds user and 10 minutes 9 seconds system.
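For anybody who wants the comparison as numbers, here is the arithmetic as a small Python sketch. The timings are the ones quoted above; the labels and the helper function are mine. Bear in mind the original Perl run was stopped before it finished, so the speedup is a lower bound.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


# (elapsed, user, system) in seconds, taken from the timings quoted above.
runs = {
    "perl before fix (stopped early)": (to_seconds(6098), to_seconds(3491), to_seconds(2584)),
    "perl after fix": (to_seconds(16, 30), to_seconds(16, 10), to_seconds(0, 10)),
    "java": (to_seconds(17, 38), to_seconds(6, 31), to_seconds(10, 9)),
}

for name, (elapsed, user, system) in runs.items():
    cpu = user + system
    print(f"{name}: elapsed {elapsed}s, cpu {cpu}s, system share of cpu {system / cpu:.0%}")

# Lower bound on the improvement, because the old run never completed.
speedup = runs["perl before fix (stopped early)"][0] / runs["perl after fix"][0]
print(f"perl speedup: at least {speedup:.0f}x")
```

That works out to a speedup of at least roughly 370 times for the Perl program. The other thing that jumps out is that about 61% of the Java run's CPU time is system time, against about 1% for the fixed Perl run.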

So the upshot of this is that I guess I’m going to stick to perl.

I may need to rethink this…

I am currently working on a new data source for the waypoint generator. Unfortunately because of the way it’s licensed, it’s only going to be for the iPhone version of CoPilot, and I can’t make it available for GPX and other users. Now all of my data loaders have, up until now, been written in Perl, and I have a really good Perl module that performs many of the loading tasks, such as merging existing data with new data.

The new data comes in the form of a gigantic XML file with a kind of weird schema. The provider actually provides both the gigantic file, and also a smaller set of updates on the 28 day cycle favoured by the ICAO, so hopefully I’ll only have to parse the gigantic file once, and then process the updates. I installed XML::SAX and Expat, and coded up a preliminary decoder to extract some (but not all) of the information that I need, just to make sure I was doing it right. I ran it with a subset of the data, and it seemed to be doing ok, and then just for grins while I was working on improving the code, I fired it off on the whole file. That was 3 days (72 hours) ago. It’s still running. Unfortunately I didn’t put in any progress messages so I don’t know where it is in the file, only that it’s past the airport section that I care about. I profiled the subset data, and verified that Perl is spending most of its time in Perl code, not in native code – some of it mine, some of it XML::SAX, and some of it in Moose.
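The missing progress messages are cheap to add to a SAX-style handler. Here is a minimal sketch of the idea using Python's standard xml.sax module rather than the Perl XML::SAX code I actually wrote. The class name, the reporting interval and the tiny sample document are all made up for illustration.

```python
import io
import sys
import xml.sax


class ProgressHandler(xml.sax.ContentHandler):
    """Counts start tags and reports every `report_every` elements."""

    def __init__(self, report_every=100_000, out=sys.stderr):
        super().__init__()
        self.report_every = report_every
        self.out = out
        self.elements = 0
        self.locator = None

    def setDocumentLocator(self, locator):
        # The parser hands over a locator that can report the current line.
        self.locator = locator

    def startElement(self, name, attrs):
        self.elements += 1
        if self.elements % self.report_every == 0:
            line = self.locator.getLineNumber() if self.locator else "?"
            print(f"{self.elements} elements parsed, at line {line} (<{name}>)",
                  file=self.out)


# Tiny made-up document standing in for the gigantic file.
sample = b"<root>\n<a/>\n<b/>\n<c/>\n</root>"
handler = ProgressHandler(report_every=2)
xml.sax.parse(io.BytesIO(sample), handler)
print(f"done, {handler.elements} elements", file=sys.stderr)
```

Writing to stderr keeps the progress chatter out of whatever the loader prints on stdout, and comparing the reported line number against a line count of the file gives a rough idea of how far along the run is.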

So here’s the conundrum: Do I spend the time to re-write this loader code in another language and hope it’s faster? Or do I accept the fact that this is going to take forever, but hopefully I’ll only have to do it once and then the updates will be small enough that I can do them in perl? Because re-writing in another language means re-writing all the data merging and validation logic code, and could be a potentially huge project. And I won’t know until it’s all working whether it’s going to be faster.

Update: I profiled the perl program with a semi-large dataset. Here are the results:

dprofpp
Total Elapsed Time = 56.86461 Seconds
User+System Time = 46.10461 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
20.5 9.494 23.288 397862 0.0000 0.0001 XML::SAX::Expat::_handle_start
15.4 7.136 12.820 131698 0.0000 0.0000 XML::SAX::Expat::_handle_char
14.7 6.787 55.922 1 6.7867 55.921 XML::Parser::Expat::ParseStream
13.6 6.311 12.977 397862 0.0000 0.0000 XML::SAX::Expat::_handle_end
7.07 3.258 3.258 472462 0.0000 0.0000 XML::NamespaceSupport::_get_ns_details
6.79 3.132 3.132 397862 0.0000 0.0000 XML::NamespaceSupport::push_context
6.48 2.986 5.685 131698 0.0000 0.0000 XML::SAX::Base::characters
4.24 1.953 1.953 131698 0.0000 0.0000 EADHandler::characters
3.87 1.786 4.411 397862 0.0000 0.0000 EADHandler::start_element
3.78 1.744 12.308 211270 0.0000 0.0000 XML::SAX::Base::__ANON__
3.69 1.702 1.838 4000 0.0004 0.0005 Data::Dumper::Dumpxs
2.55 1.174 5.870 397862 0.0000 0.0000 XML::SAX::Base::start_element
2.44 1.124 3.956 397862 0.0000 0.0000 XML::NamespaceSupport::process_element_name
1.93 0.892 0.892 397862 0.0000 0.0000 XML::NamespaceSupport::pop_context
1.85 0.854 5.768 397862 0.0000 0.0000 XML::SAX::Base::end_element

Note how it’s dominated by XML::SAX::Expat.
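To put a number on “dominated”, the exclusive percentages can be summed per Perl package. The sketch below is Python, not part of the loader. It uses the rows from the report above with the wrapped subroutine names joined back together, and the function name is made up.

```python
from collections import defaultdict

# Rows copied from the dprofpp report above (wrapped names rejoined).
PROFILE_ROWS = """\
20.5 9.494 23.288 397862 0.0000 0.0001 XML::SAX::Expat::_handle_start
15.4 7.136 12.820 131698 0.0000 0.0000 XML::SAX::Expat::_handle_char
14.7 6.787 55.922 1 6.7867 55.921 XML::Parser::Expat::ParseStream
13.6 6.311 12.977 397862 0.0000 0.0000 XML::SAX::Expat::_handle_end
7.07 3.258 3.258 472462 0.0000 0.0000 XML::NamespaceSupport::_get_ns_details
6.79 3.132 3.132 397862 0.0000 0.0000 XML::NamespaceSupport::push_context
6.48 2.986 5.685 131698 0.0000 0.0000 XML::SAX::Base::characters
4.24 1.953 1.953 131698 0.0000 0.0000 EADHandler::characters
3.87 1.786 4.411 397862 0.0000 0.0000 EADHandler::start_element
3.78 1.744 12.308 211270 0.0000 0.0000 XML::SAX::Base::__ANON__
3.69 1.702 1.838 4000 0.0004 0.0005 Data::Dumper::Dumpxs
2.55 1.174 5.870 397862 0.0000 0.0000 XML::SAX::Base::start_element
2.44 1.124 3.956 397862 0.0000 0.0000 XML::NamespaceSupport::process_element_name
1.93 0.892 0.892 397862 0.0000 0.0000 XML::NamespaceSupport::pop_context
1.85 0.854 5.768 397862 0.0000 0.0000 XML::SAX::Base::end_element
"""


def exclusive_percent_by_package(report):
    """Sum the %Time column per package (everything before the last '::')."""
    totals = defaultdict(float)
    for line in report.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip anything that is not a data row
        percent, name = float(fields[0]), fields[6]
        package = name.rpartition("::")[0]
        totals[package] += percent
    return dict(totals)


totals = exclusive_percent_by_package(PROFILE_ROWS)
for package, percent in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{percent:5.1f}%  {package}")
```

Grouped this way, XML::SAX::Expat alone accounts for about 49.5% of the exclusive time in the listed rows, while my own EADHandler code is only about 8%.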

Cool surfer d00d

Today Mike, Ken and I met at the Irondequoit Bay inlet for some paddling in the surf. The air was fairly cold, in the low 50s, and the water was in the low 60s, and I was wearing my farmer john and my Hydroskin shirt, so I was pretty well prepared for it. Unfortunately when we got there what we found were huge crashing waves, and howling winds. The bay side wasn’t so bad, but with the wind howling I knew that once we got a little bit off shore the waves would kick up there too. We decided to “play” a bit in the channel.

When we paddled into the channel, what we found were huge crashing waves at the top end of the channel, and beyond them what looked like a solid wall of water about 5 feet high. Honestly, you didn’t get any sense that there were waves out there, just that the lake level was 5 feet higher than the water level in the channel. Mike and I decided to just do runs up and down the relatively calmer part of the channel, getting into waves that were probably only a foot or so high, but Ken was into it and we could see him flying around in that maelstrom. I was sure his boat was continuing towards shore without him a few times, but as it got closer you’d see it get under control and realize he was still in it.

I should mention that the “relatively calm” part of the channel was highly variable – every now and then a set of white caps would come roaring down about half way down the length of the channel, especially on some shallower water on the east side, but much of the time you could get about two thirds of the way up without any difficulty.

I ended up dumping three times in the 30 minutes I was out there. The first time came when Ken asked to dock up with me so he could attach his paddle leash, and by the time he was done we’d drifted sideways to the waves and wind, so almost as soon as I was clear of him I dumped. The second came as I was getting back in after that dump. I normally don’t count the secondary dumping while getting in because it happens so often.

The third time I had drifted a little too far upwind as I was waiting for a lull in the waves, and when I went to turn the wind caught me hard and threw me over. After that time I decided to call it a night before I started to get cold. I was well dressed for it and I hadn’t felt cold while in the water, but I figure it’s better to quit while you’re ahead.

I intend to do some more of this kind of wave work. I think it’s really good for me, and eventually I’ll probably come to enjoy it. However, I need to make two small adjustments – I need to put my big rudder on the ski so I have a bit more control, and I need to remember to bring a towel and some dry clothes for after.