Quick update on a busy week

Since my last major blog post (not counting the 4am rant about coding), here’s what I’ve been up to:

  • had a long team workout in the bay, doing 8 intervals of 1/4 mile at high speed.
  • spent a day in our new office – it’s kind of tiny and the chair isn’t very comfortable, so I haven’t been back since, but I’ll probably start going now that Vicki is back at work full time because we can eat lunch together at RIT
  • did a “short” paddle of 6 miles alone on the river. Man, I remember when 6 miles was a major workout. It seems like it was only a year ago, mostly because it was.
  • Did a long paddle of 10 miles with Mike and Paul D on the Bay. I tried to ride wash most of the way, and I was still wiped at the end.
  • Went to the LUGOR picnic and talked to a guy about his project to use NNTP over mesh networks to make an extremely distributed discussion network.
  • Took my citizenship interview and test, and passed. My swearing-in ceremony is in two weeks. Doesn’t give me much time to get that Canadian flag tattoo.
  • Drove up to Oshawa to pick up my dad’s old table saw and drill press – he’s moving into a smaller house, and had no room for it, so I said I’d take it. Don’t know what I’ll do with it, but I’d love to get better at wood shop type stuff.
  • Arrived late for the team workout, so I only did one interval. I wasn’t properly warmed up at that point, and it didn’t go well.
  • Did the equivalent workout the next day, doing 5 sets of 1000 metre (0.62 mile) at around 7mph.
  • Found out that a friend of Dan’s who was at the team work out is giving away a racing kayak, a West Side Boat Shop Thunderbolt, because it was set up for a 250 pound paddler and he couldn’t get it adjusted for himself. This boat is longer, narrower and lighter than my existing boat and a lot faster and tippier. I’m looking forward to trying it out, but I doubt I’ll be able to race it this year.
  • Made slow progress on the contract job I’m working on. Yes, I’m late getting it done, but I think it’s going to go faster now that I’ve got stage 1 done.

I’ve got coding, running around my brain

One of the problems I’ve suffered from all my working life is an inability to sleep when something is bugging me about the program I’m working on. Currently, it’s 3:56 am and I’m at my computer because I was tossing and turning, thinking of various things I had to try to figure out what’s going wrong, and so I had to get up to try them. Unfortunately, those things didn’t work, so I had to try other things, and here it is 2 hours later and I’m no closer to fixing the original problem, and no closer to going to sleep.

I’d say this inability to shut out a problem and go to sleep was a major problem, but by the same token I like to tell myself that it’s this single-minded determination to get things right that makes me so good at programming, so I guess I have to take the one with the other.

And now it’s 4:03, and my latest test is getting
fetch of http://localhost/Documents/pharma/DocSamples/CHINA.doc failed with: java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;
so it looks like sleep isn’t any closer.

This is worrisome.

Update: Somebody on the Nutch mailing list pointed me towards the config option “fetcher.threads.per.host”. Increasing that to 10 dropped the time from 45 minutes to 15 minutes on the first crawl and 2 minutes for a re-crawl. Since I fixed Nutch to properly respect the Last-Modified header and If-Modified-Since, I don’t think I’m going to be blocked from crawling sites with multiple threads. Much less worrisome.

Time spent to copy all the files on three small web sites to a directory on my machine using wget: 1 minute 1.114 seconds.

Time spent for Nutch to re-crawl those same web sites: 45 minutes.

It doesn’t seem to matter what I put in the “number of threads” parameter to Nutch, either – it takes 45 minutes if I give it 10 threads or 125 threads.

Even worse for Nutch, out of the box it refetches documents even if they haven’t changed – I had to find and fix a bug to make that part work – but wget does the right thing.
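
The “right thing” here is a conditional GET: send the date of the copy you already have in an If-Modified-Since header, and let the server answer 304 Not Modified instead of sending the file again. A rough sketch of the idea, in Python rather than Nutch’s Java, and not the actual Nutch fix; the function name fetch_if_modified is made up for illustration:

```python
import urllib.error
import urllib.request


def fetch_if_modified(url, last_modified=None, timeout=10):
    """Fetch url unless the server says it is unchanged.

    last_modified is the Last-Modified header value saved from an
    earlier fetch (or None on the first crawl).
    Returns (body, last_modified); body is None on a 304 response.
    """
    request = urllib.request.Request(url)
    if last_modified:
        # Ask the server to skip the body if nothing changed since this date.
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: keep the copy we already have.
            return None, last_modified
        raise
```

A crawler would store the returned Last-Modified value for each URL and pass it back on the next crawl, so an unchanged page costs a header exchange rather than a full download.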

Considering that all I’m doing with the Nutch crawl is going through the returned files one by one, doing some analysis, and putting those results in a Solr index, I wonder if I should toss Nutch entirely and just work up something using wget? All I’m really getting out of Nutch is pre-parsing the HTML to extract some metadata.

Too bad I’ve already spent 3 weeks on this contract going down the Nutch road. At this point, it would be too time consuming to throw away everything I have and start afresh.

Long paddle today

[youtube YE17iVskad4 Team Practice]
Today instead of just me and Mike doing a long grind, it was five of us – Mike, Paul D, Bill, me, and coach Dan. We met at the lake, and the lake was flatter than a pancake. Even I, the big wuss, paddled with the PFD lashed to the rear deck instead of wearing it, although Paul D wore his, but I think that was more due to his lack of experience and comfort in the ski rather than waves or wind.

We started off doing a moderate pace, riding each other’s wash, and every mile doing a “pickup” or a faster piece, not a sprint, but faster than our “grind” pace. At other times, instead of doing our “pickup” at a given time, we sprinted across a river channel, or turned to ride a large wake coming in. We did two-level pickups where we increased pace to something like 6.5 mph, and then after 45 seconds increased to 6.7 or 6.8 for another 45 seconds. It was a good workout, lots of variation, and I’m quite wiped right now.

It’s an awesome sight seeing those four gleaming white surf skis skimming along the water, and my boat is also pretty gleaming itself, although it looks a little out of place. Based on my brief experience with the V10 Sport at Baycreek, I figure I’m half a mile an hour slower in my boat, so I think I’m doing pretty damn well to keep up with these guys for 2 hours.