This is worrisome.

Update: Somebody on the Nutch mailing list pointed me towards the config option “fetcher.threads.per.host”. Increasing that to 10 dropped the time from 45 minutes to 15 minutes for the first crawl and to 2 minutes for a re-crawl. And since I fixed Nutch to properly respect the Last-Modified and If-Modified-Since headers, I don’t think sites are going to block me for crawling them with multiple threads. Much less worrisome.
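For the record, this is the sort of property I set in conf/nutch-site.xml (which overrides the defaults in nutch-default.xml); 10 is just the value that worked for me on my own small sites, and the description wording is mine:

    <!-- Allow more simultaneous fetcher threads per host.
         The low default effectively serializes fetches against a single site. -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>10</value>
      <description>Maximum number of fetcher threads hitting one host at once.</description>
    </property>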

Time spent copying all the files from three small web sites to a directory on my machine using wget: 1 minute 1.114 seconds.

Time spent for Nutch to re-crawl those same web sites: 45 minutes.

It doesn’t seem to matter what I put in the “number of threads” parameter to Nutch, either – it takes 45 minutes whether I give it 10 threads or 125.

Even worse for Nutch, out of the box it refetches documents even if they haven’t changed – I had to find and fix a bug to make that part work – but wget does the right thing.

Considering that all I’m doing with the Nutch crawl is going through the returned files one by one, doing some analysis, and putting the results in a Solr index, I wonder if I should toss Nutch entirely and just work up something using wget. All I’m really getting out of Nutch is pre-parsing the HTML to extract some metadata.
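For what it’s worth, here’s a rough sketch of what the wget route might look like: mirror the sites, then walk the mirror directory and push each page into Solr with SolrJ. The field names, the “mirror” directory, and the analysis step are placeholders, and the class names are from the SolrJ version I happen to have (newer releases rename some of them), so treat this as a sketch rather than working production code:

    // Mirror the sites first with something like:
    //   wget --mirror --timestamping -P mirror/ http://www.example.com/
    // then index everything under the mirror directory into Solr.
    import java.io.File;
    import org.apache.commons.io.FileUtils;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexMirror {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            indexDir(solr, new File("mirror"));
            solr.commit();
        }

        private static void indexDir(SolrServer solr, File dir) throws Exception {
            File[] files = dir.listFiles();
            if (files == null) {
                return;
            }
            for (File f : files) {
                if (f.isDirectory()) {
                    indexDir(solr, f);
                } else if (f.getName().endsWith(".html") || f.getName().endsWith(".shtml")) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", f.getPath());                        // placeholder unique key
                    doc.addField("content", FileUtils.readFileToString(f)); // plus whatever per-page analysis I end up doing
                    solr.add(doc);
                }
            }
        }
    }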

Too bad I’ve already spent 3 weeks on this contract going down the Nutch road. At this point, it would be too time-consuming to throw away everything I have and start afresh.

Six hours with a Palm Pre

Vicki and I have been discussing smart phones for a while now. I wanted an iPhone, for a number of reasons involving the phone itself and the App Store, and also because I have severe reservations about Sprint’s ability to provide signal, based on my experience as a Sprint customer about eight years ago. But Vicki utterly hated the idea of talking into a flat panel, for reasons I don’t entirely understand, and she seemed to feel much more strongly about it than I did. So we decided to go with the Palm Pre. We picked ours up today. Here are a few preliminary impressions, in no particular order:

  • I find the keyboard very cramped. The Treo keyboard was better.
  • The screen is small compared to the iPhone/Touch but just as bright and readable.
  • There are almost no apps in the App Store.
  • As I feared from my previous experience as a Sprint customer, signal strength inside the house sucks.
  • The OS is very slick in many ways. I’m hoping there is a faster way to dismiss a page than to swipe up to go into the multi-card view and then swipe the card up to throw it away, but otherwise it’s really nice. Very much the equal of the iPhone OS, or better.
  • Even though the web browser is supposedly based on WebKit, same as the iPhone’s, it doesn’t handle GMail right – you press the “Archive” button and it doesn’t take you back to the Inbox screen (although refreshing shows that the message was archived) – and sometimes it cuts off the bottom of a message and you can’t scroll down.
  • The built-in mailer is better, but it doesn’t thread or group by subject (much like SnapperMail or Apple Mail), and when you hit the delete button it somehow really deletes the message instead of archiving it the way SnapperMail does.
  • The battery life seems pretty poor compared to the Treo, but of course I’m using it more right now, and I haven’t charged it overnight yet. But an hour or so of constant web browsing seems to use about 50% of the battery.
  • The Sprint GPS app seems extremely good – as good as or better than my Garmin nuvi – although I wish it were louder.
  • The bastards used yet another incompatible connector instead of a standard mini-USB so you have to use their cable to charge it.
  • The iTunes integration seems to be working fine, although I can’t tell if it synced contacts.
  • The ability to merge contacts is great, although I kind of wish it hadn’t dragged in every person I’ve ever sent email to on Google.
  • Same with the calendar integration – it brought in every calendar I share, even the ones I normally turn off. You can either view only one calendar, or all of them. There is no way to turn off “Vicki’s Work Calendar” and “Ubuntu Local Community” and keep all the rest on.
  • Tasks seem to have no ability to make repeating entries. Funny how Palm OS used to do that so well, but WebOS can’t. But then again, neither can Google Calendar tasks.

All in all, I think the Pre is going to be a good phone, but I wish it got better reception in the house.

Oh yeah, /tmp is *temporary*

I was storing some files that were semi-important to the project I’m working on in /tmp. I knew that there is a process on some Unix computers that cleans out the stuff in /tmp, either on boot or on a schedule, but I didn’t know if my Mac did that. So while I’d sort of had a flag in the back of my head to move them somewhere less fragile, I never got around to it. And I got to working on another part of the project for a few days and forgot about them. And in the meantime the files weren’t touched, and I installed an OS update and rebooted. And now I go back, and they’re gone. “Oh yeah”, I think, “/tmp is temporary”. So then I look to see if Time Machine has a backup, and of course Time Machine excludes /tmp because, oh yeah, /tmp is *temporary*.

I can recreate the files, but it’s a waste of a few hours. This time I’m going to recreate them in ~/data/.

This week’s interesting discoveries about Nutch

I’ve been working a lot with Nutch, the open source web crawler and indexer, and the first thing I found was that it was downloading web pages every day, instead of sending the “If-Modified-Since” header and only downloading ones that changed. Ok, I thought, I’ll fix that – since the information I want isn’t in the “datum.getModificationDate()”, I’ll use “datum.getFetchDate()”.
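For anyone who hasn’t dealt with conditional fetching: the crawler remembers when a page last changed and sends that timestamp in the If-Modified-Since request header, and the server can then answer 304 Not Modified instead of resending the whole body. Here’s a minimal sketch of the mechanism in plain Java; this is the idea, not Nutch’s actual protocol plugin code:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalFetch {
        public static void main(String[] args) throws Exception {
            // In Nutch this timestamp would come from the stored crawl record for the URL.
            long lastModified = 0L;

            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://www.example.com/").openConnection();
            if (lastModified > 0) {
                conn.setIfModifiedSince(lastModified);   // adds the If-Modified-Since request header
            }

            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                System.out.println("304: unchanged since last fetch, nothing to download");
            } else {
                lastModified = conn.getLastModified();   // Last-Modified response header, saved for next time
                System.out.println("fetched, new Last-Modified = " + lastModified);
                // ... read conn.getInputStream() and process the page ...
            }
        }
    }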

Second interesting discovery: Nutch then doesn’t index pages that returned 304 (not modified), and since the index merging code doesn’t seem to work, I can’t get the pages I so cleverly managed to avoid downloading into the index at all. Ok, I’ll fix IndexMapReduce and delete the code with the comment that says “// don’t index unmodified (empty) pages”, and resist the urge to send a cock-punch-over-ip to whoever wrote that comment for not realizing that “unmodified” does not mean “empty” by any stretch of the imagination.

Third interesting discovery: It turns out that some bright spark decided that when you’re crawling a page that’s never been loaded before, “datum.getFetchDate()” gets set to the current time, instead of giving any useful indication that the page has never been fetched. So scratch my first fix, and go looking for why datum.getModifiedDate() isn’t set. And discover that datum.setModifiedDate() is apparently never called except by code trying to force things to be recrawled. Yes, instead of forcing a new crawl by modifying the locally generated “fetch date”, they fuck around with the “modified date”, which is supposed to come originally from the server. My opinion of the quality of this crawler code is rapidly going downhill. But my patch to set the modification date according to the page’s metadata appears to be working. Sort of.
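The gist of my patch, heavily paraphrased: after a successful fetch, parse the server’s Last-Modified response header and store it on the crawl record, so the next cycle has a real timestamp to send in If-Modified-Since. The parsing part looks roughly like this; the two commented lines at the bottom use the method names the way I’ve been writing them in this post, not necessarily the real Nutch API:

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Locale;
    import java.util.TimeZone;

    public class LastModifiedParser {
        // RFC 1123 date format used by the HTTP Last-Modified header.
        private static final SimpleDateFormat HTTP_DATE =
                new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        static {
            HTTP_DATE.setTimeZone(TimeZone.getTimeZone("GMT"));
        }

        /** Returns the Last-Modified time in millis since the epoch, or 0 if missing or unparseable. */
        public static synchronized long parse(String lastModifiedHeader) {
            // synchronized because SimpleDateFormat isn't thread-safe
            if (lastModifiedHeader == null) {
                return 0L;
            }
            try {
                return HTTP_DATE.parse(lastModifiedHeader).getTime();
            } catch (ParseException e) {
                return 0L;
            }
        }
    }

    // Conceptually, in the fetcher:
    //   long modified = LastModifiedParser.parse(headers.get("Last-Modified"));
    //   if (modified > 0) { datum.setModifiedDate(modified); }   // method name as I've been using it above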

Fourth discovery, and one I can’t blame on Nutch: My Rochester Flying Club pages use shtml (server-parsed HTML) so that I could include a standard header and navigation bar in each page. I could have used a perl script to insert the header into the pages automatically and regenerate them whenever anything changed, but this seemed a lot easier at the time. But there’s one consequence I’d never noticed before – the server doesn’t send a “Last-Modified” header for server-parsed pages, so evidently these pages are never cached by any browser or crawler. Oops.

Working Again!

I probably should have mentioned this earlier, but I’ve held back because

  1. It doesn’t feel completely real in some ways and
  2. I’m under NDA and don’t want to say too much

It’s a Java job, working with Nutch, Lucene and Solr. It’s a one-person start-up, and the owner expects she’ll need a Chief Technical Officer soon, so this contract I’m doing is sort of an audition for the CTO job. She has an existing code base that she paid a consulting company to write, but she has a long list of things she needs done to it, and I’m making my own list as I go along. (Unit tests and fixing the things FindBugs finds top my list, as does fixing the horribly manual deployment process.) The current contract is fixed price, but I should be able to do it quickly enough to make it a decent hourly wage. I’m currently working on my own at home, but she’s going to have a two-person office in a week or so at a local technology incubator. I told her that as long as it has wifi, a big whiteboard and access to a fridge, I’d be happy. As an added bonus, the technology incubator is pretty close to where Vicki works, so we’ll be able to meet for lunch or go to the gym together.

Because I’m working on my own but expecting to work with others in the future, the first thing I did with her existing source code was to put it into “git”. (I haven’t found any decent Eclipse plug-ins for git, but maybe that matters less because git isn’t as obtrusive as something like ClearCase, where you have to check things out.) As well as learning git, Nutch, Lucene and Solr (not to mention the technologies they depend on, like Hadoop), I’m also learning a bit about making a build environment in ant.

All in all, it’s fun and interesting work, and whether the company fizzles out or grows to a hundred employees, it’s going to be a great learning and growth opportunity for me. And hell, even if it sucked as much as my last job (which it doesn’t) it would be better than unemployment.