Update: Somebody on the Nutch mailing list pointed me towards the config option “fetcher.threads.per.host”. Increasing that to 10 dropped the time from 45 minutes to 15 minutes on the first crawl and 2 minutes for a re-crawl. Since I fixed Nutch to properly respect the Last-Modified header and If-Modified-Since, I don’t think I’m going to be blocked from crawling sites with multiple threads. Much less worrisome.
Time spent to copy all the files on three small web sites to a directory on my machine using wget: 1 minute 1.114 seconds.
Time spent for Nutch to re-crawl those same web sites: 45 minutes.
It doesn’t seem to matter what I put in the “number of threads” parameter to Nutch, either – it takes 45 minutes if I give it 10 threads or 125 threads.
Even worse for Nutch, out of the box it refetches documents even if they haven’t changed – I had to find and fix a bug to make that part work – but wget does the right thing.
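For anyone wondering what "the right thing" looks like on the wire, here is a minimal sketch of a conditional fetch using only the Python standard library. It is not Nutch's code or my patch; the function names `conditional_request` and `fetch_if_modified` are made up for illustration, and it assumes the server honours `If-Modified-Since` by answering 304 for unchanged documents.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def conditional_request(url, last_fetch_epoch):
    """Build a GET request asking the server to skip unchanged documents.

    last_fetch_epoch is the time of the previous fetch, in seconds since
    the Unix epoch (hypothetical bookkeeping kept by the crawler).
    """
    request = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT.
    request.add_header("If-Modified-Since",
                       formatdate(last_fetch_epoch, usegmt=True))
    return request


def fetch_if_modified(url, last_fetch_epoch):
    """Return (body, last_modified), or (None, None) on 304 Not Modified."""
    request = conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "nothing to do".
        if err.code == 304:
            return None, None
        raise
```

A re-crawl built this way only downloads bodies for pages that changed since the last visit, which is presumably a large part of why the second pass can be so much faster than the first.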
Considering that all I’m doing with the Nutch crawl is going through the returned files one by one, doing some analysis, and putting the results in a Solr index, I wonder if I should toss Nutch entirely and just work up something using wget. All I’m really getting out of Nutch is pre-parsing of the HTML to extract some metadata.
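To give a rough idea of how little the "work up something using wget" option might need, here is a sketch that walks a directory of mirrored pages and pulls out the title and `<meta name=...>` tags. It assumes the mirror contains `.html` files in UTF-8; the names `MetaExtractor`, `extract_document` and `documents_from_mirror` are hypothetical, and the step of sending the resulting dictionaries to Solr is deliberately left out.

```python
from html.parser import HTMLParser
from pathlib import Path


class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Lowercase the name so "Description" and "description" match.
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_document(html_text):
    """Return a flat dict of metadata for one page."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


def documents_from_mirror(root):
    """Yield one metadata dict per .html file under the mirror directory."""
    for path in sorted(Path(root).rglob("*.html")):
        doc = extract_document(path.read_text(encoding="utf-8", errors="replace"))
        doc["id"] = str(path)  # the file path stands in for a document id
        yield doc
```

Each yielded dictionary could then be handed to whatever analysis and indexing code already exists; real-world pages with odd encodings or broken markup would need more care than this.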
Too bad I’ve already spent 3 weeks on this contract going down the Nutch road. At this point, it would be too time-consuming to throw away everything I have and start afresh.
I had similar issues with Nutch a few years ago on a project as well. I ended up dumping it and using Lucene/HttpClient/Quartz for crawling, with RAM/FS Directory merging to speed up indexing of websites. However, over the last 2 years I’ve been using Solr with MUCH success, so it may be worth doing what you already said you might do: dump the Nutch solution and start fresh. 3 weeks is not too much of a loss, and in the long run it may be worth it! 🙂
> too time consuming to throw away everything I have and start afresh.
A quote: “In this business you __have__ to be ready to throw everything away and start anew at any given time.” I can’t remember today who said it.
Luckily, my boss does not always understand my work, so I can (sometimes) continue my old evil ways without getting fired.