I’ve been working a lot with Nutch, the open source web crawler and indexer, and the first thing I found was that it was downloading web pages every day, instead of sending the “If-Modified-Since” header and only downloading ones that changed. Ok, I thought, I’ll fix that – since the information I want isn’t in the “datum.getModificationDate()”, I’ll use “datum.getFetchDate()”.
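For anyone who hasn't played with conditional requests, here is roughly what I mean. This is a minimal sketch in plain Java, not Nutch's code: the `ConditionalGet` class and its method names are made up for illustration, and it assumes the stored timestamp is epoch milliseconds with 0 meaning "unknown".

```java
import java.net.HttpURLConnection;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Nutch.
final class ConditionalGet {
    // HTTP dates are always in GMT, with English day and month names
    // and a two-digit day of month.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    // Returns the If-Modified-Since value, or empty when we have no
    // usable timestamp (assumption: 0 or negative means "unknown").
    static Optional<String> ifModifiedSince(long lastModifiedMillis) {
        if (lastModifiedMillis <= 0) {
            return Optional.empty();
        }
        return Optional.of(HTTP_DATE.format(Instant.ofEpochMilli(lastModifiedMillis)));
    }

    // Adds the header to a not-yet-connected request, if we have a date.
    static void apply(HttpURLConnection conn, long lastModifiedMillis) {
        ifModifiedSince(lastModifiedMillis)
            .ifPresent(value -> conn.setRequestProperty("If-Modified-Since", value));
    }

    // 304 Not Modified is the server saying "your copy is still good".
    static boolean notModified(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```

The point of returning an empty `Optional` is that a page with no known date gets a plain, unconditional request instead of a header with a bogus date in it.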
Second interesting discovery: Nutch then doesn’t index pages that returned 304 (Not Modified), and since the index merging code doesn’t seem to work, I can’t search these pages that I cleverly managed to avoid downloading. Ok, I’ll fix IndexMapReduce and delete the code with the comment that says “// don’t index unmodified (empty) pages”, and resist the urge to send a cock-punch-over-ip to whoever wrote that comment for not realizing that “unmodified” does not mean “empty” by any stretch of the imagination.
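To spell out the distinction I'm ranting about, here is a sketch of how I'd expect an indexer to treat the different fetch outcomes. Everything in it is hypothetical (the `IndexPolicy` class, the enums, the choice of status codes); it is not what IndexMapReduce does, just the behaviour I was hoping for.

```java
// Hypothetical names throughout; this is not Nutch's IndexMapReduce.
final class IndexPolicy {
    enum FetchResult { NEW_CONTENT, NOT_MODIFIED, GONE, OTHER }

    enum IndexAction { INDEX_NEW_CONTENT, KEEP_EXISTING_ENTRY, REMOVE_ENTRY, LEAVE_ALONE }

    // Map an HTTP status code to what the fetch actually told us.
    static FetchResult classify(int httpStatus) {
        switch (httpStatus) {
            case 200: return FetchResult.NEW_CONTENT;
            case 304: return FetchResult.NOT_MODIFIED;
            case 404:
            case 410: return FetchResult.GONE;
            default:  return FetchResult.OTHER;
        }
    }

    // "Unmodified" means the previously indexed copy is still valid,
    // so it stays in the index; it does not mean the page is empty.
    static IndexAction actionFor(FetchResult result) {
        switch (result) {
            case NEW_CONTENT:  return IndexAction.INDEX_NEW_CONTENT;
            case NOT_MODIFIED: return IndexAction.KEEP_EXISTING_ENTRY;
            case GONE:         return IndexAction.REMOVE_ENTRY;
            default:           return IndexAction.LEAVE_ALONE;
        }
    }
}
```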
Third interesting discovery: It turns out that some bright spark decided that when you’re crawling a page that’s never been loaded before, “datum.getFetchDate()” gets the current time, instead of any useful indication that it’s never been fetched before. So scratch my first fix, and go looking for why datum.getModifiedDate() isn’t set. And discover that it appears that datum.setModifiedDate() is never called except by code trying to force things to be recrawled. Yes, instead of forcing a new crawl by modifying the locally generated “fetch date”, they fuck around with the “modified date”, which is supposed to come originally from the server. My opinion of the quality of this crawler code is rapidly going downhill. But my patch to set the modification date according to the page’s metadata appears to be working. Sort of.
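The idea behind the patch is simply to take the modification date from what the server says, and to record "unknown" when it says nothing. A rough sketch of that, assuming the date arrives as a standard "Last-Modified" response header; the `LastModifiedParser` class is my own invention here and not the actual patch.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

// Hypothetical helper, not the actual Nutch patch.
final class LastModifiedParser {
    // Turns a Last-Modified header value into epoch milliseconds.
    // Returns empty when the header is missing or unparseable, so the
    // caller can tell "server gave no date" apart from a real date.
    static OptionalLong parse(String headerValue) {
        if (headerValue == null || headerValue.isBlank()) {
            return OptionalLong.empty();
        }
        try {
            long millis = ZonedDateTime
                .parse(headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant()
                .toEpochMilli();
            return OptionalLong.of(millis);
        } catch (DateTimeParseException e) {
            return OptionalLong.empty();
        }
    }
}
```

Keeping "no date" as an explicit empty value, rather than quietly substituting the current time, is exactly what I wish the fetch date did.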
Fourth discovery, and one I can’t blame on Nutch: My Rochester Flying Club pages use shtml (Server Parsed HTML) so that I could include a standard header and navigation bar in each page. I could have used a Perl script to automatically insert the header into the pages and regenerate them whenever anything changed, but this seemed a lot easier at the time. But one consequence that I’d never noticed before – the server doesn’t send a “Last-Modified” header with these pages, so evidently they are never cached by any browser or crawler. Oops.