This week’s interesting discoveries about Nutch

I’ve been working a lot with Nutch, the open source web crawler and indexer, and the first thing I found was that it was downloading web pages every day, instead of sending the “If-Modified-Since” header and only downloading ones that changed. Ok, I thought, I’ll fix that – since the information I want isn’t in the “datum.getModificationDate()”, I’ll use “datum.getFetchDate()”.

Second interesting discovery: Nutch then doesn’t index pages that returned 304 (Not Modified), and since the index merging code doesn’t seem to work, I can’t search these pages that I cleverly managed to avoid downloading. Ok, I’ll fix IndexMapReduce and delete the code with the comment that says “// don’t index unmodified (empty) pages”, and resist the urge to send a cock-punch-over-ip to whoever wrote that comment for not realizing that “unmodified” does not mean “empty” by any stretch of the imagination.

Third interesting discovery: It turns out that some bright spark decided that when you’re crawling a page that’s never been loaded before, “datum.getFetchDate()” returns the current time, instead of any useful indication that it’s never been fetched before. So scratch my first fix, and go looking for why datum.getModifiedDate() isn’t set. And discover that it appears that datum.setModifiedDate() is never called except by code trying to force things to be recrawled. Yes, instead of forcing a new crawl by modifying the locally generated “fetch date”, they fuck around with the “modified date”, which is supposed to come originally from the server. My opinion of the quality of this crawler code is rapidly going downhill. But my patch to set the modification date according to the page’s metadata appears to be working. Sort of.
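
Here is a minimal sketch of the behaviour described above. It is not Nutch's actual code. The class name, the `UNKNOWN` sentinel (0 meaning "no modification time recorded") and the `Outcome` enum are all assumptions made up for illustration. The idea is to send “If-Modified-Since” only when a server-supplied modification time is known, and to treat a 304 response as "unchanged, keep indexing it" rather than as an empty page.

```java
import java.net.HttpURLConnection;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;

final class ConditionalFetch {
    // Hypothetical sentinel: 0 means the server never told us a modification time.
    static final long UNKNOWN = 0L;

    // HTTP dates use a fixed English format with a two-digit day, always in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    enum Outcome { FETCHED, UNMODIFIED, FAILED }

    // Value for the If-Modified-Since header, or empty when the page has no
    // recorded modification time (so the first fetch is unconditional).
    static Optional<String> ifModifiedSince(long modifiedMillis) {
        if (modifiedMillis <= UNKNOWN) {
            return Optional.empty();
        }
        return Optional.of(HTTP_DATE.format(Instant.ofEpochMilli(modifiedMillis)));
    }

    // Map an HTTP status code to what the crawler should conclude.
    static Outcome classify(int status) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return Outcome.UNMODIFIED;   // 304: our stored copy is still good
        }
        if (status >= 200 && status < 300) {
            return Outcome.FETCHED;      // new content arrived
        }
        return Outcome.FAILED;
    }

    // An unmodified page still has content (the stored copy), so it stays in the index.
    static boolean shouldIndex(Outcome outcome) {
        return outcome != Outcome.FAILED;
    }
}
```

In use, a fetcher would call `ifModifiedSince(...)` with the stored modification time, set the header if a value comes back, and pass the response status through `classify(...)` before deciding what to index.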

Fourth discovery, and one I can’t blame on Nutch: My Rochester Flying Club pages use shtml (Server Parsed HTML) so that I could include a standard header and navigation bar in each page. I could have used a Perl script to automatically insert the header into the pages and regenerate them whenever anything changed, but this seemed a lot easier at the time. But one consequence that I’d never noticed before – the server doesn’t send a “Last-Modified” date in the response headers, so evidently these pages are never cached by any browser or crawler. Oops.

How not to drum up business

There is a business here in Rochester that needs a lesson in how to do business. I’m not going to give them the exposure (or Google rank) of putting their name here, but their name sounds a little like “Cock fire”. The business they are in is something that is actually of interest to me, something I currently use, and something for which I recently solicited quotes from numerous companies by going to a site that collects your requirements and sends them to registered providers. It’s also a business that members of a Linux Users Group, such as our own Linux Users Group of Rochester (LUGOR), might be more likely than the general public to want to do business with.

But “Cock fire”, instead of waiting for requests for quotes, or introducing themselves to the LUGOR group as a peer or contributor, decided to somehow mine our mailing list for email addresses, and then individually spammed the members of the list. When I got mine, I actually thought it was somehow related to my earlier request for quotes, until I realized that they’d sent it to both of the email addresses I’ve subscribed to the mailing list, not just the one I’d used in the RFQ. And then somebody else on the list mentioned that they’d gotten this spam to an address that they *only* use for the LUGOR list, and several other members piped up that they’d also gotten spammed, so we figured out what they’d done.

So well done, “Cock fire”. In spite of the fact that your product is actually $10 a month cheaper than what I currently pay your competitor, I’m not going to switch my existing use over, and neither am I going to recommend you to my current employer. Reap what you sow, assholes.

Update: I got a response from the email I sent them.

On behalf of [Cock fire] I would like to formally apologize for the e-mail marketing to your group. I was given 2 lists of e-mails from a Rochester Linux guru that said their group would be interested in Rochester based services. From the number of negative responses I have gotten back this was a mistake.

We have deleted all LUGOR e-mails and will not be in future communication. Please convey our apology to the group.

So it wasn’t his fault that they decided to spam, it was the fault of somebody who tempted him into it by giving him a list of email addresses. Oh, then that makes it all ok then? I don’t think so.

Obviously some sort of scam, but what?

We just got a voice mail welcoming us to a new service at Frontier. The call came from 570-631-4560, and the message said to call 888-791-9198. Obviously I was suspicious because we didn’t have any new service, and so I checked on-line and none of those numbers agree with anything that Frontier normally uses. I called Frontier’s advertised customer support number, and sure enough there was no change on our account and nothing to welcome us to. So obviously the point of the scam was to get you to reveal details of your phone account to some nefarious third party, but I wonder why?

I don’t have to ask if people are stupid enough to call somebody they think is the phone company, and not get suspicious when they don’t know your name and address based on your phone number, because there is ample evidence of just how stupid people are all over the internet.

If you get a call from these scammers and you’re smart enough to not call them back without first googling it, I’m hoping that by posting it here somebody will find this warning. Good luck.

Do me a favour?

Update: Steve Robbins has modified his widget to use JSON, and I’ve gone back to using it because it works right at any text size.

The Stack Overflow team is sending me emails saying that my use of the Robbins Stack Overflow widget on my blog is putting an “unacceptable” load on their huge 48 GB RAM, 8-processor box. So I’ve switched over to their preferred solution, which is an iframe containing their own “flair” page. The problem with the iframe option is that it requires me to tell my system exactly how many pixels high and wide it is – and when I change my text size, it starts putting scroll bars on it, and it looks like ass.

Can you please look at my Stack Overflow badge on the right side of my blog, and leave me a comment telling me yes or no if it has scroll bars for you? If you know, tell me what OS/browser and font size you’re using.

That’s not good

On Thursday, I did some extensive work on the positioning of the rudder pedals on my kayak. I drilled some new holes in the boat, eliminated the extraneous rails, and moved the new pedal rails in order to reduce the way the rudder pedal wire guides have been gouging holes in my lower legs and to raise the pedals so that I can push better without pushing the rudder back and forth.

Today I went out to try to paddle the course of the Armond Basset Race, which takes place next weekend at the Genesee Waterways Center here in town. All I know for sure is that they say it’s “2 laps of 5 miles”, but Mike F says he thinks it’s 1.25 miles upstream in the river, 2.5 miles downstream, 2.5 miles upstream, 2.5 miles downstream, and then 1.25 miles upstream again. That makes sense, because it means that each boat passes the start/finish area 5 times, which would be good for any spectators – on the down side it means that we’re finishing upstream, and the stream is running pretty fast. So Mike and I tried to paddle it, but after 4 miles or so, my hips were really starting to hurt, so I gave up after 1 loop (5 miles). Maybe I went too fast trying to keep up with Mike (who is normally a much faster paddler than I am), but I think it’s mostly because of the new leg position. I’m really hoping that things are better by next Saturday.

I’m looking at my GPS track from today versus the Tupper Lake race, and my heart rate was much lower today. So I couldn’t have been working as hard. But man that current was strong – I was averaging 5.0 mph going up, and 7.4 mph going down.