How did Google find that?

Google has a blog post describing how they set up some fake search results and how, a short time later, Bing started returning the same fake results. From that they suspect that IE8’s “Suggested Sites” and/or Bing’s “Customer Experience Improvement Program” is spying on what you click and sending the results off to Microsoft.

But before Google gets all high and mighty, I want to tell you about what happened to me. I wrote some documentation for a customer I was doing some work for. I did it in the form of a TiddlyWiki and stuck it up on a brand new, never-before-used subdomain of my main domain. Well, she hated it and asked that I do it as a Word document instead, which I did. But I forgot to take the wiki down. No problem, I thought; after all, nothing links to it or mentions it in any public place, so how would a crawler find it?

Imagine my surprise when the customer called me up some time later saying that this old version of the documentation, in a subdirectory on an unlinked site, was showing up in Google searches for her product’s name. How did that happen? Using the advanced search, I couldn’t find anything that linked to it. There was one mention of that domain in a forum post, but in that case I had included the :8080 port because I was referring to the Tomcat server that was also running on that domain.

So as I see it, the choices are:

  • Google saw the mention of the domain in the middle of a forum post, recognized it as a URL (it wasn’t a link), stripped out the :8080 and crawled the site.
  • They saw the URL in a link I sent to the customer via Gmail and used that as an excuse to crawl the site.
  • IE reported the link to Bing when the customer clicked on it, and then Google stole it from Bing somehow.
  • Chrome reported the link to Google when I clicked on it.

Whichever it was, they’re crawling things that aren’t public links. Methinks Google doth protest too much.

That was an unexpected bonus

Vicki and I are refinancing our house. Mostly this is to get a shorter time period, but also to reduce the interest a bit. We kind of dithered about this and missed the best rates, but the one we got isn’t too bad. That’s not what I’m writing about here though.

Because the 2009 tax year was a little light (because I spent some time unemployed and some time working for peanuts), I dug out both our 2008 and 2009 tax files. And while I was flipping through them, something caught my eye: an amount on the 2008 tax form that said something like “Capital loss carried forward to 2009”. Oh oh, I don’t remember carrying forward any amount. Sure enough, I couldn’t find any mention of it on the 2009 tax form. I also couldn’t remember where this capital loss had come from.

I looked back and discovered it had to do with the sale of Vicki’s mom’s house back in 2006. I’d been using TurboTax in 2006, 2007 and 2008, and it had carried the loss forward without my even noticing. (When you have a large capital loss, you can only claim $3,000 each year and carry the rest forward.) But for 2009, I’d switched to H&R Block online for reasons I don’t remember. And because I’d forgotten about the carry-forward, I hadn’t applied it.
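To make the carry-forward arithmetic concrete, here is a minimal sketch of how a large loss gets used up at $3,000 a year. It is a simplification: it assumes there are no capital gains in any year to absorb the loss first, and the $14,000 figure is purely hypothetical, not the real number from our return.

```python
ANNUAL_LIMIT = 3000  # the per-year cap mentioned above


def carryforward_schedule(total_loss, first_year, annual_limit=ANNUAL_LIMIT):
    """Return a list of (year, deducted, remaining) tuples.

    Simplified model: no offsetting gains, same limit every year.
    """
    schedule = []
    remaining = total_loss
    year = first_year
    while remaining > 0:
        deducted = min(remaining, annual_limit)
        remaining -= deducted
        schedule.append((year, deducted, remaining))
        year += 1
    return schedule


# Hypothetical $14,000 loss realized in 2006 (an invented figure).
schedule = carryforward_schedule(14000, 2006)
for year, deducted, remaining in schedule:
    print(year, deducted, remaining)
```

In this made-up case the 2009 row still has a deduction on it, which is exactly the kind of line that goes missing if the software doesn't know about the prior years.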

So today I downloaded the 1040X and IT-201X amended tax forms and filled them in manually, because H&R Block online doesn’t give you any way to go back to last year’s return and amend it. (Something which you can do with TurboTax, I happen to know, because I amended the 2006 return using it.) It turns out we’re due a nice little chunk of change. So that’s a nice little surprise consequence of our refinancing efforts.

One side note: both the IRS and New York State provide handy downloadable PDFs that you can edit and print. But for some bizarre reason that I can’t fathom, the IRS one can be saved but the New York one can’t. Can somebody explain the logic behind not allowing you to save?

Today’s discovery about Google Chrome

If you use Google Chrome as your web browser, right click on the address bar, and choose “Edit Search Engines”. You’ll discover that every web site you’ve ever been to with a search box, including this blog, is installed as a “Search Engine”. And you can search that site by typing the domain name (like blog.xcski.com) followed by a space followed by your search terms. (I haven’t tested whether “Clear Browsing History” clears this list as well, but if it doesn’t, that might be a surprise if you think you’ve covered your tracks.)

But another interesting use of this is that you can change the shortcut. If I double-click on the entry for Wikipedia and change the “Keyword” from “en.wikipedia.org” to “wiki”, I can search Wikipedia by typing command-L to highlight the current address in the address bar, then typing “wiki Stephen Fry” and hitting return, which takes me directly to the Wikipedia page about Stephen Fry.
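As far as I can tell, each entry is just a keyword paired with a URL template in which “%s” stands for whatever you type after the keyword. Here is a rough sketch of that expansion. It is not Chrome’s actual code, and both templates in the table are my assumptions about what the stored entries might look like.

```python
from urllib.parse import quote_plus

# Hypothetical keyword table: keyword -> URL template with %s as the query slot.
ENGINES = {
    "wiki": "https://en.wikipedia.org/wiki/Special:Search?search=%s",
    "blog.xcski.com": "https://blog.xcski.com/?s=%s",  # assumed search URL
}


def expand(address_bar_text, engines=ENGINES):
    """Turn 'keyword search terms' into a URL, or None if no keyword matches."""
    keyword, _, terms = address_bar_text.partition(" ")
    template = engines.get(keyword)
    if template is None or not terms:
        return None
    # URL-encode the terms and drop them into the template.
    return template.replace("%s", quote_plus(terms))


print(expand("wiki Stephen Fry"))
```

Renaming the keyword in the dialog amounts to changing the key in that table; the template stays the same.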

Lifehacker has an article about some other ways you can use this Search Engine capability to be able to do things like enter a Google Calendar event from the address bar.

Surgery scheduled

I’ve got my shoulder surgery scheduled for February 3rd. The doctor says that if things are good inside the shoulder, I could be looking at 1 week in the sling, and only a month or so recovery, but if things are as bad as they were for Vicki, it could be 3 to 4 weeks in a sling, and up to 6 months of recovery. So there is a slight chance I might be racing (although not as well prepared as I was this year) by the end of the season, although I’m shelving plans for the 90 even if things go perfectly.

And in related news: The Onion.

Interesting, but I’m not sure I understand it all

Namebench is a program that analyses DNS lookups to see if your DNS settings are optimal. My results are here. They recommend that I use my ISP’s DNS server, but they also show the main reason I stopped using it: that innocuous “NXDOMAIN Hijacking” notation beside the entry for that DNS server means that if you mistype a domain name, it takes you to your ISP’s search page instead of having your browser tell you that you mistyped a domain name. I HATE that, “with the power of a thousand fiery suns” as Vicki would put it, because it breaks things, usually in ways too subtle for ordinary users to notice.

I run a DNS server on my Linux box (on 192.168.1.2) because it won’t do “NXDOMAIN Hijacking”, and also because I believed it would be faster. One other reason for running my own DNS server is so I can reach computers on my home network by system name rather than by IP address, something that was probably more important when I had multiple Linux boxes that I needed to ssh into than it is now, when I basically only connect to my Linux box and (more rarely) my MacBook Pro.
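If you want to check for NXDOMAIN hijacking yourself, one rough approach is to look up a name that almost certainly doesn’t exist and see whether you get an address back anyway. The sketch below does that through whatever resolver your machine is configured to use. The random “.com” name is my assumption of something unregistered, and I don’t know whether this is how Namebench does its own check.

```python
import socket
import uuid


def looks_hijacked(resolve=socket.gethostbyname):
    """Return True if a name that should not exist still resolves.

    `resolve` is injectable so the check can be exercised without a network.
    """
    # Assumption: a random 32-hex-digit label under .com is not registered.
    bogus = f"www.{uuid.uuid4().hex}.com"
    try:
        resolve(bogus)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False  # lookup failed, which is the honest answer
    return True  # got an address for a nonsense name


if __name__ == "__main__":
    print("NXDOMAIN hijacking suspected:", looks_hijacked())
```

A single lookup isn’t proof either way, but a resolver that hands back an address for a nonsense name is the behaviour I’m complaining about.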

If you look down the page to the graph “Response Distribution Chart”, it shows that for the first 30% of the responses, my home DNS server is *way* faster than the competition – I guess that means that things that it’s already seen and cached, it returns at the speed of the local network. But the graph trails off pretty quickly, and by the time you reach 50% of the responses, it’s slower than most of the other ones – I don’t know why it would be slower than “Internal 192-1-1”, which is the DNS cache on my router, but I suspect that’s because the router will just ask my ISP’s DNS server when it doesn’t know something rather than reaching out to the broader internet.
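Here is how I read that chart, as a small sketch: sort each server’s response times and ask what the time is at a given percentage of the way through. The numbers below are invented purely to show the shape (a few near-instant cache hits, then slow misses); they are not my Namebench results.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Invented response times in milliseconds, for illustration only.
local_cache = [1, 1, 1, 1, 120, 150, 180, 200, 250, 300]  # hits, then misses
remote_server = [30, 32, 35, 36, 38, 40, 42, 45, 50, 60]  # steady but never instant

for pct in (30, 50):
    print(pct, percentile(local_cache, pct), percentile(remote_server, pct))
```

With these made-up numbers the local server wins easily at 30% and loses at 50%, which is the same crossover the chart shows.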

What I should do now, I think, is set up a DNS server on my colo box and see how it compares.