How did Google find that?

Google has a blog post describing how they set up some fake search results and how, a short time later, Bing started returning the same fake results. From this they conclude that IE8’s “Suggested Sites” and/or Bing’s “Customer Experience Improvement Program” is spying on what you click and sending the results off to Microsoft.

But before Google gets all high and mighty, I want to tell you about what happened to me. I wrote some documentation for a customer I was doing some work for. I did it in the form of a TiddlyWiki and stuck it up on a brand new, never-before-used subdomain of my main domain. Well, she hated it and asked that I do it as a Word document instead, which I did. But I forgot to take the wiki down. No problem, I thought; after all, nothing links to it or mentions it in any public place, so how would a crawler find it?

Imagine my surprise when the customer called me up some time later to say that this old version of the documentation, in a subdirectory on an unlinked site, was showing up in Google searches for her product’s name. How did that happen? Using the advanced search, I couldn’t find anything that linked to it. There was one mention of that domain in a forum post, but there I had written it with the :8080 port, because I was referring to the Tomcat server that was also running on that domain.

So as I see it, the possibilities are:

  • Google saw the mention of the domain in the middle of a forum post, recognized it as a URL (it wasn’t a link), stripped off the :8080 and crawled the site
  • Google saw the URL in a link I sent to the customer through Gmail and used that as an excuse to crawl the site
  • IE reported the link to Bing when the customer clicked on it, and then Google stole it from Bing somehow
  • Chrome reported the link to Google when I clicked on it
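
The first possibility is the easiest one to picture in code. What follows is only a rough sketch of how a crawler might pull a bare host name out of plain text and throw away the port; I have no idea what Google actually runs. The function name `candidate_roots`, the regular expression, and the `docs.example.com` host are all made up for illustration.

```python
import re

# Hypothetical pattern: an optional scheme, a dotted host name,
# an optional :port, and an optional path. Not Google's real logic.
URL_PATTERN = re.compile(
    r"\b(?:https?://)?((?:[a-z0-9-]+\.)+[a-z]{2,})(?::\d+)?(?:/\S*)?",
    re.IGNORECASE,
)


def candidate_roots(text):
    """Return the port-less site roots for every host mentioned in text."""
    roots = []
    for match in URL_PATTERN.finditer(text):
        host = match.group(1).lower()   # group 1 is the host only, no port
        root = f"http://{host}/"        # assume plain HTTP on the default port
        if root not in roots:
            roots.append(root)
    return roots


post = "My Tomcat is at docs.example.com:8080/manager if you want to look"
print(candidate_roots(post))
```

Feed it a forum post that mentions a host with :8080 on the end and it hands back the plain site root, which is all a crawler would need to start fetching pages.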

Whichever it was, they’re crawling things that were never publicly linked. Methinks Google doth protest too much.