Checking the stats for my WP pages, I see this:

I’ve included MSNBot there for comparison. Note the amount of data and the last visit.
Now look what is in my robots.txt:
User-agent: Googlebot
Disallow: /
Now let’s check what the Google help pages say:
“robots.txt is a standard document that can tell Googlebot not to download some or all information from your web server.” ( http://www.google.com/bot.html#robots )
This site and this site BOTH say my robots.txt is valid. So what gives? Google say they obey it – and would you believe that they are lying?
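For anyone who wants to check the same thing locally rather than through an online validator, here is a minimal sketch using Python's standard urllib.robotparser. It only shows how one standards-following parser reads the two lines quoted above; it says nothing about what Googlebot itself does. The is_allowed helper and the example.com URL are placeholders made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted in the post above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder, not the real site.
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/index.php")
msnbot_ok = is_allowed(ROBOTS_TXT, "msnbot", "http://example.com/index.php")

print("Googlebot allowed:", googlebot_ok)
print("msnbot allowed:", msnbot_ok)
```

Under this parser, those two lines shut out Googlebot and leave every other bot alone, which is the intent described in the post.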
Wow, that’s very wrong. Mail the bastards. Or won’t they listen?
I’ve noticed the same. I have robots.txt set to disallow all robots access to the cover images from my library, yet I still find them listed on Google.
My full robots.txt
Their docs say the bot obeys the FIRST rule it finds specific to it.
Obviously it does not.
I will be asking them what’s up, yes.
I’ve submitted a question asking for an explanation.
If you are the person from Google looking at this, you can post here with the answer if you want.
this is mine:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /scgi-bin/
Disallow: /old/
Disallow: /new/
Disallow: /backup/
Disallow: /_images/
Disallow: /webalizer/
Disallow: /willoway/
Disallow: /stuff/
Disallow: /images/
and I think that keeps googlebot at bay.
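A quick way to see what a file like that actually covers is to run it through Python's standard urllib.robotparser. This is only a sketch: the may_fetch helper, the example.com host and the sample paths are invented for illustration, and the parser is Python's, not Googlebot's.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted in the comment above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /scgi-bin/
Disallow: /old/
Disallow: /new/
Disallow: /backup/
Disallow: /_images/
Disallow: /webalizer/
Disallow: /willoway/
Disallow: /stuff/
Disallow: /images/
"""


def may_fetch(user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch the given path on a placeholder host."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://example.com" + path)


# Hypothetical paths, chosen only to exercise the rules.
front_page_ok = may_fetch("Googlebot", "/")
image_ok = may_fetch("Googlebot", "/images/cover.jpg")

print("front page allowed:", front_page_ok)
print("image allowed:", image_ok)
```

Read this way, the file keeps every bot out of the listed directories but leaves the rest of the site open to all of them, Googlebot included.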
From the help pages:
In my robots.txt, the longest does not apply, the second is as specific as you can get.
Robots.txt is created to allow flexibility and that is an option I wish to exercise. MSNBot is welcome here, Googlebot is not.
I want the Google person to tell me how to exclude their bot because even though I am following what I think are the rules, Googlebot is disobeying them.
If I have not heard a decent reply in a few days I will post this over at Webmasterworld and other such forums to both publicise it and get more information.
Fact is that even when Google did crawl my site, they refused to return me in results. They’ve taken 100 MB of my data – go search for ‘tamba2’ – you will not find a single direct link. Not one. Hardly fair, is it?
Well, I’m not a native English speaker, but to me, your last quote says
So, basically, Googlebot sees:
which is very specific but two lines long, and then it finds this:
And that rule includes Googlebot and is way longer than the first one, so it’s picked…
At least that’s what I think… 😉
Mark, you were just owned by TigerDE2. 😛
I think (s)he’s correct.
TigerDE2 – that could well be true…
And this leads to what made me ban Googlebot in the first place – go to Google and look for “Mark tamba2 wordpress”. Now if Googlebot has been taking my data, why will it not return that data in a search?
I’m going to look again.
Now that’s just plain idiotic. Googlebot should obey any rules set to “User-agent: Googlebot” and any rules set to “User-agent: *”, not just the longest. Basically, Google’s help pages say two things:
1. “Use robots.txt to block the Googlebot.”
2. “The Googlebot will not obey the standard syntax of robots.txt file.”
This reminds me of the average American:
1. The average American will not vote for the President that’s best for him/her.
2. He/she will vote for the President with the biggest smile.
Why do you ban Googlebot in the first place? Is it because you weren’t getting high ranking as you seem to indicate in comments?
I think you should let them know. They do respond fairly quickly.
I had a PR of 6 or 7 – I forget.
I pointed this post out to them on the day I made it. I have heard nothing from them.
Oh, Mark has plenty of words to say on that subject. ^_^
http://www.romanticrobot.net/archives/2005/03/23/google-steals/
http://www.romanticrobot.net/archives/2005/03/27/more-on-google/
http://www.romanticrobot.net/archives/2005/04/24/google-screws/