Google ? Lying ? Noooo……

Checking the stats for my WP pages, I see this:

I’ve included MSNBot there for comparison. Note the amount of data and the last visit.

Now look what is in my robots.txt:

User-agent: Googlebot
Disallow: /

Now let’s check what the Google help pages say:
“robots.txt is a standard document that can tell Googlebot not to download some or all information from your web server.” ( http://www.google.com/bot.html#robots )

This site and this site BOTH say my robots.txt is valid. So what gives ? Google say they obey it – and would you believe that they are lying ?

14 thoughts on “Google ? Lying ? Noooo……”

  1. My full robots.txt

    User-agent: HenryTheMiragoRobot
    Disallow: /

    User-agent: Googlebot
    Disallow: /

    User-agent: *
    Disallow: /gallery
    Disallow: /games
    Disallow: /images
    Disallow: /nota
    Disallow: /stats
    Disallow: /upb
    Disallow: /getout.php

    Their docs say the bot obeys the FIRST rule it finds specific to it.
    Obviously it does not.

    I will be asking them what’s up, yes.
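
For anyone who wants to check the file locally, here is a minimal sketch using Python's standard `urllib.robotparser`, which follows the original exclusion standard (the first record whose User-agent matches is the one applied). The host `example.com` and the sample paths are placeholders, not the real site, and this only shows how a standard-following parser reads the file, not what Googlebot actually does.

```python
from urllib.robotparser import RobotFileParser

# The full robots.txt quoted in the comment above.
ROBOTS_TXT = """\
User-agent: HenryTheMiragoRobot
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /gallery
Disallow: /games
Disallow: /images
Disallow: /nota
Disallow: /stats
Disallow: /upb
Disallow: /getout.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://example.com"  # placeholder host (assumption)

# Ask the parser whether each bot may fetch each sample path.
results = {
    (agent, path): parser.can_fetch(agent, BASE + path)
    for agent in ("Googlebot", "msnbot")
    for path in ("/", "/gallery/photo.jpg")
}

for (agent, path), allowed in results.items():
    print(f"{agent:10} {path:20} {'allowed' if allowed else 'blocked'}")
```

Read this way, Googlebot is blocked everywhere, while msnbot falls through to the `*` record and is only kept out of the listed directories.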

  2. From the help pages:

    The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule.

    In my robots.txt, the longest does not apply, the second is as specific as you can get.
    Robots.txt is created to allow flexibility and that is an option I wish to exercise. MSNBot is welcome here, Googlebot is not.

    I want the Google person to tell me how to exclude their bot because even though I am following what I think are the rules, Googlebot is disobeying them.
    If I have not heard a decent reply in a few days I will post this over at Webmasterworld and other such forums to both publicise it and get more information.

    The fact is that even when Google did crawl my site, they refused to return it in results. They’ve taken 100 MB of my data – go search for ‘tamba2’ – you will not find a single direct link. Not one. Hardly fair, is it ?

  3. Well, I’m not a native English speaker, but to me, your last quote says

    Googlebot is taking the longest rule applicable it can find.

    So, basically, Googlebot sees:

    User-agent: Googlebot
    Disallow: /

    which is very specific but two lines long, and then it finds this:

    User-agent: *
    Disallow: /gallery
    Disallow: /games
    Disallow: /images
    Disallow: /nota
    Disallow: /stats
    Disallow: /upb
    Disallow: /getout.php

    And that rule includes Googlebot and is way longer than the first one, so it’s picked…
    At least that’s what I think… :wink:
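
Another reading of "longest applicable rule" is that length means the length of the matching path, not the number of lines in a record. The sketch below works under that assumption: it first picks the record that names the bot (falling back to `*`), then lets the longest matching path in that record decide. The function names and the `/public` demo rule are made up for illustration, and none of this is confirmed as Googlebot's real logic.

```python
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /gallery
Disallow: /stats
"""

def parse_groups(text):
    """Split robots.txt text into (agents, rules) records."""
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new record
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups

def select_group(groups, agent):
    """Prefer a record naming the agent; otherwise use the '*' record."""
    agent = agent.lower()
    fallback = []
    for agents, rules in groups:
        if any(name and name != "*" and name in agent for name in agents):
            return rules
        if "*" in agents:
            fallback = rules
    return fallback

def is_allowed(rules, path):
    """Longest matching path wins; on a tie, allow beats disallow."""
    matches = [(len(p), kind) for kind, p in rules if p and path.startswith(p)]
    if not matches:
        return True  # no rule applies, so the path is allowed
    _, kind = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"

groups = parse_groups(ROBOTS_TXT)
googlebot_root = is_allowed(select_group(groups, "Googlebot"), "/")
msnbot_root = is_allowed(select_group(groups, "msnbot"), "/")
msnbot_stats = is_allowed(select_group(groups, "msnbot"), "/stats/index.html")

# Hypothetical record showing where path length matters.
demo_rules = [("disallow", "/"), ("allow", "/public")]
demo_public = is_allowed(demo_rules, "/public/page.html")
demo_private = is_allowed(demo_rules, "/private/page.html")
```

Under this reading the Googlebot record would still block the whole site, because the `*` record is never consulted once a record names the bot.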

  4. TigerDE2 – that could well be true ….
    And this leads to what made me ban Googlebot in the first place – go to Google and look for “Mark tamba2 wordpress”. Now if Googlebot has been taking my data, why will it not return that data in a search ?

    I’m going to look again.

  5. The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule.

    Now that’s just plain idiotic. Googlebot should obey any rules set to “User-agent: Googlebot” and any rules set to “User-agent: *”, not just the longest. Basically, Google’s help pages say two things:

    1. “Use robots.txt to block the Googlebot.”

    2. “The Googlebot will not obey the standard syntax of robots.txt file.”

    This reminds me of the average American:

    1. The average American will not vote for the President that’s best for him/her.

    2. He/she will vote for the President with the biggest smile.
