Google respond … pt#1

I’m not sure I understand this:

Thank you for your note. Please be assured that our robots do obey robots.txt files. We’d like to investigate further, but you have blocked us from being able to access your robots.txt file. Please note that when our robots see a 403 forbidden error, they interpret this to mean that the site is safe to crawl. For more information on this, please visit http://searchenginewatch.com/sereport/article.php/2164941

So a 403 means what exactly ?

Also, the full robots.txt is actually in the comments to the post they were told to look at. But they did not.

14 thoughts on “Google respond … pt#1”

  1. True, it is.
    Reading that article though – and bear in mind I’ve not seen it before and it is certainly not mentioned in G’s help pages – googlebot sees a site ban and promptly ignores it. It then looks for robots.txt.
    If in robots.txt there is no reference to G-bot, it will crawl.

    So I ban g-bot with a .htaccess rule. I knowingly use this rule. I do it because I want to ban g-bot. Thinking that g-bot is covered, I write a robots.txt that does not refer to it – after all, it’s banned, right ? But along comes g-bot, sees a 403 and ignores it.
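
    For reference, the sort of rule I mean – a minimal sketch of a user-agent ban in an Apache .htaccess (my actual rule may differ):

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
    RewriteRule .* - [F]

    The [F] flag is what sends back the 403 Forbidden that g-bot then reads as “safe to crawl”.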

    There’s a page in G’s help pages that says they see web pages as a browser such as L.ynx does. L.ynx does not work round .htaccess bans.

    Isn’t this wrong ?

    (L.ynx spelled like that to let this comment post. I know the real spelling)

  2. Yes, that is disturbing. I can understand the behavior from a programmer’s point of view tho. It’s like “Look for robots.txt… hmmm, get a 403, ok, I can’t get a robots.txt to process, priority is to scrape sites so… assume site is safe to crawl as there’s no information to the contrary.”

    That’s dumb, but in some respects logical.
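
    Sketched as code, that logic would look something like this – purely my guess at it, not Google’s actual crawler:

    # A guess at the decision logic described above – not Google’s real code.
    from urllib.request import urlopen
    from urllib.error import HTTPError

    def may_crawl(site):
        try:
            robots = urlopen(site + "/robots.txt").read().decode("utf-8", "replace")
        except HTTPError:
            # Got a 403 (or any HTTP error): no robots.txt to process,
            # priority is to scrape, so assume the site is safe to crawl.
            return True
        # Only a readable robots.txt with an explicit full ban stops the bot.
        return "Disallow: /" not in (line.strip() for line in robots.splitlines())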

    It seems that the only way to disallow GoogleBot would be to let your robots.txt be readable and give Googlebot a “Disallow: /” rule.

    Annoying tho…

  3. The annoyance extends though.
    Look at my robots.txt:

    User-agent: HenryTheMiragoRobot
    Disallow: /

    User-agent: Googlebot
    Disallow: /

    User-agent: *
    Disallow: /gallery
    Disallow: /games
    Disallow: /images
    Disallow: /nota
    Disallow: /stats
    Disallow: /upb
    Disallow: /getout.php

    So not only have I banned the bot, I have also told it specifically not to go anywhere. But it clearly went anyway, because it somehow got hold of 100 meg of data.

    This isn’t now so much about Google using code to annoy me; it’s about how we DO keep a private company from crawling our sites when on the one hand they state they obey robots.txt, yet on the other say that robots.txt is a standard, not a rule.

    It seems to me that if anyone else can crawl, they will too. I just wish they would clearly explain their actions. The explanation about “we can’t do this, we can’t do that” is frankly crap. Any and all Google pages are instantly PR 10 and during that WP episode, the WP PR was slashed, then restored. That is human intervention and Google is code – and people control code.

    Check this too …. I hit it last night:
    http://www.tamba2.org.uk/google.png

  4. I thought their point was, tho, that as you’ve banned them from your site (with a 403) they can’t read the robots.txt to know it’s disallowed? Or did I misinterpret?

    And yeah… they take action when they feel motivated to. You are totally right: people control code.

    re that image: interesting, very interesting. I assume you double-checked for spyware? Just in case they were correct?

  5. I read it as they’ve seen the 403 and “they interpret this to mean that the site is safe to crawl” and THEN they see the robots.txt which they also ignore – after all, if googlebot is banned, it gives the competition an edge.

    And that page ? Nah … I was playing with anonymity programs ;)

  6. From Google:

    Thank you for your reply. Once again, our robots do obey robots.txt files. However, our robots only check for a robots.txt file once per day when crawling your site. It’s possible that our robots had not yet noted your robots.txt file when you saw the accesses on your server.

    If you’d like to continue preventing our robots from crawling your site, you can continue to include the following in your robots.txt file:

    User-agent: Googlebot
    Disallow: /

    Okaaaay…..but that was already there. So how did it grab 100 megabytes of data ?
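
    If you want to tally the damage from your own logs, something like this run against a standard Apache combined-format access log would total the bytes sent to Googlebot (a rough sketch – the log path and format are assumptions, so adjust for your server):

    # Rough sketch: sum bytes served to Googlebot from an Apache
    # combined-format access log. The path is an assumption.
    total = 0
    with open("/var/log/apache/access.log") as log:
        for line in log:
            if "Googlebot" in line:
                fields = line.split('"')[2].split()  # status and bytes follow the request
                if len(fields) > 1 and fields[1].isdigit():
                    total += int(fields[1])
    print("Googlebot took %.1f MB" % (total / 1048576.0))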

  7. I’m actually going back to my original problem.
    I banned Google because they were crawling my site but not returning me in any results. Google now have exactly the same access as MSNBot and Inktomi – so I wait to see if they return me too. Position in results is unimportant – just being there would make a change.
