As a reminder, John Mueller is Google’s Senior Webmaster Trends Analyst.

So, John Mueller’s robots.txt file has drawn a lot of attention on Twitter and LinkedIn in recent days, sparking plenty of discussion among webmasters because of the strange crawler directives it contains and its unusually large size.

It all started when someone on Reddit claimed that Google employee John Mueller’s blog had been hit by the Helpful Content system and subsequently deindexed. The truth turned out to be less dramatic, but still a bit odd.

After the Reddit post, word spread on Twitter that John Mueller’s site had been deindexed, with many assuming it had fallen foul of a Google algorithm.

One look at the site’s robots.txt was enough to see that something strange was going on:

  • The first rule, something you don’t see every day, is a disallow for robots.txt itself. Who uses their robots.txt to tell crawlers not to fetch robots.txt? Now we know who (a short sketch of how a standards-compliant parser reads such rules follows this list).
  • The next part of the file blocks all search engines from crawling the site and the robots.txt file. This probably explains why the site was deindexed in Google, but it doesn’t explain why it was still indexed in Bing (perhaps Bing simply ignores the robots.txt file).
  • One of the directories blocked by Mueller’s robots.txt is /nofollow/ (a strange name for a folder). There is almost nothing on this page except site navigation and the word Redirector.
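To make the effect of such rules concrete, here is a minimal Python sketch using the standard library’s urllib.robotparser; the rules are a hypothetical reconstruction in the spirit of those described above, not the actual contents of johnmu.com/robots.txt.

    from urllib.robotparser import RobotFileParser

    # Hypothetical rules modelled on the ones described above
    # (not the real contents of johnmu.com/robots.txt).
    rules = [
        "user-agent: *",
        "disallow: /robots.txt",
        "disallow: /nofollow/",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # A crawler that follows the Robots Exclusion Protocol would skip
    # both robots.txt itself and the /nofollow/ directory.
    print(parser.can_fetch("Googlebot", "https://example.com/robots.txt"))  # False
    print(parser.can_fetch("Googlebot", "https://example.com/nofollow/"))   # False
    print(parser.can_fetch("Googlebot", "https://example.com/blog/"))       # True

Of course, a crawler still has to fetch robots.txt to learn the rules in the first place; as Mueller explains below, the disallow only keeps the file from being crawled for indexing purposes.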

John Mueller, who seems amused that his robots.txt is getting so much attention, posted an explanation on LinkedIn: “disallow: /robots.txt – does it make the robots go round and round? Does it deindex your site? No.

It’s just that my robots.txt file has a lot of stuff in it, and it’s cleaner if it doesn’t get indexed along with its content. This simply blocks the robots.txt file from being crawled for indexing purposes. I could also use the X-Robots-Tag HTTP header with noindex, but this way I also have it in the robots.txt file.”
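For context, the alternative he mentions would mean serving robots.txt with an X-Robots-Tag: noindex response header. The following is a minimal, purely illustrative Python sketch of a server doing that; it is not how johnmu.com is actually configured.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    ROBOTS_TXT = b"user-agent: *\ndisallow: /nofollow/\n"  # placeholder rules

    class RobotsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/robots.txt":
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                # The header Mueller refers to: it asks search engines not to
                # index the robots.txt response itself.
                self.send_header("X-Robots-Tag", "noindex")
                self.end_headers()
                self.wfile.write(ROBOTS_TXT)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), RobotsHandler).serve_forever()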

He said that the nofollow in robots.txt was just to keep that page from being indexed as an HTML file, and that he added a disallow at the top of that section hoping it would be treated as a blanket disallow, although it’s not entirely clear which rule he means. There are exactly 22,433 disallow rules in his robots.txt file.

Mueller also said the following about the file size:

“The size comes from tests of various robots.txt testing tools that my team and I have been working on. The RFC says a crawler must support at least 500 kibibytes. You have to stop somewhere; pages could be made infinitely long (and I have, and many people have, some even on purpose). What happens in practice is that the system that checks the robots.txt file (the parser) cuts off somewhere.”
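To picture that cut-off, here is a tiny, hypothetical Python sketch of a crawler truncating an oversized robots.txt at the RFC 9309 minimum before parsing it; real crawlers choose their own limits and behaviour.

    MAX_ROBOTS_BYTES = 500 * 1024  # RFC 9309: parsers must handle at least 500 KiB

    def truncate_robots(raw: bytes) -> list[str]:
        """Keep only the part of an oversized robots.txt that fits the limit.
        Rules beyond the cut-off point are simply never seen."""
        return raw[:MAX_ROBOTS_BYTES].decode("utf-8", errors="replace").splitlines()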

The mystery of the deindexed site was also solved. It turned out that Mueller’s site didn’t drop out of the index because of robots.txt: he removed it himself through Search Console, in what he described as an experiment that didn’t quite go to plan. Mueller explained what he did in a follow-up post on LinkedIn:

“I used the Search Console tool to try something out. I’ll probably recover quickly if I hit the right button :-). Google has humans.txt. I don’t know what to put in mine. Do you have one?”

As of this writing, Mueller’s site is back in the search results, and it returned rather quickly.

And there you have it: the strange robots.txt of John Mueller, Google’s “chief experimenter”, can be seen at this link: https://johnmu.com/robots.txt.

Source: Search Engine Journal
