
After 25 Years, There Can Finally Be an Official Standard for Using Robots.txt, Thanks to Google

Robots.txt files first appeared around 25 years ago, but an official internet standard for the files' rules was never established. The rules were defined in the Robots Exclusion Protocol (REP), which is still considered an unofficial standard.

Although search engines have endorsed REP over the last 25 years, developers often interpret its rules in their own way because it is unofficial. Over time, it has also become outdated, failing to cover many of today's use cases.

Even Google has admitted that the ambiguity of the de facto standard makes it difficult for website owners to write their rules correctly.

The tech giant has now proposed a solution by documenting how the REP should be applied on the modern web, and it has submitted the resulting draft to the Internet Engineering Task Force (IETF) for evaluation.

According to Google, the draft reflects extensive real-world experience with robots.txt rules, drawing on Googlebot, various other crawlers, and over half a billion websites that depend on REP. These rules give website publishers the power to decide what they would like to be crawled on their site and, by extension, what can be shown to interested users.
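
For reference, a robots.txt file is simply a plain-text list of per-crawler rules placed at the root of a site. A minimal illustrative example might look like the following; the agents and paths are placeholders, not taken from any real site:

```
# Illustrative robots.txt; agents and paths are placeholders
User-agent: *
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /drafts/
```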

It should be noted that the draft doesn't change the originally defined rules; it only updates them to suit the modern web.

The updated rules include (but are not limited to):

  1. Robots.txt is no longer limited to HTTP and can now be used with any URI-based transfer protocol, such as FTP or CoAP.
  2. Developers must parse at least the first 500 kibibytes of a robots.txt file, which sets a practical upper bound on how much of the file crawlers need to process (the sketch after this list shows how a crawler might enforce this limit).
  3. To give website owners flexibility in updating their robots.txt, a maximum caching time of 24 hours applies, or the value of a cache directive if one is available, so crawlers don't overload sites with robots.txt requests.
  4. When server failures make a robots.txt file inaccessible, known disallowed pages are not crawled for a reasonably long period of time.
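
The rules above map naturally onto crawler code. Below is a minimal, hypothetical Python sketch (not Google's implementation) of how a crawler might apply the 500 KiB parsing limit, the 24-hour cache, and the conservative handling of server failures; the class name and structure are invented for illustration.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

MAX_BYTES = 500 * 1024        # parse at most the first 500 kibibytes
CACHE_SECONDS = 24 * 60 * 60  # maximum caching time of 24 hours


class RobotsCache:
    """Hypothetical robots.txt cache applying the draft's proposed limits."""

    def __init__(self):
        self._cache = {}  # host -> (parser, fetched_at_timestamp)

    def _fetch(self, robots_url):
        # Read no more than 500 KiB; anything beyond that is ignored.
        with urllib.request.urlopen(robots_url, timeout=10) as resp:
            body = resp.read(MAX_BYTES)
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(body.decode("utf-8", errors="replace").splitlines())
        return parser

    def allowed(self, user_agent, url):
        host = urllib.parse.urlsplit(url).netloc
        robots_url = f"https://{host}/robots.txt"
        cached = self._cache.get(host)

        # Reuse a cached copy for up to 24 hours before re-fetching.
        if cached and time.time() - cached[1] < CACHE_SECONDS:
            return cached[0].can_fetch(user_agent, url)

        try:
            parser = self._fetch(robots_url)
            self._cache[host] = (parser, time.time())
            return parser.can_fetch(user_agent, url)
        except urllib.error.URLError:
            # Server failure: fall back to the stale copy so pages that were
            # known to be disallowed stay uncrawled; with no copy at all,
            # stay conservative and skip the page.
            if cached:
                return cached[0].can_fetch(user_agent, url)
            return False
```

A production crawler would also distinguish 4xx responses (commonly treated as permission to crawl) from 5xx server failures; the sketch collapses both into the conservative fallback branch for brevity.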

Google says it will do its best to make this standard official and, to that end, is open to suggestions regarding the proposed document.