A few hours ago, Google Webmasters announced via Twitter that after 25 years as a de-facto standard, the Robots Exclusion Protocol (REP) is being standardized. The team intended to relieve web developers and site owners from puzzling over the question ‘How do I control web crawlers?’. They worked with the protocol’s original author, Martijn Koster, to document how REP is used on the modern web and submitted the draft to the Internet Engineering Task Force (IETF).
Some 25 years ago, in 1994, after crawlers began overwhelming his sites, Martijn Koster, with input from other webmasters, created the Robots Exclusion Protocol to help website owners manage their server resources more easily. But the REP never became an official internet standard, nor was it ever updated to cover today’s use cases, so different web developers came to interpret the protocol differently.
The new REP draft doesn’t change the rules Martijn Koster created in 1994 but rather defines the scenarios for robots.txt parsing and matching that had remained undefined. According to Google, the new REP will help standardize robots.txt for the modern web. Highlights of the draft include:
- Any URI-based transfer protocol can use robots.txt; it is no longer limited to HTTP. For instance, it can be used over FTP or CoAP.
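Because the rules themselves are transfer-protocol-agnostic, a fetched robots.txt file can be parsed and applied regardless of how it was retrieved. A minimal sketch using Python’s standard-library `urllib.robotparser`, with a hypothetical robots.txt body and bot name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the parsing and matching rules are
# the same whether this was fetched over HTTP, FTP, or another
# URI-based protocol.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(parser.can_fetch("MyBot", "https://example.com/public/page"))   # True
```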
- Developers must parse at least the first 500 kibibytes of a robots.txt file. Defining a maximum file size ensures connections are not kept open for too long, avoiding unnecessary strain on servers.
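In practice a crawler can simply cap the number of bytes it reads from the response stream. A sketch of that idea (the function name and the in-memory stream standing in for a network response are illustrative, not from the spec):

```python
import io

MAX_ROBOTS_BYTES = 500 * 1024  # 500 kibibytes, per the draft

def read_robots(stream):
    """Read at most the first 500 KiB of a robots.txt stream.

    `stream` is any binary file-like object; bytes beyond the limit
    are simply ignored.
    """
    data = stream.read(MAX_ROBOTS_BYTES)
    return data.decode("utf-8", errors="replace")

# Usage: a 600+ KiB body gets truncated to the 500 KiB cap.
body = b"User-agent: *\nDisallow: /tmp/\n" + b"#" * (600 * 1024)
text = read_robots(io.BytesIO(body))
print(len(text.encode("utf-8")))  # 512000
```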
- A new maximum caching time of 24 hours, or the value of a cache directive if one is available, gives website owners the flexibility to update their robots.txt whenever they want. Over HTTP, for instance, Cache-Control headers can determine the caching time.
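A crawler implementing this might prefer the `max-age` value from a Cache-Control header and fall back to the 24-hour default when no directive is present. A minimal sketch (function name and the regex-based parsing are assumptions, not a full HTTP cache implementation):

```python
import re
from typing import Optional

DEFAULT_TTL = 24 * 60 * 60  # the draft's 24-hour default, in seconds

def robots_cache_ttl(cache_control: Optional[str]) -> int:
    """Return how long, in seconds, to cache a robots.txt response.

    Uses the Cache-Control max-age directive when present; otherwise
    falls back to the 24-hour default.
    """
    if cache_control:
        match = re.search(r"max-age=(\d+)", cache_control)
        if match:
            return int(match.group(1))
    return DEFAULT_TTL

print(robots_cache_ttl("max-age=3600"))  # 3600 (one hour, from the header)
print(robots_cache_ttl(None))            # 86400 (24-hour default)
```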
- The specification says that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
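One way a crawler could honor this is to keep the last successfully fetched rules and continue applying them during a grace period when fetches fail, rather than assuming everything is crawlable. The class below is purely illustrative; its names and the one-week grace period are assumptions, not values from the spec:

```python
import time

class RobotsCache:
    """Sketch of the draft's failure handling: if a previously fetched
    robots.txt becomes unreachable, keep honoring the last known
    disallow rules for a "reasonably long" window."""

    def __init__(self, grace_period=7 * 24 * 3600):
        self.rules = None           # last successfully parsed rules
        self.fetched_at = None      # when those rules were fetched
        self.grace_period = grace_period

    def update(self, rules):
        """Record rules from a successful robots.txt fetch."""
        self.rules = rules
        self.fetched_at = time.time()

    def on_fetch_failure(self):
        """Called when fetching robots.txt fails with a server error.

        Returns the cached rules if still within the grace period,
        or None if the crawler has no usable rules."""
        if self.rules is not None and time.time() - self.fetched_at < self.grace_period:
            return self.rules
        return None

# Usage: disallow rules survive a later fetch failure.
cache = RobotsCache()
cache.update({"/private/"})
print(cache.on_fetch_failure())  # {'/private/'}
```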