Google’s Gary Illyes and Martin Splitt quietly did something that no one in the SEO industry had done before: they ran a custom parser across the robots.txt file of millions of real websites and looked at what directives people actually use.
What began as a simple GitHub pull request grew into a comprehensive data study aimed at better aligning Search Console with actual webmaster behavior.
The team used a custom JavaScript parser to mimic Google’s official C++ logic, allowing them to document real-world usage patterns of robots.txt at scale.
The results were discussed in Episode 108 of Google’s Search Off the Record podcast.
What 16 Million Robots.txt Files Actually Look Like
Three directives dominate everything else. The distribution of directives found across millions of robots.txt files shows an almost vertical drop-off after the three most common entries: allow, disallow, and user-agent. Even plotted on a logarithmic scale, the gap between those three and everything else is stark. For the vast majority of the web, robots.txt is essentially just those three directives in various combinations.
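A file built only from those three dominant directives might look like the following. This is a generic illustration, not a file drawn from the dataset:

```
# Rules for all crawlers
User-agent: *
Disallow: /admin/
Allow: /admin/public/
```

The wildcard user-agent applies both rules to every crawler, with Allow carving an exception out of the broader Disallow pattern.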
Most sites return a valid robots.txt — but 13% don’t. Of all the URLs in the crawl set, 84.9% return a 200 status code for their robots.txt file. 13% return a 404, meaning no robots.txt exists at all. Timeouts, 403s, and 500 errors each account for less than 1%.
File sizes are small. The overwhelming majority of robots.txt files fall between 0 and 100 kilobytes. There is no practical case for a large, complex file.
The wildcard user-agent dominates. The * user-agent, which applies rules to all crawlers, is by far the most commonly used. It appears across a large share of all robots.txt files in the dataset, confirming that most site owners write blanket rules rather than crawler-specific ones.
Googlebot is named far less often than you might expect. AdsBot-Google appears as a named user-agent in 9.8% of files. Googlebot by name appears in only 6.2%. Most sites that want to control Googlebot’s behavior are doing it through the wildcard.
Broken files are common. The parser also surfaced a significant number of robots.txt files that are not valid — HTML pages with CSS returned instead of a plain-text directive file, typically because the server has no robots.txt and is returning a 404 page with a 200 status. These show up in the data as lines containing tags like padding and img.
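Detecting those broken files can be sketched with a simple heuristic: if the start of the response body contains HTML or CSS markers, it is almost certainly an error page, not a directive file. This is a hypothetical sketch, not Google’s parser; the markers below mirror the tags mentioned in the episode.

```python
# Heuristic check for robots.txt responses that are actually HTML pages
# served with a 200 status. Markers are illustrative, not exhaustive.
HTML_MARKERS = ("<html", "<!doctype", "<img", "<div", "{")  # "{" catches inline CSS

def looks_like_html(body: str) -> bool:
    """Return True if a supposed robots.txt body looks like an HTML page."""
    head = body.lstrip().lower()[:2048]  # inspect only the start of the file
    return any(marker in head for marker in HTML_MARKERS)

print(looks_like_html("User-agent: *\nDisallow: /private/"))  # False
print(looks_like_html("<!DOCTYPE html><html><body><img src='x'>"))  # True
```

A real pipeline would also check the Content-Type header, but even this body-only test catches the 404-page-served-as-200 cases the team described.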
Typos in disallow are a real pattern. The dataset makes it possible to identify common misspellings of the disallow directive. Gary noted he plans to expand Google’s typo-tolerance to account for the most frequent ones found in the data.
What Google Will Do with This
The direct outcome of the project is an expansion of Google’s Search Console documentation: the list of supported and unsupported robots.txt directives will be updated based on what the data shows people are actually using, rather than what was previously assumed.
Directives that appear rarely or not at all have less justification for documentation; those that appear frequently but are unsupported will be explicitly flagged.
The custom metric is now live in the HTTP Archive and will feed into this year’s Web Almanac SEO chapter, giving the broader SEO community access to a more granular view of robots.txt usage than has previously been available.
What This Means for Your robots.txt
The data makes a simple case: robots.txt, for most sites, should be simple. The three directives that cover virtually every real-world need are allow, disallow, and user-agent.
Anything beyond that is used by a small fraction of sites, and if it is not on Google’s supported list, Search Console is likely already flagging it as unrecognized.
If your robots.txt contains custom directives borrowed from a guide or plugin, it is worth auditing. The chances are high that those directives are doing nothing — and the data from 16 million pages now backs that up.
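A minimal audit can be scripted in a few lines: scan each line of the file for a directive name outside Google’s documented set. The supported set below reflects Google’s current documentation; the sample file is a hypothetical example.

```python
# Directives Google documents as supported in robots.txt.
SUPPORTED = {"user-agent", "allow", "disallow", "sitemap"}

def unsupported_directives(robots_txt: str) -> list[str]:
    """Return directive names Google does not document as supported."""
    flagged = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        directive = line.split(":", 1)[0].strip().lower()
        if directive and directive not in SUPPORTED:
            flagged.append(directive)
    return flagged

sample = """User-agent: *
Disallow: /private/
Crawl-delay: 10
Noindex: /old/
"""
print(unsupported_directives(sample))  # -> ['crawl-delay', 'noindex']
```

Both flagged directives here are real examples of unsupported lines: Google ignores Crawl-delay, and Noindex in robots.txt has never been a supported directive.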
Key Takeaways
Large-Scale Analysis: Google analyzed robots.txt files across 16 million URLs using the HTTP Archive, WebPageTest, and BigQuery.
Dominant Directives: The allow, disallow, and user-agent tags account for almost all usage, with a sharp decline in other directives.
Server Responses: About 84.9% of sites provide a valid 200 status for robots.txt, while 13% return a 404 error.
Bot Mentions: AdsBot-Google appears in 9.8% of files, whereas Googlebot is specifically named in only 6.2%.
Data Quality Issues: A frequent problem discovered was “broken” files, where standard HTML pages are incorrectly served as robots.txt.
Future Updates: These insights will be used to refresh Google Search Console documentation and will be featured in the 2025 Web Almanac.
Link Building: When delivering guest posts or backlink services, check that the selected site’s robots.txt isn’t blocking Google; otherwise the content will not be indexed and the effort to acquire links is wasted.
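That last check can be scripted with Python’s standard library: parse a site’s robots.txt and test whether Googlebot may fetch a given URL. The rules and URLs below are hypothetical examples.

```python
from urllib import robotparser

# Hypothetical robots.txt content; in practice you would call
# parser.set_url("https://example.com/robots.txt") and parser.read().
rules = """User-agent: Googlebot
Disallow: /guest-posts/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/guest-posts/my-article"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/my-article"))         # True
```

If can_fetch returns False for the page that is supposed to carry your link, the placement cannot be indexed and the link has no value.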
Dileep Thekkethil
Dileep Thekkethil is the Director of Marketing at Stan Ventures, where he applies over 15 years of SEO and digital marketing expertise to drive growth and authority. A former journalist with six years of experience, he combines strategic storytelling with technical know-how to help brands navigate the shift toward AI-driven search and generative engines. Dileep is a strong advocate for Google’s EEAT standards, regularly sharing real-world use cases and scenarios to demystify complex marketing trends. He is an avid gardener of tropical fruits, a motor enthusiast, and a dedicated caretaker of his pair of cockatiels.