
robots.txt Without Mythology: A Practical Guide That Actually Matches How Crawlers Behave

Published: 2026-05-24

by Editorial Team

Tags: seo, robots.txt, technical-seo, crawl-budget, rfc-9309, ai-crawlers

The One Thing robots.txt Is Not

The single most expensive misconception in technical SEO is the belief that robots.txt controls indexing. It does not. It controls crawling. A URL that is blocked by Disallow can still appear in Google's index — with no title, no snippet, no useful search performance — if Google learns about the URL from any other source: a backlink, a sitemap entry, an internal link from a page Google is allowed to crawl. The result is the dreaded "Indexed, though blocked by robots.txt" status in Search Console: a URL that is in the index but cannot be properly served, which is roughly the worst of both worlds.

If you want a URL out of the index, you have to let Google crawl it so it can read a noindex meta tag or an X-Robots-Tag response header. If you want a URL out of the crawl budget, robots.txt is the right tool. Confusing these two operations is the cause of a surprising fraction of "why is this URL still ranking?" tickets — and the fix is almost always to remove the robots.txt block, add a noindex tag, wait for re-crawl, and only then re-block in robots.txt to save the crawl budget once the page has dropped from the index.
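The de-indexing sequence above can be sketched in a few lines of Python. This is only an illustration: the names has_noindex and next_deindex_step are hypothetical, and the header check is deliberately simplified (it ignores user-agent-scoped values such as "googlebot: noindex").

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> style tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """True if the response carries noindex in X-Robots-Tag or a meta tag."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives


def next_deindex_step(blocked_by_robots, noindex_present, still_indexed):
    """Return the next action in the unblock, noindex, wait, re-block sequence."""
    if still_indexed and blocked_by_robots:
        return "unblock"           # the crawler must be able to read the noindex
    if still_indexed and not noindex_present:
        return "add_noindex"       # meta tag or X-Robots-Tag header
    if still_indexed:
        return "wait_for_recrawl"
    return "reblock"               # only now save crawl budget with Disallow
```

For a page that is both blocked and still indexed, next_deindex_step(True, False, True) returns "unblock", which is the step teams most often skip.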

Once that distinction is internalized, robots.txt becomes a precise, predictable tool. The rest of this article walks through how the matching actually works, which directives are real standards versus historical artifacts, and the workflow I use to validate a robots.txt before any deploy.

RFC 9309: The Spec Finally Got Written Down

For most of its life, robots.txt was a de facto standard: a 1994 proposal circulated on the www-talk mailing list that crawler operators implemented with their own interpretations. In 2019 Google open-sourced its reference parser. In 2022 the IETF formalized the behavior as RFC 9309. The spec captures three rules that matter:

1. Group selection by user-agent specificity. When a crawler reads robots.txt, it picks the group whose User-agent name is the most specific case-insensitive substring match for its own name. If multiple groups match, the longest declared User-agent wins. If none match, the wildcard User-agent: * group applies. If there is no wildcard either, every URL is allowed.
2. Longest-match-wins within the chosen group. The crawler collects every Allow and Disallow rule whose path matches the request URL. The rule with the most literal characters in its path wins. This is the surprising part: rule order does not matter. A Disallow declared at line 3 can be overridden by an Allow declared at line 20 simply because the Allow's path is longer.
3. Allow beats Disallow on ties. When an Allow and a Disallow rule match with the same literal length, Allow wins. Note that the canonical Disallow: /private/ plus Allow: /private/public/ pattern is not a tie: /private/public/anything is crawlable because the Allow path is longer, which is rule 2 at work. The tie-break only decides cases such as Disallow: /page against Allow: /page.
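The three rules above can be sketched in Python as follows. This is a minimal illustration of the behavior as this article describes it, not Google's parser: select_group and verdict are hypothetical names, the groups dictionary is an assumed pre-parsed form of the file, and paths are matched as plain prefixes with no wildcard handling.

```python
def select_group(groups, crawler_name):
    """Pick the rules for a crawler.

    groups maps each declared User-agent token to its list of rules.
    The longest declared token contained in the crawler name wins;
    otherwise the "*" group applies; otherwise there are no rules.
    """
    name = crawler_name.lower()
    matches = [ua for ua in groups if ua != "*" and ua.lower() in name]
    if matches:
        return groups[max(matches, key=len)]
    return groups.get("*", [])


def verdict(rules, path):
    """Return True if path may be crawled.

    rules is a list of ("allow" | "disallow", path_prefix) tuples.
    The longest matching prefix wins and Allow wins ties. Rule order
    in the list is irrelevant.
    """
    best = None  # (prefix length, is_allow)
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


groups = {
    "*": [("disallow", "/private/"), ("allow", "/private/public/")],
    "googlebot": [("disallow", "/tmp"), ("allow", "/tmp")],
    "googlebot-news": [("disallow", "/")],
}

# Longer Allow beats shorter Disallow for a crawler using the "*" group.
print(verdict(select_group(groups, "Bingbot"), "/private/public/page"))  # True
# Equal-length Allow and Disallow: Allow wins the tie.
print(verdict(select_group(groups, "Googlebot"), "/tmp/file"))           # True
# The most specific group is used, and it blocks everything.
print(verdict(select_group(groups, "Googlebot-News"), "/anything"))      # False
```

Comparing (length, is_allow) tuples keeps both the longest-match rule and the tie-break in a single comparison, since True sorts above False.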

The wildcards * (any sequence, including empty) and $ (end-of-URL anchor) are part of the spec. Disallow: /*.pdf$ blocks /report.pdf but not /report.pdf?download=1. Disallow: /api/ blocks /api/users but also /api/private/users. Trailing slashes matter: Disallow: /admin/ does not block /administrator, but Disallow: /admin does. These are the kinds of details where intuition fails and a tester is invaluable.
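One way to check the wildcard examples above is to translate each pattern into a regular expression. The sketch below assumes that approach; pattern_to_regex and matches are hypothetical helper names, and percent-encoding normalization is left out.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    "*" becomes "any sequence, including empty"; a trailing "$"
    anchors the pattern to the end of the URL. Everything else is
    matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    pieces = [re.escape(piece) for piece in body.split("*")]
    return re.compile(".*".join(pieces) + (r"\Z" if anchored else ""))


def matches(pattern, url_path):
    """True if the pattern matches url_path from its first character."""
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*.pdf$", "/report.pdf"))             # True
print(matches("/*.pdf$", "/report.pdf?download=1"))  # False
print(matches("/api/", "/api/private/users"))        # True
print(matches("/admin/", "/administrator"))          # False
print(matches("/admin", "/administrator"))           # True
```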

The Five Mistakes That Break Real robots.txt Files

Most robots.txt mistakes I see in production fall into one of five categories.

1. Paths without a leading slash. A rule like Disallow: admin/ is silently ignored by Google. The spec requires paths to start with / or *. The fix is one character, but the impact is total: the rule does nothing. The usual culprits are CMS templates that build rules from relative paths or strip leading slashes during string concatenation. The robots tester flags this as PATH_MISSING_LEADING_SLASH.

2. Rules outside any group. A Disallow or Allow that appears before any User-agent declaration is ignored. The fix is to add a User-agent: * line above the rules. The cause is usually a copy-paste from an example that included only the rule lines, or a programmatic generator that forgot the User-agent header for the wildcard group.

3. Relying on the longest-match algorithm without understanding it. The most common conversation I have in code review goes like this: "I added Disallow: /admin/ but Googlebot is still crawling /admin/public/page." The answer is almost always one of two things. Either the same group contains a longer Allow rule, such as Allow: /admin/public/, which outranks the shorter Disallow; or the Disallow was added to the User-agent: * group while Googlebot has its own group, in which case Googlebot never reads the wildcard rules at all. A generic Allow: / is not the cause, because its one-character path loses to /admin/. The tester surfaces this by showing the exact matched rule and its line number, which usually makes the misunderstanding obvious.

4. Using Noindex in robots.txt. Google unofficially honored a non-standard Noindex: directive in robots.txt for several years. It was never documented, and support was withdrawn on September 1, 2019. SEO articles from that era still reference it, and authors who learned robots.txt then sometimes write rules like Noindex: /private/ and trust that Google will drop the URL from the index. Google will not. The directive is ignored, the URL stays indexable (or worse, gets stuck in "Indexed, though blocked by robots.txt"), and the team has a false sense of security. The tester flags this as NOINDEX_IN_ROBOTS with a recommendation to use a meta tag or X-Robots-Tag instead.

5. Crawl-delay assumptions. Crawl-delay is honored by Bingbot and Yandex. Googlebot ignores it. Teams that need to throttle Googlebot specifically can no longer count on the crawl rate setting in Search Console, which Google retired in early 2024; if the server is genuinely overloaded, the remaining lever is to return HTTP 503 responses with a Retry-After header. The tester reports Crawl-delay as informational so you know the directive is applied, but only by a subset of crawlers.
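Four of the five mistakes above are mechanical enough to lint for. The following is a minimal sketch of such a check, not the tester's actual implementation: lint_robots is a hypothetical name, the severities are assumptions, and only PATH_MISSING_LEADING_SLASH and NOINDEX_IN_ROBOTS are codes mentioned in this article (the other two are invented for illustration). Mistake 3 is a precedence misunderstanding rather than a syntax problem, so it is not covered.

```python
def lint_robots(text):
    """Return (line_number, code, severity) findings for a robots.txt body."""
    findings = []
    in_group = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            in_group = True
        elif field in ("allow", "disallow"):
            if not in_group:
                # Rule appears before any User-agent line (hypothetical code).
                findings.append((number, "RULE_OUTSIDE_GROUP", "error"))
            if value and not value.startswith(("/", "*")):
                findings.append((number, "PATH_MISSING_LEADING_SLASH", "error"))
        elif field == "noindex":
            findings.append((number, "NOINDEX_IN_ROBOTS", "warning"))
        elif field == "crawl-delay":
            # Informational only: honored by some crawlers (hypothetical code).
            findings.append((number, "CRAWL_DELAY_LIMITED_SUPPORT", "info"))
    return findings


sample = """\
Disallow: /early/
User-agent: *
Disallow: admin/
Noindex: /private/
Crawl-delay: 10
Allow: /ok/  # fine
"""
for finding in lint_robots(sample):
    print(finding)
```

An empty Disallow: value is deliberately not flagged, since it is the conventional way to allow everything.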

AI Crawlers: The New Category

The newest category of crawlers in most teams' robots.txt audits is the AI bots. The major ones to know:

GPTBot. OpenAI's crawler for collecting training data. Documented and respects robots.txt. Block with a User-agent: GPTBot line followed by Disallow: /.
ChatGPT-User. OpenAI's crawler triggered by user prompts inside ChatGPT (the "browse" feature). Different from GPTBot in that it is not for training, it is for live retrieval.
Claude-Web. Anthropic's web fetcher. Used for live retrieval in Claude products.
PerplexityBot. Perplexity's crawler for its AI search product.
Bytespider, CCBot, Amazonbot, and others. Many of these were ignoring robots.txt as recently as 2023 but most now respect it after public pressure.

The decision of which AI bots to allow or block is a policy question, not a technical one. The technical part is verifying that the rule you wrote actually does what you think it does. The robots.txt tester's user-agent picker lets you switch between these bots and see how each one would interpret your file — useful when your rules use a wildcard User-agent: * with overrides for specific bots, because the most-specific-match rule means it is easy to get the precedence wrong.
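The precedence trap described above can be reproduced with Python's standard urllib.robotparser module. Treat this as a sketch of group selection only: the robots.txt content and the verdicts helper are hypothetical, and as far as I know the standard-library parser applies rules in file order rather than by longest match, so it is not a substitute for an RFC 9309 matcher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block GPTBot entirely, keep Claude-Web out of
# drafts, and keep everyone else out of /admin/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /drafts/

User-agent: *
Disallow: /admin/
"""


def verdicts(robots_txt, agents, url):
    """Map each user agent to True (allowed) or False (blocked) for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


print(verdicts(
    ROBOTS_TXT,
    ["GPTBot", "Claude-Web", "PerplexityBot"],
    "https://example.com/admin/settings",
))
# {'GPTBot': False, 'Claude-Web': True, 'PerplexityBot': False}
```

Claude-Web is allowed into /admin/ here because a bot with its own group does not inherit the wildcard group's rules. If that is not the intent, the Disallow: /admin/ line has to be repeated inside the Claude-Web group.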

The Sitemap Directive

The Sitemap directive is the one part of robots.txt that is not about crawling rules at all. It tells crawlers where to find your XML sitemaps. The rules are simple but worth getting right:

Each Sitemap line is independent; multiple lines are allowed and additive.
The URL must be absolute (the spec is explicit: relative URLs are not valid). Sitemap: /sitemap.xml is non-conformant; Sitemap: https://example.com/sitemap.xml is correct.
The Sitemap directive applies to the entire robots.txt regardless of which User-agent group it appears in. By convention it goes at the top or the bottom, outside any group.
Multiple sitemaps and sitemap indexes are supported. Large sites typically declare one sitemap index that points to many child sitemaps.
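The absolute-URL requirement above is easy to check mechanically. A minimal sketch, assuming a hypothetical extract_sitemaps helper and treating only http and https URLs with a host as absolute:

```python
from urllib.parse import urlparse


def extract_sitemaps(text):
    """Return (url, is_absolute) for every Sitemap line in a robots.txt body.

    Sitemap lines are collected regardless of which group they sit in,
    because the directive applies to the whole file.
    """
    results = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap":
            url = value.strip()
            parsed = urlparse(url)
            is_absolute = parsed.scheme in ("http", "https") and bool(parsed.netloc)
            results.append((url, is_absolute))
    return results


example = """\
Sitemap: https://example.com/sitemap.xml
User-agent: *
Disallow: /admin/
sitemap: /sitemap-news.xml
"""
for url, ok in extract_sitemaps(example):
    print(url, "ok" if ok else "not absolute")
```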

The tester extracts every Sitemap directive into its own tab, validates the absolute-URL requirement, and renders the URLs as clickable links so you can spot stale references during the same audit pass. Pairing the tester with the Sitemap Comparator on the same host catches most pre-deploy issues in one workflow.

A Workflow That Catches Mistakes Before They Ship

1. Generate or export the proposed robots.txt from your CMS, build pipeline, or hand-edited file. Paste it into the tester's left panel.
2. List the URLs you actually care about in the right panel: high-traffic pages, sensitive admin paths, gated content, anything you have changed recently. The tester does not crawl your site, so the inputs are entirely under your control.
3. Test against Googlebot first. Pick Googlebot in the user-agent dropdown, click Test, review the Verdicts tab. Every Blocked result should be intentional; every Allowed result for sensitive content should also be intentional.
4. Switch user agents to verify policy. Re-run with Bingbot, GPTBot, Claude-Web, PerplexityBot. If your policy says "block GPTBot, allow Claude-Web," the verdicts should reflect that. If they do not, your rules have a precedence bug.
5. Review the Findings tab. Any error means a rule is silently ignored. Any warning means the file is technically valid but using a deprecated or non-standard feature. Resolve both before shipping.
6. Cross-check the Sitemaps tab. Every declared Sitemap should be reachable and current. Stale references confuse crawlers and pollute Search Console reports.
7. Export verdicts as CSV. Attach to the deploy PR as the audit artifact. The next person reviewing this file will have a record of what was tested and what the expected behavior is.
8. Wire the parser into CI. The same parsing and matching logic is exposed by the open-source @anthropic-tools/tools-core package. A CI test that loads the production robots.txt and asserts the expected verdict for a fixed list of canary URLs turns robots.txt correctness into a regression-tested property.
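The canary idea in the last step might look like the sketch below. It does not use the package named above, whose API is not shown in this article; instead it assumes Python's standard urllib.robotparser as a stand-in matcher, which is adequate for simple prefix rules but does not implement longest-match precedence. The canary list, the file path, and the function names are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# (user agent, URL, expected_allowed): hypothetical canaries for illustration.
CANARIES = [
    ("Googlebot", "https://example.com/", True),
    ("Googlebot", "https://example.com/admin/login", False),
    ("GPTBot", "https://example.com/blog/post", False),
]


def check_canaries(robots_txt, canaries):
    """Return (agent, url, expected, actual) for every canary that fails."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for agent, url, expected in canaries:
        actual = parser.can_fetch(agent, url)
        if actual != expected:
            failures.append((agent, url, expected, actual))
    return failures


def run_ci_check(path="robots.txt"):
    """Load the file that will be deployed and fail loudly on any mismatch."""
    with open(path, encoding="utf-8") as handle:
        failures = check_canaries(handle.read(), CANARIES)
    if failures:
        raise SystemExit(f"robots.txt canary failures: {failures}")
```

Keeping the canaries as data rather than as individual assertions makes the list easy to review in the same PR that changes the robots.txt file.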

The Boundary: What robots.txt Cannot Do

Two final caveats before you ship:

robots.txt is not a security boundary. The file is publicly readable. Listing your admin paths in robots.txt is effectively publishing them. If you need access control, use HTTP authentication, IP allowlists, or an authenticated route guard — not Disallow rules.
Malicious crawlers ignore it. RFC 9309 is a polite-bot protocol. Scrapers, content thieves, and competitor data collectors will fetch your URLs regardless of what robots.txt says. If your traffic logs show fetches from a bot that should be blocked, the answer is server-level blocking (firewall, WAF, rate limit), not a stronger robots.txt rule.

The Bottom Line

robots.txt is a small file with outsized impact on crawl budget and AI training policy. The RFC 9309 algorithm is precise and testable, the common mistakes are well-known, and the cost of getting it wrong — wasted crawl budget, exposed admin paths, the wrong content used for AI training — is real. The Robots.txt Tester implements the same algorithm Googlebot uses, runs entirely in your browser, and turns a file you might otherwise hand-audit into a one-minute pre-deploy check. Paste your file, list the URLs you care about, pick the user-agent you want to verify, and see what your crawlers actually see.
