What robots.txt Actually Controls
robots.txt is a public, per-host text file at /robots.txt that tells well-behaved crawlers which URLs they may fetch and which they may not. It is a request, not an enforcement mechanism: malicious crawlers ignore it entirely, and compliance by well-behaved ones is voluntary, so it is never a security boundary. The single most common misconception about robots.txt is that it controls indexing. It does not. It controls crawling. A URL that is disallowed in robots.txt can still appear in Google's index if Google learns about the URL from other signals (backlinks, sitemap entries, internal links from pages it can crawl). The result is the "Indexed, though blocked by robots.txt" status in Search Console: a URL with no snippet, no title, and effectively zero search performance, but still in the index. If you want a URL out of the index, use a noindex meta tag or X-Robots-Tag header, and leave the URL crawlable, because a crawler that is blocked from fetching the page never sees the noindex. If you want a URL out of the crawl budget, use robots.txt. The two are different problems.
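As an illustration of the indexing side, here is a minimal sketch of a WSGI app that sends X-Robots-Tag: noindex for selected paths. The path prefix, the app, and the headers_for helper are hypothetical and exist only for this example; the one thing it relies on is that the noindex paths stay crawlable.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical prefixes to keep out of the index. They must NOT be
# disallowed in robots.txt, or crawlers never see the header.
NOINDEX_PREFIXES = ("/drafts/",)


def app(environ, start_response):
    """Serve a placeholder body, adding X-Robots-Tag where needed."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"ok\n"]


def headers_for(path):
    """Call the app in-process and return its response headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["headers"] = dict(headers)

    app(environ, start_response)
    return captured["headers"]
```

Calling headers_for("/drafts/post") should include the noindex header, while headers_for("/blog/post") should not.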
The RFC 9309 Matching Algorithm
The robots.txt spec was formalized as RFC 9309 in 2022, codifying the behavior Google had implemented for years. The algorithm is more nuanced than most teams realize:
- Group selection. The crawler scans every User-agent declaration and picks the group whose User-agent name is the most specific case-insensitive substring match for its own name. If no group matches, the wildcard User-agent: * group applies. If there is no wildcard either, every URL is allowed.
- Rule matching within the chosen group. Every Allow and Disallow rule whose path matches the request URL is collected. The path with the most literal characters wins; that is the "longest match" in RFC 9309 §2.2.2. The * wildcard matches any sequence of characters (including empty); the $ anchor pins the match to the end of the URL.
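- Tie-breaking. When an Allow and a Disallow rule match with the same literal length, Allow wins. For example, if Disallow: /page and Allow: /page both match a URL, the URL is crawlable. Note that the common pairing "Disallow: / + Allow: /public/" is not a tie: it produces the intuitive result that /public/anything is crawlable because /public/ is the longer match.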
The tester implements this algorithm exactly. The Verdicts tab shows the matched rule's line number so you can verify behavior against your authored file rather than against an opaque "allowed/blocked" output.
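The three steps listed above can be sketched in a few lines of Python. This is an illustrative reading of the algorithm, not the tester's own code: the function names and the (agents, rules) data shape are assumptions, and rule length is measured as the full pattern length, where another implementation might count only literal characters.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a start-anchored regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def select_group(groups, crawler_name):
    """Pick the rules of the group with the longest matching agent token.

    groups is a list of (agents, rules) pairs. Falls back to the '*'
    group, then to an empty rule list (everything allowed).
    """
    name = crawler_name.lower()
    best, best_len = None, -1
    for agents, rules in groups:
        for agent in agents:
            token = agent.lower()
            if token != "*" and token in name and len(token) > best_len:
                best, best_len = rules, len(token)
    if best is not None:
        return best
    for agents, rules in groups:
        if "*" in agents:
            return rules
    return []


def is_allowed(rules, path):
    """Longest matching pattern wins; Allow wins ties; no match allows."""
    best_len, verdict = -1, True
    for kind, pattern in rules:
        if not pattern:  # an empty Allow/Disallow restricts nothing
            continue
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            # Assumption: length is the full pattern length.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, verdict = len(pattern), allow
    return verdict
```

With a wildcard group holding Disallow: / and Allow: /public/, is_allowed returns True for /public/page and False for /private, regardless of the order in which the two rules are listed.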
The Five Rules Every robots.txt Author Should Internalize
- Empty Disallow means allow all. Disallow: with no value is the canonical way to say "no restrictions for this group." It is not an error.
- Paths must start with /. A rule like Disallow: admin/ is silently ignored by Google. The tester flags this as PATH_MISSING_LEADING_SLASH.
- Trailing $ anchors to URL end. Disallow: /*.pdf$ blocks /report.pdf but not /report.pdf?download=1. The tester's wildcard example demonstrates exactly this.
- Order does not matter within a group. Rules are not evaluated top-to-bottom. They are all evaluated and the longest match wins. Reordering rules to "fix" an unexpected result usually masks a deeper misunderstanding of the longest-match algorithm.
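- Adjacent User-agent declarations merge. A User-agent: Googlebot line followed directly by a User-agent: Bingbot line and then Disallow: /no-bots/ creates one group with two declared agents. An Allow or Disallow rule between the two User-agent lines creates two separate groups. Under the RFC 9309 grammar a blank line alone does not end a group, but some older parsers treat it as a separator, so avoid blank lines inside a group.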
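Several of these rules are parsing rules, and a small sketch makes them concrete. The parse_groups function below is hypothetical; it merges adjacent User-agent lines, keeps an empty Disallow as a valid rule, and reports paths that lack a leading slash using the finding code named above. As an assumption, it follows the RFC 9309 grammar in skipping blank lines, so a parser that treats a blank line as a group separator would behave differently.

```python
def parse_groups(text):
    """Return (groups, findings).

    groups is a list of (agents, rules) pairs, where rules is a list of
    (kind, path) tuples. findings is a list of (line_number, code).
    """
    groups, findings = [], []
    agents, rules = [], []
    seen_rule = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # blank or malformed lines are skipped in this sketch
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules, seen_rule = [], [], False
            agents.append(value)
        elif field in ("allow", "disallow") and agents:
            seen_rule = True
            if value and not value.startswith("/"):
                findings.append((lineno, "PATH_MISSING_LEADING_SLASH"))
                continue  # the rule is dropped, mirroring "silently ignored"
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups, findings
```

Feeding it two adjacent User-agent lines followed by one Disallow yields a single group with two agents, and a Disallow: admin/ line produces a finding instead of a rule.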
Directives That Are Not Actually Standards
Several robots.txt directives are widely used but not part of RFC 9309, and crawlers vary in their support:
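- Crawl-delay: Bingbot and Yandex honor it. Googlebot ignores it entirely. Search Console's legacy crawl rate limiter was retired in early 2024, so to slow Googlebot the documented fallback is to temporarily serve 500, 503, or 429 responses. For Bingbot, Bing Webmaster Tools offers crawl control as an alternative.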
- Host: Yandex-specific. Tells Yandex which mirror is canonical. Ignored by Google.
- Noindex: Never an officially documented directive. Google honored it unofficially for years and dropped support on September 1, 2019. The tester flags this as NOINDEX_IN_ROBOTS with a recommendation to use <meta robots> or the X-Robots-Tag header instead. Even if a crawler happened to honor it once, relying on this is a regression waiting to happen.
- Clean-param: Yandex-specific. Tells Yandex which URL parameters to ignore for canonicalization.
- Request-rate and Visit-time: historical artifacts almost no modern crawler implements.
The tester recognizes all of these without lint noise (so the parse does not pollute the Findings tab with false positives), but the deprecated Noindex is explicitly flagged because using it is actively dangerous: it gives authors a false sense that a URL is excluded from the index when in fact it remains indexable.
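A minimal sketch of that policy, assuming a simple line-based lint: known non-standard directives pass silently, Noindex is always reported, and anything else is reported as unknown. The lint_directives function and the UNKNOWN_DIRECTIVE code are hypothetical; only NOINDEX_IN_ROBOTS comes from the tester's own vocabulary.

```python
STANDARD = {"user-agent", "allow", "disallow"}
# Widely used but outside RFC 9309; accepted without a finding.
RECOGNIZED = {
    "sitemap", "crawl-delay", "host",
    "clean-param", "request-rate", "visit-time",
}


def lint_directives(text):
    """Return a list of (line_number, code) findings for directive names."""
    findings = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field == "noindex":
            findings.append((lineno, "NOINDEX_IN_ROBOTS"))
        elif field not in STANDARD | RECOGNIZED:
            # Hypothetical code for directives this sketch does not know.
            findings.append((lineno, "UNKNOWN_DIRECTIVE"))
    return findings
```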
AI Crawlers: The New Audit Frontier
The user-agent picker includes GPTBot, ChatGPT-User, Claude-Web, and PerplexityBot precisely because controlling AI training and inference fetches is now part of every SEO and content team's robots.txt review. The mechanics are identical to traditional crawlers — robots.txt is a public file, the spec is the same, the user-agent matching follows the same rules — but the policy questions are new: do you want your content used as AI training data? Do you want it cited by AI search products? Each of those decisions is encoded as a per-agent rule in robots.txt, and the tester lets you verify the rule actually does what you think it does before you commit it.
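To make the per-agent encoding concrete, here is a small sketch that generates a robots.txt blocking a chosen set of agents while leaving everyone else unrestricted. The build_robots function is hypothetical, and which agents to block is a policy choice, not a recommendation; the example relies on adjacent User-agent lines forming a single group.

```python
def build_robots(blocked_agents, default_disallow=()):
    """Build robots.txt text: one blocked group, then a wildcard group."""
    lines = []
    for agent in blocked_agents:
        lines.append(f"User-agent: {agent}")  # adjacent lines share a group
    if blocked_agents:
        lines += ["Disallow: /", ""]
    lines.append("User-agent: *")
    if default_disallow:
        lines += [f"Disallow: {path}" for path in default_disallow]
    else:
        lines.append("Disallow:")  # empty Disallow means no restrictions
    return "\n".join(lines) + "\n"
```

For instance, build_robots(["GPTBot", "PerplexityBot"]) produces one group that disallows everything for those two agents, followed by an unrestricted wildcard group. Pasting the output into the tester is a quick way to confirm each agent gets the verdict you intended.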
Sitemap Directives: The Quiet Connector
The Sitemap directive is the one part of robots.txt that is not about crawling rules at all. It tells crawlers where to find your XML sitemap(s): each value is an absolute URL pointing to a sitemap or sitemap index. Multiple Sitemap lines are allowed and additive. The tester parses every Sitemap directive into its own tab, validates that each is an absolute URL (the sitemap protocol requires a fully qualified URL, so a host-relative path such as /sitemap.xml is flagged), and renders the URLs as clickable links so you can spot stale references during the same audit pass. Pairing the robots.txt tester with the sitemap comparator on the same host is one of the fastest pre-deploy checks an SEO team can run.
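The absolute-URL check can be approximated with the standard library. The sketch below is an assumption about how such a validation might look, not the tester's implementation; sitemap_urls is a hypothetical name, and "absolute" is taken to mean an http or https scheme plus a host.

```python
from urllib.parse import urlsplit


def sitemap_urls(text):
    """Split Sitemap directive values into (valid, invalid) lists."""
    valid, invalid = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "sitemap":
            continue
        value = value.strip()
        parts = urlsplit(value)
        # Assumption: absolute means an http(s) scheme and a host.
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(value)
        else:
            invalid.append(value)
    return valid, invalid
```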
What This Tool Does Not Do (And Why That Is Intentional)
The tester does not crawl your site. It does not fetch your robots.txt for you: a cross-origin request from the browser is blocked by CORS unless your server explicitly allows it, and most servers do not. It does not check whether the URLs you test actually exist or return HTTP 200. Its job is narrow and well-scoped: given a robots.txt text and a list of URL strings, simulate the matching algorithm Googlebot uses, and report the verdict per user-agent. If you need a live crawl, use Screaming Frog or Sitebulb. If you need to test against the production robots.txt, fetch it yourself (browser, curl, or your CMS export) and paste it in. This separation keeps the tool predictable, fast, and offline-capable, which is the entire point of running it in the browser.
Privacy and Local Processing
robots.txt files often reference URLs you do not want logged by a third-party service: pre-launch routes, region-gated content, internal admin paths, and competitor watch lists. The tester runs entirely in your browser tab — paste, parse, test, export, all local. You can verify in DevTools that clicking Test triggers zero outbound requests. The same parser is available as an open-source library if you want to run it in CI, but the web tool itself is fully self-contained.