Robots.txt Tester

Test URLs against your robots.txt rules per user agent. Catches longest-match conflicts, syntax errors, unknown directives, and sitemap mistakes. 100% local.

100% Private & Secure

All processing happens locally in your browser. Your files never leave your device.

Client-Side Processing, No Server Uploads, No Registration Required

Keywords

robots.txt tester, robots.txt validator, robots txt checker, googlebot test, robots.txt syntax, rfc 9309, crawler rules, url allow disallow

How to use

1. Paste your robots.txt content into the left panel, or drop a saved robots.txt file onto the upload zone. The parser handles comments, blank lines, multi-agent groups, and the standard directives (User-agent, Allow, Disallow, Sitemap, Crawl-delay).

2. Paste the URLs you want to test into the right panel, one per line. These can be full URLs (https://example.com/admin/) or path-only entries (/admin/). The tool extracts the path-and-query portion for matching.

3. Pick a user agent from the dropdown. Defaults to Googlebot, but you can test Bingbot, GPTBot, Claude-Web, PerplexityBot, and other major crawlers. Group selection follows the most-specific match per RFC 9309, falling back to the wildcard User-agent: *.

4. Click Test URLs. Each URL gets an Allowed or Blocked verdict, the matched rule with its line number, and the reason (longest-match, no rule, or no applicable group). The syntax linter surfaces parsing issues separately in the Findings tab.

5. Review the tabs: Verdicts for per-URL results, Findings for syntax issues, Sitemaps for the declared sitemap URLs, Groups for the parsed user-agent groups and their rules. Export verdicts to CSV for an audit trail or to share with the team.

Features

RFC 9309 Longest-Match-Wins Semantics

Matches Google's implementation exactly: when multiple rules match a URL, the one with the longest literal path wins, and Allow beats Disallow on ties. Wildcards (*) and end-of-URL anchors ($) are honored.

Per-User-Agent Group Selection

Picks the most-specific User-agent group for the agent you select, falling back to User-agent: * if no specific match. Lets you see how Googlebot, Bingbot, GPTBot, and other crawlers experience the same robots.txt differently.

Built-in Syntax Linter

Reports unknown directives, missing colons, rules before any User-agent, deprecated Noindex usage, relative Sitemap URLs, paths missing a leading slash, and other authoring mistakes — each with the exact line number.

Sitemap Cross-Reference

All Sitemap directives are extracted into their own tab, validated for absolute-URL form, and rendered as clickable links so you can spot stale or unreachable references at a glance.

CSV Export Per Run

Drop the verdict table into a spreadsheet, an issue tracker, or a CI artifact. Columns include URL, user agent, verdict, matched rule type and path, and the line number where the rule lives in your robots.txt.

Why Choose This Tool?

Your robots.txt Never Leaves the Browser

robots.txt files routinely reference staging URLs, internal admin paths, gated content, and competitor watch lists you do not want sitting in a third-party SaaS log. Every byte of input stays in your browser memory — no upload, no API call, no logs.

Google Sunset Their Own Tester — This Fills the Gap

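Search Console's robots.txt Tester was retired in late 2023. Its successor, the robots.txt report, shows fetch status and parse problems but does not test individual URLs against your rules. This tool implements the same RFC 9309 rules Googlebot follows, so you can verify rule behavior without hitting a search engine's URL inspection API.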

Tells You Which Line, Not Just That Something Is Wrong

Every verdict reports the exact rule that matched and its line number in the source file. Every syntax finding points to the offending line. Turn vague 'Blocked by robots.txt' Search Console warnings into precise one-line CMS or template fixes.

Open-Source Parser Logic

The parser ships in the open-source @anthropic-tools/tools-core library used by the REST API. You can audit the longest-match implementation, the user-agent specificity logic, and the syntax linter — no black-box scoring, no proprietary heuristics.

Robots.txt Done Right: A Practical Guide to Crawler Rules

What robots.txt Actually Controls

robots.txt is a public, per-host text file at /robots.txt that tells well-behaved crawlers which URLs they may fetch and which they may not. It is a request, not an enforcement: malicious crawlers ignore it entirely, and even well-behaved ones treat it as a hint, not a security boundary. The single most common misconception about robots.txt is that it controls indexing. It does not. It controls crawling. A URL that is disallowed in robots.txt can still appear in Google's index if Google learns about the URL from other signals (backlinks, sitemap entries, internal links from pages it can crawl). The result is the "Indexed, though blocked by robots.txt" status in Search Console: a URL with no snippet, no title, and effectively zero search performance, but still in the index. If you want a URL out of the index, use a noindex meta tag or X-Robots-Tag header, and leave the URL crawlable, because a crawler that is blocked from fetching the page never sees the noindex. If you want a URL out of the crawl budget, use robots.txt. The two are different problems.

The RFC 9309 Matching Algorithm

The robots.txt spec was formalized as RFC 9309 in 2022, codifying the behavior Google had implemented for years. The algorithm is more nuanced than most teams realize:

  1. Group selection. The crawler scans every User-agent declaration and picks the group whose User-agent name is the most specific case-insensitive substring match for its own name. If no group matches, the wildcard User-agent: * group applies. If there is no wildcard either, every URL is allowed.
  2. Rule matching within the chosen group. Every Allow and Disallow rule whose path matches the request URL is collected. The path with the most literal characters wins — that is the "longest match" in RFC 9309 §2.2.2. The * wildcard matches any sequence of characters (including empty); the $ anchor pins the match to the end of the URL.
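  3. Tie-breaking. When an Allow and a Disallow rule match with the same literal length, Allow wins. For example, if Disallow: /page and Allow: /page both match /page, the URL is crawlable. Note that "Disallow: / + Allow: /public/" is not a tie: /public/ is simply the longer match under step 2, which is why /public/anything is crawlable.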

The tester implements this algorithm exactly. The Verdicts tab shows the matched rule's line number so you can verify behavior against your authored file rather than against an opaque "allowed/blocked" output.
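Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tester's actual parser: the Rule type and the pattern_to_regex, literal_length, and decide helpers are hypothetical names, and "length" is taken to mean the count of non-wildcard characters, as described above.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    kind: str  # "allow" or "disallow"
    path: str  # pattern as written, e.g. "/*.pdf$"
    line: int  # 1-based line number in robots.txt (illustrative)


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # "*" matches any sequence (including empty); everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def literal_length(pattern: str) -> int:
    """Assumed specificity measure: characters other than '*' and a final '$'."""
    body = pattern[:-1] if pattern.endswith("$") else pattern
    return len(body.replace("*", ""))


def decide(rules: List[Rule], path: str) -> Tuple[bool, Optional[Rule]]:
    """Return (allowed, winning_rule) for a path within one chosen group."""
    best: Optional[Rule] = None
    for rule in rules:
        if not rule.path:  # an empty Disallow restricts nothing
            continue
        if pattern_to_regex(rule.path).match(path) is None:
            continue
        if best is None:
            best = rule
            continue
        new_len, best_len = literal_length(rule.path), literal_length(best.path)
        longer = new_len > best_len
        tie_for_allow = new_len == best_len and rule.kind == "allow" and best.kind == "disallow"
        if longer or tie_for_allow:
            best = rule
    # No matching rule means the URL is allowed.
    return (best is None or best.kind == "allow", best)
```

Because every matching rule is compared on length rather than position, reordering the rules list does not change the verdict.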

The Five Rules Every robots.txt Author Should Internalize

  • Empty Disallow means allow all. Disallow: with no value is the canonical way to say "no restrictions for this group." It is not an error.
  • Paths must start with /. A rule like Disallow: admin/ is silently ignored by Google. The tester flags this as PATH_MISSING_LEADING_SLASH.
  • Trailing $ anchors to URL end. Disallow: /*.pdf$ blocks /report.pdf but not /report.pdf?download=1. The tester's wildcard example demonstrates exactly this.
  • Order does not matter within a group. Rules are not evaluated top-to-bottom. They are all evaluated and the longest match wins. Reordering rules to "fix" an unexpected result usually masks a deeper misunderstanding of the longest-match algorithm.
  • Adjacent User-agent declarations merge. User-agent: Googlebot\nUser-agent: Bingbot\nDisallow: /no-bots/ creates one group with two declared agents. A blank line or any non-User-agent directive between them creates two separate groups.
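The grouping rule in the last bullet can be illustrated with a small parser. This is a sketch under the assumptions stated in the list above (a blank line or any other directive ends a run of User-agent lines); Group and parse_groups are hypothetical names, not the tool's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Group:
    agents: List[str] = field(default_factory=list)
    rules: List[Tuple[str, str, int]] = field(default_factory=list)  # (kind, path, line)


def parse_groups(text: str) -> List[Group]:
    groups: List[Group] = []
    current: Optional[Group] = None
    in_agent_run = False  # True while consecutive User-agent lines are being read
    for lineno, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped:
            in_agent_run = False  # a blank line ends the run of User-agent lines
            continue
        line = stripped.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line: ignored, does not end the run
        if ":" not in line:
            in_agent_run = False
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if not in_agent_run:
                current = Group()
                groups.append(current)
            current.agents.append(value)
            in_agent_run = True
        else:
            in_agent_run = False
            if key in ("allow", "disallow") and current is not None:
                current.rules.append((key, value, lineno))
    return groups
```

With this sketch, two adjacent User-agent lines followed by one Disallow yield a single group with two agents, while the same lines separated by a blank line yield two groups, the first of which has no rules.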

Directives That Are Not Actually Standards

Several robots.txt directives are widely used but not part of RFC 9309, and crawlers vary in their support:

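  • Crawl-delay: Bingbot and Yandex honor it. Googlebot ignores it entirely. Search Console's crawl rate limiter was retired in January 2024, so to slow Googlebot temporarily, return 429, 500, or 503 responses. For Bing, the crawl control setting in Bing Webmaster Tools is an alternative.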
  • Host: Yandex-specific. Tells Yandex which mirror is canonical. Ignored by Google.
  • Noindex: Never officially documented. Google honored it unofficially until September 1, 2019 and has not supported it since. The tester flags this as NOINDEX_IN_ROBOTS with a recommendation to use <meta name="robots" content="noindex"> or the X-Robots-Tag header instead. Even if a crawler happens to honor it today, relying on it is a regression waiting to happen.
  • Clean-param: Yandex-specific. Tells Yandex which URL parameters to ignore for canonicalization.
  • Request-rate and Visit-time: historical artifacts almost no modern crawler implements.

The tester recognizes all of these without lint noise (so the parse does not pollute the Findings tab with false positives), but the deprecated Noindex is explicitly flagged because using it is actively dangerous: it gives authors a false sense that a URL is excluded from the index when in fact it remains indexable.
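A linter along these lines might look like the sketch below. Only PATH_MISSING_LEADING_SLASH and NOINDEX_IN_ROBOTS are finding codes named in this guide; the other codes (MISSING_COLON, UNKNOWN_DIRECTIVE, RULE_BEFORE_USER_AGENT), the KNOWN_DIRECTIVES set, and the lint function are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

# Directives accepted without a finding, including the non-standard ones listed above.
KNOWN_DIRECTIVES = {
    "user-agent", "allow", "disallow", "sitemap",
    "crawl-delay", "host", "clean-param", "request-rate", "visit-time",
}


def lint(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, finding_code) pairs for common authoring mistakes."""
    findings: List[Tuple[int, str]] = []
    seen_agent = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only
        if ":" not in line:
            findings.append((lineno, "MISSING_COLON"))
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "noindex":
            findings.append((lineno, "NOINDEX_IN_ROBOTS"))
        elif key not in KNOWN_DIRECTIVES:
            findings.append((lineno, "UNKNOWN_DIRECTIVE"))
        elif key == "user-agent":
            seen_agent = True
        elif key in ("allow", "disallow"):
            if not seen_agent:
                findings.append((lineno, "RULE_BEFORE_USER_AGENT"))
            # An empty value is valid ("Disallow:" means no restrictions).
            if value and not value.startswith("/"):
                findings.append((lineno, "PATH_MISSING_LEADING_SLASH"))
    return findings
```

Note that Crawl-delay and the Yandex-specific directives pass silently here, while Noindex always produces a finding.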

AI Crawlers: The New Audit Frontier

The user-agent picker includes GPTBot, ChatGPT-User, Claude-Web, and PerplexityBot precisely because controlling AI training and inference fetches is now part of every SEO and content team's robots.txt review. The mechanics are identical to traditional crawlers — robots.txt is a public file, the spec is the same, the user-agent matching follows the same rules — but the policy questions are new: do you want your content used as AI training data? Do you want it cited by AI search products? Each of those decisions is encoded as a per-agent rule in robots.txt, and the tester lets you verify the rule actually does what you think it does before you commit it.

Sitemap Directives: The Quiet Connector

The Sitemap directive is the one part of robots.txt that is not about crawling rules at all. It tells crawlers where to find your XML sitemap(s): an absolute URL pointing to a sitemap or sitemap index. Multiple Sitemap lines are allowed and additive. The tester parses every Sitemap directive into its own tab, validates that each is an absolute URL (the spec requires this), and renders the URLs as clickable links so you can spot stale references during the same audit pass. Pairing the robots.txt tester with the sitemap comparator on the same host is one of the fastest pre-deploy checks an SEO team can run.
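The extraction and absolute-URL check can be approximated with the standard library. This is a sketch, assuming "absolute" means an http or https scheme plus a host; extract_sitemaps is a hypothetical helper, not the tool's API.

```python
from typing import List, Tuple
from urllib.parse import urlsplit


def extract_sitemaps(text: str) -> List[Tuple[int, str, bool]]:
    """Return (line_number, url, is_absolute) for every Sitemap directive."""
    results: List[Tuple[int, str, bool]] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "sitemap":
            continue
        url = value.strip()
        parts = urlsplit(url)
        # Assumption: absolute means an http(s) scheme and a non-empty host.
        is_absolute = parts.scheme in ("http", "https") and bool(parts.netloc)
        results.append((lineno, url, is_absolute))
    return results
```

A line such as Sitemap: /sitemap-news.xml comes back with is_absolute set to False, which is the case the Findings tab reports as a relative Sitemap URL.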

What This Tool Does Not Do (And Why That Is Intentional)

The tester does not crawl your site. It does not fetch your robots.txt for you, because a cross-origin fetch from the browser is blocked unless the target server sends permissive CORS headers, and most sites do not. It does not check whether the URLs you test actually exist or return HTTP 200. Its job is narrow and well-scoped: given a robots.txt text and a list of URL strings, simulate the matching algorithm Googlebot uses, and report the verdict per user-agent. If you need a live crawl, use Screaming Frog or Sitebulb. If you need to test against the production robots.txt, fetch it yourself (browser, curl, or your CMS export) and paste it in. This separation keeps the tool predictable, fast, and offline-capable, which is the entire point of running it in the browser.

Privacy and Local Processing

robots.txt files often reference URLs you do not want logged by a third-party service: pre-launch routes, region-gated content, internal admin paths, and competitor watch lists. The tester runs entirely in your browser tab — paste, parse, test, export, all local. You can verify in DevTools that clicking Test triggers zero outbound requests. The same parser is available as an open-source library if you want to run it in CI, but the web tool itself is fully self-contained.

Frequently Asked Questions

Is my robots.txt sent to a server?

No. Parsing and URL matching run entirely in your browser tab using JavaScript loaded from a static site. You can confirm this in your browser's network panel — clicking Test triggers no outbound requests.

Does it follow the same rules Googlebot uses?

Yes. The parser implements RFC 9309: longest-match wins, Allow beats Disallow on ties, wildcards (*) match any sequence including empty, and $ anchors to end-of-URL. User-agent group selection follows Google's most-specific-substring rule with fallback to User-agent: *.

Why is my Disallow rule ignored?

Most often because the path is missing a leading slash. Disallow: admin/ is silently ignored; Disallow: /admin/ works. The tester flags this with PATH_MISSING_LEADING_SLASH. Other common causes: a syntax error elsewhere in the same group, the rule appearing before any User-agent declaration, or a longer Allow rule beating it on the same URL.

How does the user-agent matching work?

Group selection is a case-insensitive substring check. If your robots.txt declares User-agent: Googlebot-Image and you test as Googlebot-Image, that group applies. If you test as Googlebot, the Googlebot-Image group does not match — Googlebot ≠ Googlebot-Image — and the wildcard User-agent: * group applies instead.
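As a rough illustration of that selection rule, the sketch below treats a group as matching when its declared name appears inside the tested crawler name, and treats the longest such name as the most specific. Both points are assumptions made for the example, and select_group is a hypothetical helper.

```python
from typing import Iterable, Optional


def select_group(declared_names: Iterable[str], crawler: str) -> Optional[str]:
    """Pick the declared User-agent name that applies to `crawler`, or None."""
    names = list(declared_names)
    crawler_lower = crawler.lower()
    best: Optional[str] = None
    for name in names:
        if not name or name == "*":
            continue  # the wildcard is only a fallback
        # Case-insensitive substring check; the longest match is the most specific.
        if name.lower() in crawler_lower and (best is None or len(name) > len(best)):
            best = name
    if best is not None:
        return best
    return "*" if "*" in names else None  # None: no applicable group, everything allowed
```

For example, with groups declared for Googlebot-Image and *, testing as Googlebot falls through to the wildcard, while testing as Googlebot-Image selects its own group.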

Why is Crawl-delay flagged as ignored?

Googlebot does not honor Crawl-delay; it has not for years. Bingbot and Yandex still do. The tester reports this as an informational finding so you know the directive will not affect Google's crawl rate. Search Console's crawl rate limiter was retired in January 2024, so to slow Googlebot temporarily, return 429, 500, or 503 responses until the load subsides.

Why does the linter complain about Noindex in robots.txt?

Noindex in robots.txt was an undocumented Google experiment that ended around 2019. It has been officially unsupported since then. Some other crawlers may honor it, but relying on it is dangerous because the behavior can change without notice. Use a <meta name="robots" content="noindex"> tag or the X-Robots-Tag HTTP header instead.

What format does the CSV export use?

Six columns: url, user_agent, allowed (true/false), matched_rule_type (allow/disallow/empty), matched_rule_path, matched_rule_line. Values containing commas, quotes, or newlines are properly quoted per RFC 4180. Use it as an issue-tracker artifact or as the input to a longer compliance audit.
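If you want to produce or post-process a file in the same shape, Python's csv module applies the same quoting conventions by default. The sketch below is an illustration only: the column names come from the answer above, while verdicts_to_csv and the dictionary row format are assumptions.

```python
import csv
import io
from typing import Dict, Iterable

COLUMNS = [
    "url", "user_agent", "allowed",
    "matched_rule_type", "matched_rule_path", "matched_rule_line",
]


def verdicts_to_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize verdict rows to CSV text with a header line."""
    buffer = io.StringIO()
    # The default dialect quotes fields containing commas, quotes, or newlines
    # and terminates records with CRLF.
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([
            row["url"],
            row["user_agent"],
            "true" if row["allowed"] else "false",
            row["matched_rule_type"],
            row["matched_rule_path"],
            row["matched_rule_line"],
        ])
    return buffer.getvalue()
```

Reading the result back with csv.reader returns the original field values, including any embedded commas or quotes.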

Can it test URLs that are not on the same host as the robots.txt?

The matcher only inspects the path-and-query portion of each URL, so the host portion is ignored for the purpose of rule matching. In production, of course, robots.txt only applies to its own host — but the matcher itself does not enforce that, which makes the tool useful for testing rule logic in isolation.
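Extracting the path-and-query portion can be sketched with urllib.parse. This is an illustration rather than the tool's code; path_and_query is a hypothetical helper, and treating a bare host as the path / is an assumption.

```python
from urllib.parse import urlsplit


def path_and_query(entry: str) -> str:
    """Reduce a full URL or path-only entry to the part used for rule matching."""
    parts = urlsplit(entry.strip())
    path = parts.path or "/"  # assumption: a bare host is tested as "/"
    # The scheme, host, and fragment are discarded; the query string is kept.
    return path + ("?" + parts.query if parts.query else "")
```

Under this sketch, https://example.com/admin/ and https://other.example/admin/ both reduce to /admin/, which is why the host plays no part in the verdict.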
