AI Crawler robots.txt Guide
Which Bots to Allow
Your robots.txt file is no longer just about Googlebot. AI crawlers from OpenAI, Anthropic, Perplexity, and others are visiting your site daily. This guide covers every major AI bot and how to make the right allow/block decisions.
The New Crawler Landscape
In 2024, most websites had two or three search engine crawlers to think about: Googlebot, Bingbot, and maybe Yandex. By 2026, there are over a dozen AI-specific crawlers hitting websites regularly, each with different purposes and behaviors.
Some crawlers index content for training data. Others fetch content in real-time to answer user queries. Some do both. The distinction matters because blocking a training crawler is very different from blocking a query-time crawler. Block the wrong one and you disappear from AI-powered search results.
Your robots.txt is the first file every well-behaved crawler reads. If you have not updated it since the AI era began, you are either blocking traffic you want or allowing access you did not intend.
Major AI Crawlers Reference
| Bot Name | Operator | Purpose | Recommendation |
|---|---|---|---|
| GPTBot | OpenAI | Model training | Allow (critical for ChatGPT visibility) |
| ChatGPT-User | OpenAI | Real-time browsing only | Allow (live search queries) |
| ClaudeBot | Anthropic | Training data | Allow (improves Claude citations) |
| anthropic-ai | Anthropic | Product features | Allow |
| PerplexityBot | Perplexity | Real-time search answers | Allow (cited source traffic) |
| Google-Extended | Google | Gemini/Bard training | Allow (Gemini training and grounding; AI Overviews follow Googlebot rules) |
| Amazonbot | Amazon | Alexa/product answers | Optional (depends on audience) |
| Meta-ExternalAgent | Meta | AI training | Optional (Meta AI features) |
| CCBot | Common Crawl | Open training datasets | Optional (feeds many AI models) |
Recommended robots.txt for AI-Friendly Sites
Here is a robots.txt configuration that maximizes AI agent discoverability while protecting sensitive paths:
    # Search engines
    # A crawler obeys only the most specific group that matches it, so the
    # sensitive-path rules are repeated in every named group.
    User-agent: Googlebot
    User-agent: Bingbot
    Disallow: /admin/
    Disallow: /api/internal/
    Disallow: /dashboard/
    Allow: /

    # AI Crawlers: allow for maximum agent visibility
    User-agent: GPTBot
    User-agent: ChatGPT-User
    User-agent: ClaudeBot
    User-agent: anthropic-ai
    User-agent: PerplexityBot
    User-agent: Google-Extended
    Disallow: /admin/
    Disallow: /api/internal/
    Disallow: /dashboard/
    Allow: /

    # Protect sensitive paths from all other crawlers
    User-agent: *
    Disallow: /admin/
    Disallow: /api/internal/
    Disallow: /dashboard/

    # Point to sitemap and AI discovery files
    Sitemap: https://example.com/sitemap.xml
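Key principle: Be explicit. Do not rely on the wildcard User-agent: * rule for AI bots. Declare each bot you want to allow by name. This gives you granular control and makes your intent clear. Keep in mind that a bot matched by a named group ignores the User-agent: * group entirely, so any Disallow rules you still want applied to that bot must be repeated inside its own group.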
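The effect of naming a bot explicitly can be checked with a few lines of Python. The sketch below uses the standard-library urllib.robotparser to show that a bot with its own group does not inherit rules from User-agent: *. The two-group file, the bot name SomeOtherBot, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot has its own group, everyone else uses "*".
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# GPTBot matches its own group, so the wildcard Disallow never applies to it.
gptbot_admin = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/admin/")
# A bot with no named group falls back to the wildcard group.
other_admin = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/admin/")

print("GPTBot may fetch /admin/:", gptbot_admin)
print("SomeOtherBot may fetch /admin/:", other_admin)
```

To keep /admin/ off-limits for GPTBot as well, the Disallow line has to appear inside the GPTBot group, not only under the wildcard.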
The Allow vs Block Decision Framework
The decision to allow or block an AI crawler depends on your business model and content strategy. Here is a framework:
Allow if: You want your content cited in AI-powered answers. You want agents to be able to complete tasks on your site (sign up, purchase, compare). You want to appear in AI search results from Perplexity, ChatGPT, and Google AI Overviews.
Consider blocking if: Your revenue depends entirely on page views. You have premium content behind a paywall that you do not want summarized for free. In either case, a middle ground is to block training crawlers while still allowing query-time crawlers.
The reality: For most businesses, blocking AI crawlers is like blocking Googlebot in 2005. You are opting out of the primary way people will discover products and services going forward. The sites that allow AI crawlers now are building the same first-mover advantage that early SEO adopters captured.
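The middle ground mentioned above (block training crawlers, allow query-time crawlers) could be generated with a small script along these lines. This is only a sketch: the function build_robots_txt is hypothetical, and the split of bots into the two lists is an assumption loosely based on the reference table in this guide, so check each operator's documentation before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Assumed classification, loosely based on the reference table in this guide.
# Verify each bot's purpose against its operator's documentation before use.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent", "CCBot"]
QUERY_TIME_BOTS = ["ChatGPT-User", "PerplexityBot"]
PRIVATE_PATHS = ["/admin/", "/api/internal/", "/dashboard/"]


def build_robots_txt(blocked, allowed, private_paths):
    """Render robots.txt text that blocks some bots and allows others."""
    lines = []
    if blocked:
        lines += [f"User-agent: {bot}" for bot in blocked]
        lines += ["Disallow: /", ""]
    if allowed:
        lines += [f"User-agent: {bot}" for bot in allowed]
        # Repeat private paths: a named group does not inherit "*" rules.
        lines += [f"Disallow: {path}" for path in private_paths]
        lines += ["Allow: /", ""]
    # Every other crawler only loses access to the private paths.
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in private_paths]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(TRAINING_BOTS, QUERY_TIME_BOTS, PRIVATE_PATHS)
print(robots_txt)

# Round-trip check with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print("GPTBot may fetch /:", parser.can_fetch("GPTBot", "/"))
print("PerplexityBot may fetch /:", parser.can_fetch("PerplexityBot", "/"))
```

The Disallow lines are written before Allow: / on purpose. Parsers that follow RFC 9309 pick the longest matching rule regardless of order, but some simpler parsers apply the first rule that matches, and this ordering gives the same result under both.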
Common robots.txt Mistakes
Blanket Disallow: /
A Disallow: / under User-agent: * blocks every crawler that does not have a named group of its own, AI bots included. This is the nuclear option and almost never what you want.
Blocking GPTBot but allowing ChatGPT-User
GPTBot feeds ChatGPT's knowledge base. Blocking it means ChatGPT has outdated or no information about you, even if ChatGPT-User can browse.
No AI-specific rules at all
Relying on the wildcard rule means you cannot differentiate between AI bots. Add explicit rules for each major AI crawler.
Forgetting query-time crawlers
PerplexityBot and ChatGPT-User fetch content when users ask questions. Blocking them cuts you off from citations and referral traffic in those products.
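The four mistakes above lend themselves to a simple automated check. The following is a rough sketch, not a complete robots.txt validator: the functions parse_groups and lint are hypothetical, the bot lists are assumptions drawn from this guide, and "blocked" is approximated as a group that contains Disallow: / and no Allow lines.

```python
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Google-Extended"]
QUERY_TIME_BOTS = ["ChatGPT-User", "PerplexityBot"]


def parse_groups(robots_txt):
    """Return a list of (user_agents, rules) groups from robots.txt text."""
    groups, agents, rules = [], [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def blocks_everything(rules):
    """Heuristic: Disallow: / with no Allow lines in the same group."""
    return ("disallow", "/") in rules and not any(f == "allow" for f, _ in rules)


def lint(robots_txt):
    """Return warnings for the common mistakes listed in this guide."""
    groups = parse_groups(robots_txt)
    named = {a: rules for agents, rules in groups for a in agents if a != "*"}
    wildcard = next((rules for agents, rules in groups if "*" in agents), [])

    def is_blocked(bot):
        # A named group takes precedence; otherwise the wildcard group applies.
        return blocks_everything(named.get(bot.lower(), wildcard))

    warnings = []
    if blocks_everything(wildcard):
        warnings.append("blanket Disallow: / under User-agent: *")
    if is_blocked("GPTBot") and not is_blocked("ChatGPT-User"):
        warnings.append("GPTBot is blocked while ChatGPT-User is allowed")
    if not any(bot.lower() in named for bot in AI_BOTS):
        warnings.append("no AI-specific groups")
    for bot in QUERY_TIME_BOTS:
        if is_blocked(bot):
            warnings.append(f"query-time crawler {bot} is blocked")
    return warnings


for warning in lint("User-agent: *\nDisallow: /\n"):
    print("warning:", warning)
```

A file that passes this check can still be wrong in other ways (for example, partial Disallow rules that hide important pages), so treat the output as a first pass only.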
robots.txt and Your AX Score
robots.txt is the first check in the Discoverability dimension (20% of AX score). If your robots.txt blocks major AI crawlers, your Discoverability score drops significantly — often by 8-10 points. The audit checks each major AI bot individually and reports which are allowed and which are blocked.
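A per-bot check of this kind is easy to approximate locally. The sketch below is not the AX Audit's implementation and says nothing about how the score is computed; it only reports, for an assumed list of bots taken from the table in this guide, whether each one may fetch the site root according to Python's urllib.robotparser. The function audit_ai_bots and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of bots, taken from the reference table in this guide.
MAJOR_AI_BOTS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "anthropic-ai",
    "PerplexityBot",
    "Google-Extended",
]


def audit_ai_bots(robots_txt, bots=MAJOR_AI_BOTS, path="/"):
    """Map each bot name to True (allowed) or False (blocked) for path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in bots}


# Hypothetical file that blocks GPTBot and leaves everyone else on "*".
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

report = audit_ai_bots(SAMPLE)
for bot, allowed in report.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

To run the same check against a live site, RobotFileParser also provides set_url() and read(), which fetch the file over the network before can_fetch() is called.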
Check Your robots.txt With AX Audit
The AX Audit fetches your robots.txt, checks permissions for every major AI crawler, and tells you exactly which bots are allowed and which are blocked.
Free audit. No signup required.