How to Allow or Block AI Crawlers in robots.txt (2026 Guide)
What Are AI Crawlers?
AI crawlers are automated programs that AI services such as OpenAI (ChatGPT), Perplexity, and Anthropic (Claude) use to gather information from websites. Like traditional crawlers such as Googlebot, they generally respect robots.txt rules when visiting your site.
As of 2026, most major AI services operate their own crawlers, and you can control each one individually through robots.txt.
Complete AI Crawler List
| Crawler Name | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training |
| ChatGPT-User | OpenAI | User-initiated browsing in ChatGPT |
| ClaudeBot | Anthropic | Model training |
| anthropic-ai | Anthropic | Anthropic general crawler |
| PerplexityBot | Perplexity | Perplexity search |
| Google-Extended | Google | Gemini training |
| GoogleOther | Google | AI Overview experiments |
| CCBot | Common Crawl | Open dataset (used by many AI models) |
| Bytespider | ByteDance | TikTok/ByteDance AI |
| Applebot-Extended | Apple | Apple Intelligence training |
| cohere-ai | Cohere | Cohere AI training |
| Diffbot | Diffbot | Knowledge graph building |
How to Allow AI Crawlers
For your content to appear in AI search results, you need to allow AI crawlers to access your site. Here's a configuration that allows all major AI crawlers:
# Allow major AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: GoogleOther
Allow: /
User-agent: CCBot
Allow: /
User-agent: Applebot-Extended
Allow: /
# Traditional search engines
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Note that User-agent: * with Allow: / alone may not be enough. Some AI crawlers only crawl when explicitly permitted under their own User-agent name.
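You can sanity-check a configuration like the one above before deploying it. The sketch below uses Python's standard-library urllib.robotparser to evaluate an abridged version of the rules locally; the URLs are placeholders, not real endpoints.

```python
from urllib.robotparser import RobotFileParser

# Abridged version of the allow-all configuration above
rules = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Explicitly listed AI crawlers and unlisted crawlers are both allowed
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))
print(parser.can_fetch("SomeOtherBot", "https://example.com/page"))
```

Both calls print True: GPTBot matches its own group, while SomeOtherBot falls through to the wildcard group.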
How to Block AI Crawlers
If you don't want certain AI crawlers using your content, use Disallow rules.
Block a specific AI crawler
# Block only GPTBot (allow everything else)
User-agent: GPTBot
Disallow: /
User-agent: *
Allow: /
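The same local check (a sketch with urllib.robotparser, using placeholder URLs) confirms that this configuration refuses GPTBot while leaving other crawlers unaffected:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/"))  # True
```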
Block all AI crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Allow regular search engines
User-agent: *
Allow: /
Block specific paths only
# Block AI crawlers from /private/ only
User-agent: GPTBot
Disallow: /private/
Allow: /
User-agent: ClaudeBot
Disallow: /private/
Allow: /
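A quick way to verify the path-level rules is to evaluate them with urllib.robotparser (a sketch with placeholder URLs; note that Python's parser applies rules in listed order, while real crawlers following RFC 9309 use longest-match precedence — both give the same answer for this configuration):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# /private/ is refused, everything else is allowed
print(rp.can_fetch("GPTBot", "https://example.com/private/report.html"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))            # True
```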
Should You Allow or Block?
Allow when
- Blog or media site — AI citations can drive traffic
- Tool or service site — Chance to be recommended in AI search results
- E-E-A-T focused content — Being cited by AI can signal authority
- GEO optimization — Blocking AI crawlers makes GEO efforts pointless
Consider blocking when
- Paywalled content — AI answers could reduce subscription incentive
- Proprietary data — Risk of competitors using your original research
- Copyright concerns — You don't want content used for AI training
Partial control is usually best
For most sites, the pragmatic approach is selective access rather than all-or-nothing. Allow AI crawlers on public content while blocking premium, private, or admin pages.
How to Verify Your robots.txt
Direct browser access
Visit https://yoursite.com/robots.txt to see the current configuration.
Google Search Console
Use the robots.txt report in Google Search Console to confirm that your file was fetched successfully and to see which rules Google parsed (the older standalone robots.txt Tester has been retired). Note that there's no AI-crawler-specific tester available yet.
IndexReady
IndexReady's GEO scoring automatically checks whether major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) are allowed in your robots.txt. It gives you a clear status of your AI crawler permissions.
robots.txt vs llms.txt
robots.txt is often confused with llms.txt. They serve different purposes:
| | robots.txt | llms.txt |
|---|---|---|
| Purpose | Control crawl access | Provide site info to AI |
| Scope | All crawlers | AI/LLMs only |
| Effect | Access control | Improves content understanding |
| Priority | Essential for any site | Recommended for GEO |
The best approach: allow AI crawlers in robots.txt, then provide context in llms.txt. Together, these maximize your visibility in AI search.
Frequently Asked Questions (FAQ)
Does blocking AI crawlers in robots.txt prevent AI training?
Major AI companies (OpenAI, Anthropic, Google) have stated they respect robots.txt rules. However, robots.txt is not technically enforceable — it's a voluntary standard. For legal protection, consider explicit terms of service and access controls.
Does User-agent: * cover AI crawlers too?
User-agent: * applies to any crawler that has no rules of its own. However, a crawler that finds a group written specifically for its User-agent name follows only that group and ignores the wildcard rules entirely. For reliable control, explicitly list each AI crawler's User-agent name.
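This group-precedence behavior can be demonstrated with Python's standard urllib.robotparser (a sketch with hypothetical paths): a crawler with its own group never consults the wildcard group, even when the wildcard group is stricter.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /drafts/

User-agent: *
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot follows only its own group: /drafts/ is blocked, /tmp/ is NOT
print(rp.can_fetch("GPTBot", "https://example.com/tmp/x"))     # True
print(rp.can_fetch("GPTBot", "https://example.com/drafts/x"))  # False

# A crawler without its own group falls back to the wildcard group
print(rp.can_fetch("OtherBot", "https://example.com/tmp/x"))   # False
```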
Can I block Google AI Overview specifically?
Google AI Overview uses data crawled by regular Googlebot, so you can't block it through robots.txt alone. Blocking Google-Extended prevents Gemini from using your content for training, but doesn't affect AI Overview. Use the nosnippet meta tag to prevent AI Overview citations.
How long does it take for robots.txt changes to take effect?
It depends on the crawler. Googlebot typically detects changes within 24 hours. AI crawler refresh intervals aren't officially documented — allow a few days to a week for changes to fully propagate.