How to Set Up robots.txt Correctly for SEO and AI Crawlers
What Is robots.txt?
robots.txt is a plain text file placed at the root of your website that tells web crawlers which pages they are allowed to visit. It is part of the Robots Exclusion Protocol, a standard that has been in use since the mid-1990s.
When a crawler like Googlebot arrives at your site, the first thing it does is check https://example.com/robots.txt for instructions. Based on what it finds, the crawler decides which URLs to fetch and which to skip.
It is important to understand that robots.txt is a set of guidelines, not a security mechanism. Well-behaved crawlers from Google, Bing, OpenAI, and Anthropic will respect your directives. Malicious bots may ignore them entirely. If you need to protect sensitive content, use authentication or server-level access controls instead.
robots.txt Syntax
The syntax is straightforward. You specify a User-agent to target a crawler, then use Allow and Disallow to define the rules.
```
# Allow all crawlers to access the entire site
User-agent: *
Allow: /

# Point crawlers to the sitemap
Sitemap: https://example.com/sitemap.xml
```
Core Directives
| Directive | Purpose |
|---|---|
| User-agent | Specifies which crawler the rules apply to |
| Allow | Permits crawling of a specified path |
| Disallow | Blocks crawling of a specified path |
| Sitemap | Indicates the location of your XML sitemap |
| Crawl-delay | Sets a delay between requests (not supported by Google) |
User-agent: * is a wildcard that applies to all crawlers. You can create separate rule blocks for specific crawlers by using their exact names.
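For example, you can combine a default block with a stricter block for one named crawler. Note that a crawler obeys only the most specific User-agent group that matches it, so shared rules must be repeated in the specific block (the paths here are illustrative):

```
# Default rules for all crawlers
User-agent: *
Disallow: /tmp/

# Googlebot uses ONLY this group, ignoring the * block above,
# so /tmp/ must be repeated here
User-agent: Googlebot
Disallow: /tmp/
Disallow: /experiments/
```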
Common Configuration Patterns
Allow Everything (Recommended Default)
```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```
This is the simplest and most common configuration. Unless you have a specific reason to block certain paths, start with this and add restrictions as needed.
Block Admin and Internal Paths
```
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /internal/
Disallow: /tmp/
Disallow: /checkout/

Sitemap: https://example.com/sitemap.xml
```
Pages that provide no value in search results, such as admin panels, API endpoints, and internal tools, should be excluded. This preserves your crawl budget for pages that matter.
Block Specific File Types
```
User-agent: *
Disallow: /*.pdf$
Disallow: /*.json$
```
You can use pattern matching to block entire file types. The * matches any string, and $ anchors the pattern to the end of the URL.
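As a sketch of these matching semantics, here is a small function that translates a robots.txt path pattern into a regular expression. This is a simplified model of Google-style matching for illustration, not a complete robots.txt implementation:

```typescript
// Simplified model of robots.txt path matching:
// '*' matches any sequence of characters, a trailing '$' anchors
// the pattern to the end of the URL. Patterns match from the start
// of the path.
function pathMatches(pattern: string, path: string): boolean {
  let regex = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex specials
    .replace(/\*/g, ".*");                  // '*' -> any string
  if (regex.endsWith("\\$")) {
    regex = regex.slice(0, -2) + "$";       // trailing '$' -> end anchor
  }
  return new RegExp("^" + regex).test(path);
}

console.log(pathMatches("/*.pdf$", "/reports/q3.pdf"));     // true
console.log(pathMatches("/*.pdf$", "/reports/q3.pdf?x=1")); // false: '$' anchors to the end
console.log(pathMatches("/admin/", "/admin/users"));        // true: prefix match
```

The trailing-anchor behavior is why `Disallow: /*.pdf$` blocks `/file.pdf` but not `/file.pdf?download=1`; drop the `$` if you want to cover query strings too.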
Block Everything (Staging Sites)
```
User-agent: *
Disallow: /
```
Use this only for staging or development environments that should never appear in search results. Accidentally deploying this to production is one of the most common and damaging robots.txt mistakes.
Managing AI Crawlers (GEO Perspective)
In recent years, AI companies have deployed their own crawlers to gather training data and power real-time search features; OpenAI's GPTBot and Google's Google-Extended, for example, both launched in 2023. robots.txt is currently the primary mechanism for controlling AI crawler access.
Major AI Crawlers
| Crawler | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data and API content |
| ChatGPT-User | OpenAI | ChatGPT browsing feature |
| ClaudeBot | Anthropic | Training data for Claude |
| PerplexityBot | Perplexity | Perplexity AI search |
| Bytespider | ByteDance | AI training for TikTok and others |
| Google-Extended | Google | AI training for Gemini |
Allowing AI Crawlers
If you want your content to appear in AI-powered search results and features like Google's AI Overviews, you should allow AI crawlers to access your site.
```
User-agent: *
Allow: /

# Explicitly allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://example.com/sitemap.xml
```
When User-agent: * already allows everything, the individual AI crawler entries are technically redundant. However, listing them explicitly signals your intent and makes your policy easy to understand at a glance.
Blocking Specific AI Crawlers
If you do not want your content used for AI training but still want it indexed by search engines, you can selectively block AI crawlers.
```
User-agent: *
Allow: /

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://example.com/sitemap.xml
```
Be aware that blocking AI crawlers reduces your chances of being cited in AI search results and AI Overviews. If improving your GEO score is a priority, keeping AI crawlers allowed is the better approach.
Setting Up and Testing robots.txt
File Location
robots.txt must be placed at the root of your domain; crawlers will not look for it anywhere else. Each subdomain is treated as a separate host and needs its own file.
```
https://example.com/robots.txt        ← Correct
https://example.com/files/robots.txt  ← Will not be read
```
Testing with Google Search Console
Google Search Console's robots.txt report (under Settings) shows which robots.txt files Google has fetched, when they were last crawled, and any parse errors or warnings. Review it after every change, and verify your rules before deploying them to production.
Implementation in Next.js
If you are using Next.js with the App Router, you can generate robots.txt programmatically by adding a robots.ts file to the app directory.
```typescript
import { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
      },
      {
        userAgent: 'GPTBot',
        allow: '/',
      },
      {
        userAgent: 'ClaudeBot',
        allow: '/',
      },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  };
}
```
This approach lets you dynamically adjust rules based on environment variables or other conditions.
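For instance, the decision logic could key off an environment variable, so staging and preview deployments block everything while production stays open. This is a minimal sketch; the environment-variable name and the `buildRobots` helper are illustrative assumptions, not a Next.js API. In a real app the branch would live inside app/robots.ts and return `MetadataRoute.Robots`:

```typescript
// Sketch: derive robots rules from an environment flag.
// NODE_ENV is an assumed convention; substitute your own variable.
type RobotsConfig = {
  rules: { userAgent: string; allow?: string; disallow?: string }[];
  sitemap?: string;
};

function buildRobots(env: string): RobotsConfig {
  if (env !== 'production') {
    // Staging and preview deployments: block everything so the
    // environment never gets indexed.
    return { rules: [{ userAgent: '*', disallow: '/' }] };
  }
  // Production: allow everyone and advertise the sitemap.
  return {
    rules: [{ userAgent: '*', allow: '/' }],
    sitemap: 'https://example.com/sitemap.xml',
  };
}

console.log(buildRobots(process.env.NODE_ENV ?? 'development'));
```

Centralizing the rules in one function like this also guards against the classic mistake of shipping a staging `Disallow: /` to production.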
Checking robots.txt with IndexReady
IndexReady's scoring tool evaluates your robots.txt configuration as part of the SEO category, with a maximum of 6 points. The check looks at whether:
- A robots.txt file exists at the root of the domain
- Basic directives are correctly formatted
- A Sitemap directive is included
In addition, the GEO category includes an "AI Crawler Permissions" check worth up to 12 points. This evaluates whether major AI crawlers like GPTBot, ClaudeBot, and PerplexityBot are allowed or blocked in your robots.txt. Together, these checks give you a clear picture of how well your robots.txt supports both traditional search and AI discovery.
Common Mistakes to Avoid
- Missing leading slash: Write `Disallow: /admin/`, not `Disallow: admin`. The leading slash is required for the path to be valid.
- Confusing Disallow with noindex: Blocking a page in robots.txt prevents crawling, but the page can still appear in search results if other sites link to it. To prevent indexing, use a `<meta name="robots" content="noindex">` tag instead.
- Deploying staging rules to production: Leaving `Disallow: /` in your production robots.txt will remove your entire site from search results. Always verify robots.txt after deployment.
- Blocking CSS and JavaScript: Modern search engines need access to your CSS and JS files to render pages correctly. Blocking these resources can hurt your rankings.
- Overly broad rules: A single `Disallow: /p` will block `/products/`, `/pricing/`, `/press/`, and any other path starting with `/p`. Be precise with your patterns.
Priority Rules
When multiple rules match the same URL, crawlers follow the most specific rule. For example:
```
User-agent: *
Disallow: /blog/
Allow: /blog/public/
```
In this case, /blog/public/featured-post is allowed because /blog/public/ is longer, and therefore more specific, than /blog/. Google and Bing both resolve conflicting rules this way. When in doubt, verify your rules in Google Search Console's robots.txt report.
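The longest-match resolution above can be sketched in a few lines. This simplified model uses plain prefix matching (no wildcards) and treats "no matching rule" as allowed, which mirrors the default-open behavior of the protocol:

```typescript
// Simplified "most specific rule wins" resolution: among all rules
// whose pattern is a prefix of the path, the longest pattern applies.
type Rule = { type: 'allow' | 'disallow'; pattern: string };

function isAllowed(rules: Rule[], path: string): boolean {
  let winner: Rule | null = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.pattern)) continue; // prefix match only
    if (!winner || rule.pattern.length > winner.pattern.length) {
      winner = rule; // longer pattern = more specific = higher priority
    }
  }
  return winner === null || winner.type === 'allow'; // no match => allowed
}

const rules: Rule[] = [
  { type: 'disallow', pattern: '/blog/' },
  { type: 'allow', pattern: '/blog/public/' },
];

console.log(isAllowed(rules, '/blog/public/featured-post')); // true
console.log(isAllowed(rules, '/blog/drafts/wip'));           // false
console.log(isAllowed(rules, '/about'));                     // true: no rule matches
```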
Summary
robots.txt is a small file with outsized impact. It shapes how search engines allocate their crawling resources across your site, and it is now the primary tool for managing AI crawler access. Getting it right improves crawl efficiency for SEO and opens the door to AI-powered discovery for GEO.
Start with a permissive configuration, block only what needs to be blocked, test thoroughly, and revisit your rules whenever your site structure or AI strategy changes.
FAQ
Does my site need a robots.txt file?
Technically, no. Without a robots.txt file, crawlers will assume they can access everything. However, having one allows you to declare your sitemap location, exclude irrelevant paths, and explicitly manage AI crawler access. Given that it takes minutes to set up, every site should have one.
What is the difference between robots.txt and the meta robots tag?
robots.txt controls crawling at the URL level, telling crawlers which paths they can visit. The meta robots tag controls indexing at the page level, telling search engines whether to include a page in their index. Use robots.txt to manage crawl budget and use meta robots to control which pages appear in search results.
Will blocking AI crawlers hurt my Google search rankings?
Blocking crawlers like GPTBot and ClaudeBot has no effect on your Google search rankings. However, blocking Google-Extended may affect how Google's AI features use your content. More importantly, blocking AI crawlers means your content is less likely to be cited in AI search results and AI Overviews, which reduces your visibility in the growing AI search ecosystem.
How often should I update robots.txt?
Review your robots.txt whenever you make significant changes to your site structure, such as adding new sections, changing URL patterns, or updating your AI content strategy. As a general practice, audit it every three to six months. Also check it whenever a new major AI crawler appears on the scene.