
How to Set Up robots.txt Correctly for SEO and AI Crawlers

What Is robots.txt?

robots.txt is a plain text file placed at the root of your website that tells web crawlers which pages they are allowed to visit. It is part of the Robots Exclusion Protocol, a standard that has been in use since the mid-1990s.

When a crawler like Googlebot arrives at your site, the first thing it does is check https://example.com/robots.txt for instructions. Based on what it finds, the crawler decides which URLs to fetch and which to skip.

It is important to understand that robots.txt is a set of guidelines, not a security mechanism. Well-behaved crawlers from Google, Bing, OpenAI, and Anthropic will respect your directives. Malicious bots may ignore them entirely. If you need to protect sensitive content, use authentication or server-level access controls instead.

robots.txt Syntax

The syntax is straightforward. You specify a User-agent to target a crawler, then use Allow and Disallow to define the rules.

# Allow all crawlers to access the entire site
User-agent: *
Allow: /

# Point crawlers to the sitemap
Sitemap: https://example.com/sitemap.xml

Core Directives

Directive    | Purpose
User-agent   | Specifies which crawler the rules apply to
Allow        | Permits crawling of a specified path
Disallow     | Blocks crawling of a specified path
Sitemap      | Indicates the location of your XML sitemap
Crawl-delay  | Sets a delay between requests (not supported by Google)

User-agent: * is a wildcard that applies to all crawlers. You can create separate rule blocks for specific crawlers by using their exact names.
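
Rules in a named User-agent block replace the wildcard block for that crawler, so a single file can combine a permissive default with stricter rules for one bot. A minimal sketch (the crawler name and blocked path here are illustrative):

```
User-agent: *
Allow: /

# Bingbot ignores the wildcard block above and uses only these rules
User-agent: Bingbot
Disallow: /search/
```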

Common Configuration Patterns

Allow Everything

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

This is the simplest and most common configuration. Unless you have a specific reason to block certain paths, start with this and add restrictions as needed.

Block Admin and Internal Paths

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /internal/
Disallow: /tmp/
Disallow: /checkout/

Sitemap: https://example.com/sitemap.xml

Pages that provide no value in search results, such as admin panels, API endpoints, and internal tools, should be excluded. This preserves your crawl budget for pages that matter.

Block Specific File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.json$

You can use pattern matching to block entire file types. The * matches any string, and $ anchors the pattern to the end of the URL.
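
The matching semantics can be made concrete with a short sketch. This is a simplified model of the documented wildcard behavior, not any crawler's actual implementation, and the function names patternToRegExp and isBlocked are illustrative:

```typescript
// Translate a robots.txt path pattern into a regular expression:
// * matches any sequence of characters, $ anchors the end of the URL,
// and everything else is treated literally.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split('')
    .map((ch) => {
      if (ch === '*') return '.*';
      if (ch === '$') return '$';
      // Escape regex metacharacters so they match literally
      return ch.replace(/[.+?^{}()|[\]\\]/g, '\\$&');
    })
    .join('');
  // Patterns are matched from the start of the URL path
  return new RegExp('^' + escaped);
}

function isBlocked(path: string, disallowPatterns: string[]): boolean {
  return disallowPatterns.some((p) => patternToRegExp(p).test(path));
}
```

For example, isBlocked('/docs/report.pdf', ['/*.pdf$']) is true, while a URL with a query string such as /docs/report.pdf?x=1 is not blocked, because $ requires the URL to end at that point.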

Block Everything (Staging Sites)

User-agent: *
Disallow: /

Use this only for staging or development environments that should never appear in search results. Accidentally deploying this to production is one of the most common and damaging robots.txt mistakes.

Managing AI Crawlers (GEO Perspective)

In recent years, AI companies have deployed their own crawlers to gather training data and power real-time search features. robots.txt is currently the primary mechanism for controlling AI crawler access.

Major AI Crawlers

Crawler          | Operator   | Purpose
GPTBot           | OpenAI     | Training data and API content
ChatGPT-User     | OpenAI     | ChatGPT browsing feature
ClaudeBot        | Anthropic  | Training data for Claude
PerplexityBot    | Perplexity | Perplexity AI search
Bytespider       | ByteDance  | AI training for TikTok and others
Google-Extended  | Google     | AI training for Gemini

Allowing AI Crawlers

If you want your content to appear in AI-powered search results and features like Google's AI Overviews, you should allow AI crawlers to access your site.

User-agent: *
Allow: /

# Explicitly allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://example.com/sitemap.xml

When User-agent: * already allows everything, the individual AI crawler entries are technically redundant. However, listing them explicitly signals your intent and makes your policy easy to understand at a glance.

Blocking Specific AI Crawlers

If you do not want your content used for AI training but still want it indexed by search engines, you can selectively block AI crawlers.

User-agent: *
Allow: /

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://example.com/sitemap.xml

Be aware that blocking AI crawlers reduces your chances of being cited in AI search results and Google's AI Overviews. If improving your GEO score is a priority, keeping AI crawlers allowed is the better approach.

Setting Up and Testing robots.txt

File Location

robots.txt must be placed at the root of your domain. No other location will work.

https://example.com/robots.txt       ← Correct
https://example.com/files/robots.txt  ← Will not be read

Testing with Google Search Console

Google Search Console's robots.txt report shows which robots.txt files Google has found for your site, when they were last crawled, and any parsing errors or warnings. Always verify your rules before deploying changes to production.

Implementation in Next.js

If you are using Next.js with the App Router, you can generate robots.txt programmatically by adding a robots.ts file to the app directory.

// app/robots.ts
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
      },
      {
        userAgent: 'GPTBot',
        allow: '/',
      },
      {
        userAgent: 'ClaudeBot',
        allow: '/',
      },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  };
}

This approach lets you dynamically adjust rules based on environment variables or other conditions.
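
As a sketch of that idea, the rules can branch on an environment variable. The variable name BLOCK_ALL_CRAWLERS is a hypothetical convention, and RobotsConfig is a local stand-in for Next.js's MetadataRoute.Robots type so the snippet runs standalone; in app/robots.ts you would import that type from 'next':

```typescript
// Local stand-in for Next.js's MetadataRoute.Robots type
type RobotsConfig = {
  rules: Array<{ userAgent: string; allow?: string; disallow?: string }>;
  sitemap?: string;
};

function buildRobots(env: Record<string, string | undefined>): RobotsConfig {
  // Staging deployments set BLOCK_ALL_CRAWLERS=true and get blocked wholesale,
  // so a staging environment can never leak into search indexes.
  if (env.BLOCK_ALL_CRAWLERS === 'true') {
    return { rules: [{ userAgent: '*', disallow: '/' }] };
  }
  // Production allows everything and advertises the sitemap.
  return {
    rules: [{ userAgent: '*', allow: '/' }],
    sitemap: 'https://example.com/sitemap.xml',
  };
}

// In app/robots.ts this would be the default export:
// export default function robots() { return buildRobots(process.env); }
```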

Checking robots.txt with IndexReady

IndexReady's scoring tool evaluates your robots.txt configuration as part of the SEO category, with a maximum of 6 points. The check looks at whether:

  • A robots.txt file exists at the root of the domain
  • Basic directives are correctly formatted
  • A Sitemap directive is included

In addition, the GEO category includes an "AI Crawler Permissions" check worth up to 12 points. This evaluates whether major AI crawlers like GPTBot, ClaudeBot, and PerplexityBot are allowed or blocked in your robots.txt. Together, these checks give you a clear picture of how well your robots.txt supports both traditional search and AI discovery.

Common Mistakes to Avoid

  • Missing leading slash: Write Disallow: /admin/, not Disallow: admin. The leading slash is required for the path to be valid.
  • Confusing Disallow with noindex: Blocking a page in robots.txt prevents crawling, but the page can still appear in search results if other sites link to it. To prevent indexing, use a <meta name="robots" content="noindex"> tag instead.
  • Deploying staging rules to production: Leaving Disallow: / in your production robots.txt will remove your entire site from search results. Always verify robots.txt after deployment.
  • Blocking CSS and JavaScript: Modern search engines need access to your CSS and JS files to render pages correctly. Blocking these resources can hurt your rankings.
  • Overly broad rules: A single Disallow: /p will block /products/, /pricing/, /press/, and any other path starting with /p. Be precise with your patterns.
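
The staging-rules mistake above is easy to catch automatically. Here is a minimal sketch of a deployment check that scans a fetched robots.txt body for a wildcard group that disallows the whole site; the function name and the deliberately simplified parsing (one pass, no handling of inline comments) are assumptions:

```typescript
// Detect a "Disallow: /" rule inside a "User-agent: *" group,
// i.e. the staging configuration accidentally shipped to production.
function blocksEntireSite(robotsTxt: string): boolean {
  let inWildcardGroup = false;
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim(); // re-join so URLs keep their colons
    if (key.toLowerCase() === 'user-agent') {
      inWildcardGroup = value === '*';
    } else if (inWildcardGroup && key.toLowerCase() === 'disallow' && value === '/') {
      return true;
    }
  }
  return false;
}
```

A CI step could fetch https://example.com/robots.txt after each deploy and fail the build when this function returns true.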

Priority Rules

When multiple rules match the same URL, crawlers follow the most specific rule. For example:

User-agent: *
Disallow: /blog/
Allow: /blog/public/

In this case, /blog/public/featured-post is allowed because /blog/public/ is a longer, more specific match than /blog/. Google and Bing follow this longest-match approach, with Allow winning ties. When in doubt, verify your rules with the robots.txt report in Google Search Console.
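
The longest-match resolution can be sketched in a few lines. This models the documented behavior for plain path prefixes only (no * or $ wildcards); the Rule type and the isAllowed name are illustrative:

```typescript
type Rule = { type: 'allow' | 'disallow'; path: string };

// Among all rules whose path is a prefix of the URL, the longest match wins;
// on a tie, Allow wins. A URL matching no rule is allowed by default.
function isAllowed(path: string, rules: Rule[]): boolean {
  let best: Rule | null = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === 'allow')
    ) {
      best = rule;
    }
  }
  return best === null || best.type === 'allow';
}
```

With the rules above, isAllowed('/blog/public/featured-post', rules) is true while isAllowed('/blog/drafts/post', rules) is false, matching the example.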

Summary

robots.txt is a small file with outsized impact. It shapes how search engines allocate their crawling resources across your site, and it is now the primary tool for managing AI crawler access. Getting it right improves crawl efficiency for SEO and opens the door to AI-powered discovery for GEO.

Start with a permissive configuration, block only what needs to be blocked, test thoroughly, and revisit your rules whenever your site structure or AI strategy changes.

FAQ

Does my site need a robots.txt file?

Technically, no. Without a robots.txt file, crawlers will assume they can access everything. However, having one allows you to declare your sitemap location, exclude irrelevant paths, and explicitly manage AI crawler access. Given that it takes minutes to set up, every site should have one.

What is the difference between robots.txt and the meta robots tag?

robots.txt controls crawling at the URL level, telling crawlers which paths they can visit. The meta robots tag controls indexing at the page level, telling search engines whether to include a page in their index. Use robots.txt to manage crawl budget and use meta robots to control which pages appear in search results.

Will blocking AI crawlers hurt my Google search rankings?

Blocking crawlers like GPTBot and ClaudeBot has no effect on your Google search rankings. However, blocking Google-Extended may affect how Google's AI features use your content. More importantly, blocking AI crawlers means your content is less likely to be cited in AI search results and Google's AI Overviews, which reduces your visibility in the growing AI search ecosystem.

How often should I update robots.txt?

Review your robots.txt whenever you make significant changes to your site structure, such as adding new sections, changing URL patterns, or updating your AI content strategy. As a general practice, audit it every three to six months. Also check it whenever a new major AI crawler appears on the scene.