AI Crawlers

A reference guide to the major AI crawlers currently active on the web — what each one does, and how to control their access through robots.txt.

Why This Matters

Each AI company operates its own crawler (or crawlers) with a distinct user-agent name. Controlling access at a per-crawler level — rather than a single blanket rule — lets you make deliberate choices about which AI systems can train on, search, or cite your content. See Content Signals for how to express more granular preferences per crawler.

OpenAI / ChatGPT

User-Agent | Purpose
GPTBot | Crawls content for model training
OAI-SearchBot | Powers ChatGPT's search/browsing features
ChatGPT-User | Fetches pages in real time when a user asks ChatGPT to browse a specific site

Anthropic / Claude

User-Agent | Purpose
ClaudeBot | General crawling, including potential training use
Claude-SearchBot | Powers Claude's web search feature
Claude-User | Fetches pages in real time when a user asks Claude to browse a specific site

Perplexity

User-Agent | Purpose
PerplexityBot | Crawls and indexes content to power Perplexity's answer engine

Google AI

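User-Agent | Purpose
Google-Extended | robots.txt control token (not a separate crawler) that governs whether content Google has crawled may be used for Gemini model training and grounding, separately from standard Googlebot search indexing

Note: Google-Extended does not control regular Google Search crawling. That is governed by the standard Googlebot rules in your robots.txt, and AI Overviews, as a Search feature, fall under those Googlebot rules as well.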

Meta AI

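User-Agent | Purpose
FacebookBot | Crawls public pages for Meta's language-model work
meta-externalagent | Meta's documented crawler for AI model training and product indexing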

Apple

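User-Agent | Purpose
Applebot | Crawls content for Siri, Spotlight, and Safari search features
Applebot-Extended | Control token (does not crawl on its own) for opting out of having Applebot-crawled content used to train the models behind Apple Intelligence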

Amazon

User-Agent | Purpose
Amazonbot | Crawls for Alexa and Amazon's AI systems

Cohere

User-Agent | Purpose
cohere-ai | Used in enterprise retrieval-augmented generation (RAG) pipelines

Example robots.txt Block

User-agent: GPTBot
Allow: /llms.txt
Allow: /semantic/
Allow: /markdown/
Content-Signal: ai-train=no, search=yes, ai-input=yes

User-agent: ClaudeBot
Allow: /llms.txt
Allow: /semantic/
Allow: /markdown/
Content-Signal: ai-train=no, search=yes, ai-input=yes

See robots.txt for the complete directive structure across all crawlers.
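
To sanity-check how a per-crawler block like the one above is interpreted, here is a minimal sketch using Python's standard-library urllib.robotparser. The "Disallow: /" line and the "User-agent: *" group are assumptions added for illustration (the example above lists only Allow rules), and the paths and helper names are hypothetical. This parser applies the first matching rule in file order and ignores directives it does not recognise, such as Content-Signal, so real crawlers may resolve conflicting rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the Allow lines mirror the example above; the
# "Disallow: /" line and the "User-agent: *" group are assumptions added so
# that the per-crawler difference is visible.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /llms.txt
Allow: /semantic/
Allow: /markdown/
Disallow: /
Content-Signal: ai-train=no, search=yes, ai-input=yes

User-agent: *
Allow: /
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text directly, without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def check_access(parser: RobotFileParser, agents, paths):
    """Return {(agent, path): bool} for every agent/path combination."""
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    results = check_access(
        parser,
        ["GPTBot", "Bingbot"],
        ["/llms.txt", "/private/report.html"],
    )
    for (agent, path), allowed in results.items():
        print(f"{agent:8} {path:22} {'allowed' if allowed else 'blocked'}")
```

Under these assumed rules, GPTBot is limited to the three allowed paths while any crawler without its own group falls through to the wildcard group.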

Standard (Non-AI) Search Crawlers

For comparison, these are the traditional search engine crawlers most sites already allow:

User-Agent | Engine
Googlebot | Google Search
Bingbot | Bing
Slurp | Yahoo
DuckDuckBot | DuckDuckGo

Verifying Crawler Activity

To see which of these crawlers are actually visiting your site:

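  • Google Search Console → Settings → Crawl Stats → covers Google's own crawlers only; "Other agent type" groups Google crawlers other than the main Googlebot types, so third-party AI crawlers such as GPTBot or ClaudeBot will not appear here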
  • Server access logs → filter by user-agent string for direct evidence
  • isitagentready.com → automated scan of your bot access configuration
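
The log-filtering step above can be sketched as follows. This is a minimal example, assuming access logs in the widely used "combined" format, where the user-agent is the last quoted field on each line. The sample log lines, IP addresses, user-agent strings, and function names are hypothetical. Google-Extended is left out because it is a robots.txt token rather than a request user-agent. User-agent strings can be spoofed, so treat the counts as indicative rather than proof.

```python
from collections import Counter

# User-agent tokens taken from the tables in this guide.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "FacebookBot", "Applebot",
    "Amazonbot", "cohere-ai",
]


def extract_user_agent(log_line: str) -> str:
    """Return the last quoted field of a combined-format line, or ''."""
    parts = log_line.rstrip().rsplit('"', 2)
    # A combined-format line ends with the closing quote of the user-agent.
    if len(parts) == 3 and parts[2] == "":
        return parts[1]
    return ""


def count_ai_crawler_hits(log_lines):
    """Count requests per AI crawler token (case-insensitive substring match)."""
    counts = Counter()
    for line in log_lines:
        user_agent = extract_user_agent(line).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts


# Hypothetical sample lines; the user-agent strings are illustrative only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2025:13:55:36 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '198.51.100.7 - - [10/Oct/2025:13:56:02 +0000] "GET /markdown/a.md HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [10/Oct/2025:13:57:11 +0000] "GET /semantic/ HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.44 - - [10/Oct/2025:13:58:40 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
]

if __name__ == "__main__":
    for token, hits in count_ai_crawler_hits(SAMPLE_LOG).most_common():
        print(f"{token}: {hits}")
```

In practice you would pass an open log file to count_ai_crawler_hits instead of the sample list; matching on the extracted user-agent field, rather than the whole line, avoids false hits from request paths that happen to contain a crawler name.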

See Validation for the full process.

This List Will Change

New AI crawlers appear regularly as the field evolves. AIA Matrix keeps its generated robots.txt templates updated as new, significant crawlers emerge — Professional plan users receive these updates automatically with each re-scan.
