Reference

AI crawler user agents.

Every AI bot worth knowing, grouped by what it actually does: live answer fetches, search indexing, or model training. The same catalog MentionScout uses to identify and verify crawlers.

Three kinds of AI crawler

"AI crawler traffic" is really three different kinds of traffic, usually from three different user agents per provider. Each class identifies itself separately in your logs and honors its own robots.txt rule, so blocking one does not block the others. Getting the grouping right is the difference between opting out of model training and accidentally disappearing from AI answers altogether.

Live answer fetchers

These bots fetch a page at the moment a user asks an AI assistant something. There is no index in between: block one of these and your pages stop appearing in that assistant's live answers.

User agent | Operator | What it does
ChatGPT-User | OpenAI | Fetches pages live while a ChatGPT user waits for an answer. Blocking it removes your pages from ChatGPT's live answers.
Claude-User | Anthropic | Claude's on-demand fetcher, used the moment a Claude user asks about your pages.
Perplexity-User | Perplexity | Perplexity's on-demand fetcher for user queries.
MistralAI-User | Mistral | On-demand fetches for Mistral's Le Chat.
Grok-DeepSearch | xAI | Grok's DeepSearch fetches pages while researching an answer.

AI search index crawlers

These are classic-style crawlers that build the indexes behind AI search features. They recrawl on a much tighter cycle than the old search mental model suggests: hours, not weeks.

User agent | Operator | What it does
OAI-SearchBot | OpenAI | Builds the index behind ChatGPT search.
Claude-SearchBot | Anthropic | Builds the index behind Claude's web search.
PerplexityBot | Perplexity | Builds Perplexity's search index.
GoogleOther | Google | Google's crawler for product teams outside classic Search, including AI features. Separate from Googlebot.
Amazonbot | Amazon | Feeds Alexa and Amazon's shopping assistant.
DuckAssistBot | DuckDuckGo | Powers DuckDuckGo's DuckAssist answers.
GrokBot | xAI | xAI's search crawler. Grok also fetches with ordinary browser user agents, so identified hits are a floor, not a ceiling.
xAI-Grok | xAI | Alternate xAI search token, same caveat as GrokBot.

Training crawlers

These collect content for training future models. Blocking them stops future training use but does not remove you from live answers or AI search. This group also holds the robots.txt-only control tokens.

User agent | Operator | What it does
GPTBot | OpenAI | Collects training data for OpenAI's future models. Does not affect ChatGPT answers today.
ClaudeBot | Anthropic | Collects training data for Anthropic's models.
Google-Extended (robots.txt token only) | Google | Not a crawler: a robots.txt token that controls whether Googlebot-crawled content may train Gemini.
Bytespider | ByteDance | ByteDance's training crawler. Historically aggressive and inconsistent about robots.txt.
Meta-ExternalAgent | Meta | Training data for Meta AI.
Applebot-Extended (robots.txt token only) | Apple | Not a crawler: a robots.txt token controlling AI training use of Applebot's crawl.
cohere-training-data-crawler | Cohere | Cohere's training-data crawler.
cohere-ai | Cohere | Older Cohere crawler token, still seen in logs.
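
If you want to apply this grouping to your own logs, the three tables can be folded into a small lookup. The following is a minimal Python sketch, not MentionScout's implementation: the `CRAWLER_PURPOSE` dictionary, the purpose labels, and the `classify` function are hypothetical names, and the matching is a plain case-insensitive substring test against the tokens listed above.

```python
from typing import Optional

# Tokens copied from the three tables above, keyed to an illustrative purpose label.
# Google-Extended and Applebot-Extended are left out on purpose: they are
# robots.txt tokens only, not crawlers, so they never show up as a user agent.
CRAWLER_PURPOSE = {
    # Live answer fetchers
    "ChatGPT-User": "live answer",
    "Claude-User": "live answer",
    "Perplexity-User": "live answer",
    "MistralAI-User": "live answer",
    "Grok-DeepSearch": "live answer",
    # AI search index crawlers
    "OAI-SearchBot": "search index",
    "Claude-SearchBot": "search index",
    "PerplexityBot": "search index",
    "GoogleOther": "search index",
    "Amazonbot": "search index",
    "DuckAssistBot": "search index",
    "GrokBot": "search index",
    "xAI-Grok": "search index",
    # Training crawlers
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "Meta-ExternalAgent": "training",
    "cohere-training-data-crawler": "training",
    "cohere-ai": "training",
}


def classify(user_agent: str) -> Optional[str]:
    """Return the purpose label for a user agent string, or None if no token matches."""
    ua = user_agent.lower()
    # Longest tokens first, so a longer token always wins over a shorter one.
    for token in sorted(CRAWLER_PURPOSE, key=len, reverse=True):
        if token.lower() in ua:
            return CRAWLER_PURPOSE[token]
    return None
```

A match here only says what the request claims to be. It says nothing about whether the claim is genuine.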

What to put in robots.txt

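The most common intent is "stay visible in AI answers, opt out of model training". That means leaving the live fetchers and search index crawlers alone and disallowing only the training group. The example covers the most common training tokens; the two Cohere tokens from the training table can be blocked the same way: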

# Opt out of AI training, stay in AI answers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

The mistake to avoid is the blanket rule. A User-agent: * disallow, or a copied "block all AI bots" list, also turns away ChatGPT-User and OAI-SearchBot, and with them every citation and referral click those answers would have sent you.
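
One way to catch this before deploying is to run the rules through a parser and ask what each bot would be allowed to fetch. The sketch below uses Python's standard-library `urllib.robotparser`. The `allowed_agents` helper, the two sample rule sets, and the `https://example.com/` URL are illustrative assumptions, and Python's matching rules are not guaranteed to mirror how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the training opt-out shown above.
TRAINING_OPT_OUT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

# The blanket rule the text warns against.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str],
                   url: str = "https://example.com/") -> dict[str, bool]:
    """Map each user agent token to whether these rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    bots = ["GPTBot", "ChatGPT-User", "OAI-SearchBot"]
    print("training opt-out:", allowed_agents(TRAINING_OPT_OUT, bots))
    print("blanket block:   ", allowed_agents(BLANKET_BLOCK, bots))
```

Under the opt-out rules only GPTBot is refused. Under the blanket rule all three are, including the two that feed ChatGPT answers.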

Verifying that a crawler is real

User agent strings are trivially spoofed, and scrapers routinely impersonate GPTBot to borrow its reputation. OpenAI, Anthropic, Perplexity and Google publish the IP ranges their crawlers use, so a hit only counts if the source IP falls inside the published range. MentionScout does this check automatically and marks every visit verified or not; if you are reading raw logs, fetch the published JSON ranges and match before trusting a hit.
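
For raw logs, the match itself is a few lines with Python's `ipaddress` module. This is a sketch under stated assumptions rather than MentionScout's check. It assumes the ranges file has a top-level `prefixes` list whose entries carry an `ipv4Prefix` or `ipv6Prefix` key (the layout Google's published files use; confirm it for each operator). The sample payload contains documentation-only placeholder ranges, not real crawler IPs, and the function names are hypothetical.

```python
import ipaddress
import json

# Placeholder payload: reserved documentation ranges, NOT real crawler IPs.
# In practice you would load the JSON file the operator publishes.
SAMPLE_RANGES_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""


def load_networks(ranges_json: str) -> list:
    """Parse a ranges file (assumed layout) into a list of ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_verified(source_ip: str, networks: list) -> bool:
    """True if the request's source IP falls inside any published range."""
    ip = ipaddress.ip_address(source_ip)
    # An IPv4 address is never "in" an IPv6 network and vice versa,
    # so mixed lists are safe to scan.
    return any(ip in network for network in networks)


if __name__ == "__main__":
    nets = load_networks(SAMPLE_RANGES_JSON)
    for candidate in ("192.0.2.17", "198.51.100.5", "2001:db8::1"):
        print(candidate, is_verified(candidate, nets))
```

Published ranges change, so reload the file on a schedule instead of hard-coding its contents.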

FAQ

Common questions

Should I block GPTBot?

Blocking GPTBot only opts your content out of training OpenAI's future models. It does not remove you from ChatGPT answers: those come through ChatGPT-User (live fetches) and OAI-SearchBot (search index). If AI visibility is a goal, most sites leave all three open. If you only want to opt out of training, block GPTBot alone and keep the other two.

What is the difference between GPTBot, OAI-SearchBot and ChatGPT-User?

Same company, three jobs. GPTBot collects training data for future models. OAI-SearchBot builds the search index ChatGPT browses. ChatGPT-User fetches pages live while a user waits for an answer. Each honors its own robots.txt rule, so you can allow answers while opting out of training.

Do AI crawlers respect robots.txt?

The major operators (OpenAI, Anthropic, Google, Perplexity, Apple) document their tokens and largely honor them. Some, Bytespider most notably, have a poor record. And Grok often fetches with ordinary browser user agents, which robots.txt cannot address. The only way to know what actually hits your site is to watch your logs.

How do I see which AI crawlers visit my site?

Grep your server logs for the user agents on this page, or connect your site to MentionScout and get every AI crawler hit on a timeline, IP-verified and split by purpose, with the citations and referral clicks that follow.
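
If you go the log route, a short script gives you counts instead of raw matches. The sketch below assumes the common "combined" access log format, where the user agent is the last quoted field on each line. The `TOKENS` tuple is a deliberately short subset of this page's catalog, and `count_ai_hits` is a hypothetical helper name. The counts are unverified: they reflect what each request claimed to be, not where it came from.

```python
import re
from collections import Counter

# Illustrative subset; extend with any other tokens from this page.
TOKENS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")

# In the combined log format the user agent is the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines) -> Counter:
    """Count log lines per AI crawler token, based on the user agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for token in TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits
```

To run it against a real file, pass the open file object, for example `count_ai_hits(open("access.log"))`, where `access.log` stands in for your own log path.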
