Reference
AI crawler user agents.
Every AI bot worth knowing, grouped by what it actually does: live answer fetches, search indexing, or model training. The same catalog MentionScout uses to identify and verify crawlers.
Three kinds of AI crawler
"AI crawler traffic" is really three different kinds of traffic, usually from three different user agents per provider. Each class identifies itself separately in your logs and honors its own robots.txt rule, so blocking one does not block the others. Getting the grouping right is the difference between opting out of model training and accidentally disappearing from AI answers altogether.
Live answer fetchers
These bots fetch a page at the moment a user asks an AI assistant something. There is no index in between: block one of these and your pages stop appearing in that assistant's live answers.
| User agent | Operator | What it does |
|---|---|---|
| ChatGPT-User | OpenAI | Fetches pages live while a ChatGPT user waits for an answer. Blocking it removes your pages from ChatGPT's live answers. |
| Claude-User | Anthropic | Claude's on-demand fetcher, used the moment a Claude user asks about your pages. |
| Perplexity-User | Perplexity | Perplexity's on-demand fetcher for user queries. |
| MistralAI-User | Mistral | On-demand fetches for Mistral's Le Chat. |
| Grok-DeepSearch | xAI | Grok's DeepSearch fetches pages while researching an answer. |
AI search index crawlers
Classic-style crawlers that build the indexes behind AI search features, on a much tighter cycle than the old search mental model suggests: hours, not weeks.
| User agent | Operator | What it does |
|---|---|---|
| OAI-SearchBot | OpenAI | Builds the index behind ChatGPT search. |
| Claude-SearchBot | Anthropic | Builds the index behind Claude's web search. |
| PerplexityBot | Perplexity | Builds Perplexity's search index. |
| GoogleOther | Google | Google's crawler for product teams outside classic Search, including AI features. Separate from Googlebot. |
| Amazonbot | Amazon | Feeds Alexa and Amazon's shopping assistant. |
| DuckAssistBot | DuckDuckGo | Powers DuckDuckGo's DuckAssist answers. |
| GrokBot | xAI | xAI's search crawler. Grok also fetches with ordinary browser user agents, so identified hits are a floor, not a ceiling. |
| xAI-Grok | xAI | Alternate xAI search token, same caveat as GrokBot. |
Training crawlers
These collect content for training future models. Blocking them stops future training use but does not remove you from live answers or AI search. This group also holds the robots.txt-only control tokens.
| User agent | Operator | What it does |
|---|---|---|
| GPTBot | OpenAI | Collects training data for OpenAI's future models. Does not affect ChatGPT answers today. |
| ClaudeBot | Anthropic | Collects training data for Anthropic's models. |
| Google-Extended (robots.txt token only) | Google | Not a crawler: a robots.txt token that controls whether Googlebot-crawled content may train Gemini. |
| Bytespider | ByteDance | ByteDance's training crawler. Historically aggressive and inconsistent about robots.txt. |
| Meta-ExternalAgent | Meta | Training data for Meta AI. |
| Applebot-Extended (robots.txt token only) | Apple | Not a crawler: a robots.txt token controlling AI training use of Applebot's crawl. |
| cohere-training-data-crawler | Cohere | Cohere's training-data crawler. |
| cohere-ai | Cohere | Older Cohere crawler token, still seen in logs. |
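The three groups above can be turned into a small lookup for labeling log hits. The sketch below is one way to do it in Python, not MentionScout's implementation: the class labels (`live_fetch`, `search_index`, `training`) and the `classify` function are names invented for this example, and it assumes a case-insensitive substring match on the user agent string is good enough. Google-Extended and Applebot-Extended are left out because they are robots.txt tokens rather than crawlers.

```python
from typing import Optional

# Tokens copied from the three tables above. The class labels are
# hypothetical names chosen for this sketch.
CRAWLER_CLASSES = {
    "live_fetch": [
        "ChatGPT-User", "Claude-User", "Perplexity-User",
        "MistralAI-User", "Grok-DeepSearch",
    ],
    "search_index": [
        "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "GoogleOther",
        "Amazonbot", "DuckAssistBot", "GrokBot", "xAI-Grok",
    ],
    "training": [
        "GPTBot", "ClaudeBot", "Bytespider", "Meta-ExternalAgent",
        "cohere-training-data-crawler", "cohere-ai",
    ],
}


def classify(user_agent: str) -> Optional[str]:
    """Return the crawler class for a user agent string, or None if unknown.

    Assumption: a case-insensitive substring match is sufficient.
    """
    ua = user_agent.lower()
    for crawler_class, tokens in CRAWLER_CLASSES.items():
        if any(token.lower() in ua for token in tokens):
            return crawler_class
    return None


if __name__ == "__main__":
    # Illustrative user agent strings, not copied from any operator's docs.
    for sample in (
        "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
        "Mozilla/5.0 (compatible; GPTBot/1.2)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ):
        print(sample, "->", classify(sample))
```

A user agent match only says what the request claims to be. It says nothing about whether the claim is genuine.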
What to put in robots.txt
The most common intent is "stay visible in AI answers, opt out of model training". That means leaving the live fetchers and search index crawlers alone and disallowing only the training group:
# Opt out of AI training, stay in AI answers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
The mistake to avoid is the blanket rule. A User-agent: * disallow, or a copied "block all AI bots" list, also turns away ChatGPT-User and OAI-SearchBot, and with them every citation and referral click those answers would have sent you.
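Before deploying a rule set, it can help to check what it allows and blocks. The sketch below uses Python's standard `urllib.robotparser` to compare the training-only opt-out with a blanket disallow. The `allowed` helper and the `https://example.com/` URL are placeholders for this example, and Python's parser is only one interpretation of robots.txt: each operator's crawler may match rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The training-only opt-out from the section above.
TRAINING_OPT_OUT = """\
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
"""

# The blanket rule the text warns against.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/") -> bool:
    """Hypothetical helper: would this robots.txt let user_agent fetch url?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "ChatGPT-User", "OAI-SearchBot"):
        print(bot,
              "| training opt-out:", allowed(TRAINING_OPT_OUT, bot),
              "| blanket block:", allowed(BLANKET_BLOCK, bot))
```

Under this parser the training opt-out blocks GPTBot and leaves ChatGPT-User and OAI-SearchBot allowed, while the blanket rule blocks all three.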
Verifying that a crawler is real
User agent strings are trivially spoofed, and scrapers routinely impersonate GPTBot to borrow its reputation. OpenAI, Anthropic, Perplexity and Google publish the IP ranges their crawlers use, so a hit only counts if the source IP falls inside the published range. MentionScout does this check automatically and marks every visit verified or not; if you are reading raw logs, fetch the published JSON ranges and match before trusting a hit.
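For raw logs, the range match can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not MentionScout's check. It assumes the published file has a top-level `prefixes` list whose entries carry an `ipv4Prefix` or `ipv6Prefix` key, so confirm the actual shape of each operator's file before relying on it. The `load_prefixes` and `is_verified` names are invented here, and the sample ranges are reserved documentation addresses, not real crawler ranges.

```python
import ipaddress
import json
from typing import Iterable, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Placeholder data using reserved documentation ranges.
# Assumed shape: {"prefixes": [{"ipv4Prefix": ...}, {"ipv6Prefix": ...}]}
SAMPLE_RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_prefixes(ranges_json: str) -> List[Network]:
    """Parse a ranges file (assumed shape above) into network objects."""
    data = json.loads(ranges_json)
    networks: List[Network] = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_verified(source_ip: str, networks: Iterable[Network]) -> bool:
    """True if source_ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(source_ip)
    # Compare only like with like: IPv4 against IPv4, IPv6 against IPv6.
    return any(addr.version == net.version and addr in net for net in networks)


if __name__ == "__main__":
    nets = load_prefixes(SAMPLE_RANGES_JSON)
    for ip in ("192.0.2.17", "198.51.100.5", "2001:db8::1"):
        print(ip, "verified" if is_verified(ip, nets) else "not verified")
```

In practice the JSON would come from the operator's published file, refreshed periodically, since ranges change over time.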
FAQ
Common questions
Should I block GPTBot?
Blocking GPTBot only opts your content out of training OpenAI's future models. It does not remove you from ChatGPT answers: those come through ChatGPT-User (live fetches) and OAI-SearchBot (search index). If AI visibility is a goal, most sites leave all three open. If you only want to opt out of training, block GPTBot alone and keep the other two.
What is the difference between GPTBot, OAI-SearchBot and ChatGPT-User?
Same company, three jobs. GPTBot collects training data for future models. OAI-SearchBot builds the search index ChatGPT browses. ChatGPT-User fetches pages live while a user waits for an answer. Each honors its own robots.txt rule, so you can allow answers while opting out of training.
Do AI crawlers respect robots.txt?
The major operators (OpenAI, Anthropic, Google, Perplexity, Apple) document their tokens and largely honor them. Some, Bytespider most notably, have a poor record. And Grok often fetches with ordinary browser user agents, which robots.txt cannot address. The only way to know what actually hits your site is to watch your logs.
How do I see which AI crawlers visit my site?
Grep your server logs for the user agents on this page, or connect your site to MentionScout and get every AI crawler hit on a timeline, IP-verified and split by purpose, with the citations and referral clicks that follow.
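As a rough equivalent of the grep approach, the Python sketch below tallies hits per token from access log lines. It assumes the common "combined" log format, where the user agent is the last double-quoted field. `count_hits`, the short `TOKENS` tuple, and the sample lines are illustrative only. Counts from a sketch like this are unverified: they reflect what each request claimed to be, not whether its IP checks out.

```python
import re
from collections import Counter
from typing import Iterable, Sequence

# Assumption: combined log format, user agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

# A few tokens from this page; extend with the rest as needed.
TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")


def count_hits(lines: Iterable[str], tokens: Sequence[str] = TOKENS) -> Counter:
    """Count log lines whose user agent field contains each token."""
    counts: Counter = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line does not end with a quoted field
        user_agent = match.group(1).lower()
        for token in tokens:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts


# Made-up log lines using documentation IP addresses.
SAMPLE_LOG = [
    '203.0.113.9 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:05 +0000] "GET /a HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.4 - - [01/Jan/2025:00:01:00 +0000] "GET /b HTTP/1.1" 200 128 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '192.0.2.77 - - [01/Jan/2025:00:02:00 +0000] "GET /c HTTP/1.1" 200 256 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

if __name__ == "__main__":
    for token, hits in count_hits(SAMPLE_LOG).most_common():
        print(f"{token}: {hits}")
```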