The Great AI Bot Blockade: Why Major News Publishers Are Shutting the Door on AI Crawlers
The tension between AI companies and content creators has reached a new inflection point. A comprehensive study by BuzzStream reveals that the overwhelming majority of major news publishers are actively blocking AI bots from accessing their content, but the situation is more nuanced than it first appears.
The Numbers Tell a Stark Story
According to BuzzStream’s analysis of robots.txt files from 100 top news sites across the United States and United Kingdom, 79% of publishers block at least one AI training bot. But here’s where things get interesting: 71% also block retrieval bots that AI tools use for real-time citations when answering user queries.
This distinction matters more than you might think. Training bots gather content to build AI models, while retrieval bots fetch content in real time when users ask questions. A publisher that blocks retrieval bots won't show up when AI assistants go looking for sources to cite, even if its content was already used to train the underlying model.
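To make that concrete, here's a minimal sketch using Python's standard-library robots.txt parser. The rules are hypothetical, not any publisher's actual file, but they show how a site can shut out a training bot like GPTBot while still letting a retrieval bot like OAI-SearchBot fetch pages:

```python
# Illustrative only: a robots.txt that blocks a training bot but allows a
# retrieval bot, checked the way a well-behaved crawler would check it.
import urllib.robotparser

ROBOTS_TXT = """
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
""".strip()

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot (training) is told to stay out; OAI-SearchBot (retrieval) may fetch pages.
print(parser.can_fetch("GPTBot", "https://example.com/article"))         # False
print(parser.can_fetch("OAI-SearchBot", "https://example.com/article"))  # True
```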
Which Bots Are Getting Blocked?
The study categorized bots into three groups: training, retrieval/live search, and indexing. The blocking patterns reveal publishers’ priorities and concerns:
Training Bots: The Expected Targets
Common Crawl’s CCBot topped the blocked list at 75%, followed closely by Anthropic-ai at 72%, ClaudeBot at 69%, and GPTBot at 62%. These bots gather vast amounts of web content to train large language models.
Interestingly, Google-Extended was the least-blocked training bot at 46%. This bot trains Google's Gemini model, and the geographic split is revealing: US publishers blocked it at 58%, double the 29% rate among UK publishers. This disparity suggests different risk assessments or business relationships between American and British media companies.
Retrieval Bots: The Hidden Impact
While blocking training bots prevents future AI models from learning from your content, blocking retrieval bots has an immediate impact on visibility in AI-generated answers:
- Claude-Web: 66% blocked
- OpenAI’s OAI-SearchBot (powers ChatGPT live search): 49% blocked
- ChatGPT-User: 40% blocked
- Perplexity-User: 17% blocked (least blocked)
Publishers blocking these retrieval bots essentially opt out of being cited in AI search results, potentially sacrificing visibility for content protection.
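If you're wondering how numbers like these get measured, the underlying check is simple: fetch a site's robots.txt and test each AI user agent against it. Here's a rough sketch of that kind of audit; the site URL is a placeholder and the bot list is pulled from the study above:

```python
# Rough sketch of a robots.txt audit: which AI user agents does a site disallow?
import urllib.robotparser

AI_BOTS = [
    "CCBot", "anthropic-ai", "ClaudeBot", "GPTBot", "Google-Extended",  # training
    "Claude-Web", "OAI-SearchBot", "ChatGPT-User", "Perplexity-User",   # retrieval
]

def audit(site: str) -> dict[str, bool]:
    base = site.rstrip("/")
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{base}/robots.txt")
    parser.read()  # network call; a missing robots.txt is treated as "allow all"
    # A bot counts as blocked if it may not fetch the homepage.
    return {bot: not parser.can_fetch(bot, f"{base}/") for bot in AI_BOTS}

if __name__ == "__main__":
    for bot, blocked in audit("https://example.com").items():
        print(f"{bot:16} {'blocked' if blocked else 'allowed'}")
```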
The Robots.txt Problem: A “Please Keep Out” Sign Without a Lock
Here’s the uncomfortable truth that everyone in this space knows but rarely discusses: robots.txt is a directive, not enforcement. It’s essentially a polite request that bots are free to ignore.
Harry Clarkson-Bennett, SEO Director at The Telegraph, put it plainly: “The robots.txt file is a directive. It’s like a sign that says please keep out, but doesn’t stop a disobedient or maliciously wired robot. Lots of them flagrantly ignore these directives.”
This isn't just a theoretical concern. Cloudflare documented that Perplexity used stealth crawling tactics to bypass robots.txt restrictions, including rotating IP addresses, changing ASNs, and spoofing its user agent to appear as a legitimate browser. Cloudflare subsequently delisted Perplexity as a verified bot and now actively blocks it (though Perplexity disputed these claims).
For publishers serious about blocking AI crawlers, CDN-level blocking or bot fingerprinting may be necessary beyond simple robots.txt directives.
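What does enforcement beyond robots.txt look like? Here's a deliberately simplified sketch of the idea at the application layer: refuse requests whose User-Agent matches known AI crawlers. In practice this rule usually lives in a CDN or WAF rather than on the origin server, and, as the Perplexity episode shows, user agents can be spoofed, so treat it as a first line of defense rather than a guarantee:

```python
# Simplified sketch: block requests from known AI crawler user agents.
# Real deployments pair this with IP/ASN verification or bot fingerprinting,
# since a User-Agent string is trivially spoofable.
import re
from wsgiref.simple_server import make_server

BLOCKED_UA = re.compile(r"GPTBot|CCBot|ClaudeBot|anthropic-ai|PerplexityBot", re.I)

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    if BLOCKED_UA.search(ua):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"AI crawling not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human (or well-disguised bot).\n"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```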
Why Are Publishers Blocking AI Bots?
Clarkson-Bennett summed up the publisher perspective: “Publishers are blocking AI bots using the robots.txt because there’s almost no value exchange. LLMs are not designed to send referral traffic and publishers (still!) need traffic to survive.”
This gets to the heart of the issue. Traditional search engines drove traffic to publisher sites, creating an ecosystem where content creators benefited from discovery. AI assistants, by contrast, synthesize information and answer questions directly, potentially eliminating the need for users to visit original sources.
Publishers face a difficult calculation: allow AI systems to use their content with minimal attribution and no traffic, or block them and risk becoming invisible in an AI-powered search future.
The Strategic Dilemma
Only 14% of sites blocked all AI bots tracked in the study, while 18% blocked none. The majority are making selective decisions about which bots to block, suggesting publishers are weighing multiple factors:
- Training vs. Retrieval: Do you block future model development, current citations, or both?
- Platform Relationships: Is blocking Google-Extended worth potentially damaging your relationship with the world’s largest search engine?
- Visibility Trade-offs: If AI assistants become primary search interfaces, does blocking them mean digital invisibility?
- Enforcement Reality: Given that robots.txt can be ignored, is blocking even effective?
What This Means for the Future
The retrieval bot blocking numbers deserve particular attention. Publishers aren’t just opting out of training future AI models—they’re opting out of the citation layer that AI search tools use to surface sources right now.
OpenAI separates its crawlers by function: GPTBot gathers training data, while OAI-SearchBot powers live search in ChatGPT. Perplexity makes a similar distinction between PerplexityBot for indexing and Perplexity-User for retrieval. Blocking one doesn’t block the other, which means publishers must make multiple, separate decisions about AI access.
Cloudflare’s Year in Review found that GPTBot, ClaudeBot, and CCBot had the highest number of full disallow directives across top domains. The report also noted that most publishers use partial blocks for Googlebot and Bingbot rather than full blocks, reflecting the dual role these crawlers play in both search indexing and AI training.
The Bottom Line
The landscape of AI and content is still being negotiated. Publishers are using the tools available to them (primarily robots.txt) to assert control over their content, even knowing those tools have significant limitations. Meanwhile, AI companies are developing separate bots for training and retrieval, adding complexity to publisher decisions.
For those tracking AI visibility, watch the retrieval bot category closely. Training blocks affect future models, while retrieval blocks affect whether your content appears in AI answers today.
As AI search becomes more prevalent, publishers face a choice: participate in an ecosystem where their content may be used without significant compensation or traffic, or risk becoming invisible in an AI-powered information landscape. Neither option is particularly appealing, which explains why blocking patterns vary so widely across publishers.
The only certainty is that this tension between content creators and AI companies isn’t going away anytime soon. The question isn’t whether the relationship will evolve, but how—and whether there’s a path forward that works for both sides.
What’s your take on this? Should publishers block AI bots, or is participation in AI search the new cost of digital visibility? Let us know in the comments.