Security · Updated 20 February 2026 · 8 min read · Originally published January 2026

AI Crawlers Violate robots.txt on 72% of UK Sites, Cloudflare Data Shows

Cloudflare's new robots.txt compliance tracker reveals which AI crawlers respect your rules and which ignore them. Testing across 47 UK business sites: 72% had violations. Sites with AI discovery files saw 43% fewer violation attempts.

Mark McNeece, Founder & Managing Director, 365i
[Image: Cloudflare dashboard showing AI crawler robots.txt violation data across multiple UK business websites]

Cloudflare released a robots.txt compliance tracking feature on 21 October 2025 that does something no tool has done before: it shows you exactly which AI crawlers respect your website's access rules, and which ignore them entirely.

We tested it across 47 UK business websites between 25 October and 15 November 2025. The results? 72% of sites had AI crawlers violating their robots.txt directives. Not once or twice. On average, 156 violation requests per site over three weeks.

For businesses relying on robots.txt to keep customer data, pricing information, and internal documentation out of AI training datasets, this isn't a theoretical problem. It's happening right now. If you're not sure whether your robots.txt is even configured correctly, our free Robots.txt Checker parses every rule and flags common mistakes.
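For reference, the sites in our test blocked crawlers with rules along these lines. This is a minimal sketch: GPTBot, ClaudeBot, and CCBot are real published crawler names, but the blocked paths are illustrative placeholders, not a recommended set.

```
# Block named AI crawlers from sensitive directories (paths are examples)
User-agent: GPTBot
Disallow: /client-portal/
Disallow: /internal-docs/

User-agent: ClaudeBot
Disallow: /client-portal/
Disallow: /internal-docs/

# Block Common Crawl's crawler entirely
User-agent: CCBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Disallow:
```

As the rest of this article shows, those Disallow lines are a request, not a barrier.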

What Cloudflare's Data Revealed

The new Robots.txt tab in Cloudflare's AI Crawl Control dashboard monitors every AI crawler request against your robots.txt rules. It tracks total requests, HTTP status codes, whether files contain Content Signals directives, and the exact crawlers requesting paths you've explicitly blocked.

For each violation, you see the crawler operator, the violated path, which directive was ignored, and how many times. That level of detail didn't exist before October 2025.

"When you identify non-compliant crawlers, you can block the crawler in the Crawlers tab, create custom WAF rules for path-specific security, or use Redirect Rules to guide crawlers to appropriate areas of your site."

Cloudflare, AI Crawl Control Changelog, 21 October 2025

I remember reading through those first few days of compliance data and being properly taken aback. We'd assumed robots.txt was working. The data said otherwise. After running a hosting company since 2001, you'd think very little about crawler behaviour would surprise me. This did.

Cloudflare also introduced custom HTTP 402 "Payment Required" responses for paid plan customers back in August 2025. Instead of blocking AI crawlers with a standard 403 Forbidden, you can return a 402 with your licensing contact details. Several agencies are already using this for content licensing negotiations.
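In rough terms, a crawler hitting a protected URL would see something like the sketch below. The status line is standard HTTP; the body text and contact address are placeholders for whatever you configure in the Cloudflare dashboard.

```
HTTP/1.1 402 Payment Required
Content-Type: text/plain

Automated AI access to this content requires a licence.
Licensing enquiries: licensing@example.co.uk
```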

What 47 UK Business Sites Showed Us

We selected 47 sites from our existing client base, all on WordPress hosting with Cloudflare integration. Every site had active robots.txt files explicitly blocking AI crawlers from directories containing sensitive content. Then we watched the dashboard for three weeks.

The findings:

  • 34 out of 47 sites (72%) recorded at least one AI crawler violation of explicit Disallow rules
  • 89% of violations targeted paths containing customer data, pricing structures, or internal documentation
  • Three specific AI crawlers accounted for 67% of all recorded violations
  • 156 requests per site on average over the 21-day monitoring period
  • Sites with AI discovery files (llms.txt, ai.txt, brand.txt) experienced 43% fewer violation attempts

One case stood out. A Manchester law firm specialising in family law had blocked their entire /client-portal/ directory through robots.txt. Over three weeks, AI crawlers made 847 requests to URLs within that blocked directory. Not the directory root. Specific pages deep within the restricted path. That pattern suggests systematic, not accidental, access.

Why This Matters Beyond Technical Curiosity

Three areas where AI crawler violations create real problems for UK businesses.

Competitive intelligence. If AI crawlers are accessing your pricing pages despite robots.txt blocks, that data potentially feeds into AI training datasets. Competitors asking ChatGPT "what does [your company] charge for X?" might get answers you'd rather they didn't have.

GDPR compliance. The regulation requires "appropriate technical and organisational measures" to protect personal data. If AI crawlers are accessing customer data despite your robots.txt blocks, relying only on robots.txt may not count as "appropriate measures" under UK GDPR. That's a legal question worth raising with your data protection officer.

Content licensing. Creative agencies and publishers whose work appears in AI training data have limited options if their primary defence (robots.txt) is being ignored. Cloudflare's HTTP 402 response is one option, but enforcement remains an open question. The House of Lords AI copyright report is now pushing for a licensing-first framework that would give website owners real legal teeth.

"Robots.txt and llms.txt have different purposes. robots.txt is generally used to let automated tools know what access to a site is considered acceptable... llms.txt information will often be used on demand when a user explicitly requests information."

llmstxt.org, Official Specification

This distinction crystallised something I'd been mulling over for months. We'd been treating robots.txt and AI discovery files as the same category of thing. They're not. robots.txt says "don't go here." AI discovery files say "here's what you should know about us." One is a lock; the other is an introduction. They're separate layers of web infrastructure, each solving a different problem, much as sitemaps complement robots.txt rather than replace it. Effective protection probably needs both.

Three Protection Strategies That Work

Based on our three-week monitoring across all 47 sites, three approaches showed measurable results.

1. AI discovery files alongside robots.txt. Sites with proper AI discovery files (llms.txt, ai.txt, brand.txt, and the other seven files in the set) saw 43% fewer crawler violations than sites with only robots.txt. It appears AI systems respect sites that speak their language, even when they ignore traditional access rules. Our step-by-step guide to creating AI files covers the implementation details.

2. WAF rules for enforcement. robots.txt is a polite request. Web Application Firewall (WAF) rules are actual enforcement: the request gets blocked before any content is served. For paths containing data you absolutely cannot have crawled (customer portals, pricing engines, internal documentation), WAF rules are the answer; see the sketch after this list. Our secure hosting infrastructure includes this protection by default.

3. Tiered access controls. Not all content needs the same protection level. Public marketing pages can be crawled freely. Service descriptions should be available to AI via discovery files. Customer data and proprietary pricing need WAF-level blocking. Matching protection levels to content sensitivity is more effective than blanket blocking.
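To illustrate strategy 2, a custom WAF rule in Cloudflare's expression language might look like the sketch below. The paths are placeholders, and cf.verified_bot_category is Cloudflare's verified-bot classification field; confirm the exact field name and category values against current Cloudflare documentation before deploying, and set the rule action to Block.

```
(http.request.uri.path contains "/client-portal/"
 or http.request.uri.path contains "/internal-docs/")
and (
  cf.verified_bot_category eq "AI Crawler"
  or http.user_agent contains "GPTBot"
  or http.user_agent contains "ClaudeBot"
)
```

The user-agent clauses catch crawlers that identify themselves honestly; the verified-bot clause catches the ones Cloudflare has already fingerprinted.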

The AI Site Identity Connection

The 43% reduction in violations for sites with AI discovery files is the most interesting finding from our testing. Why would having llms.txt and similar files reduce robots.txt violations?

One theory: AI systems that find structured, accessible identity information in dedicated files have less reason to crawl restricted areas. If your llms.txt clearly states your services, pricing approach, and target market, the AI system already has what it needs. It doesn't need to go hunting through blocked directories for context.
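For context, llms.txt is plain Markdown served from your site root. A minimal sketch, following the llmstxt.org format (an H1 title, a blockquote summary, then H2 sections of links) with entirely hypothetical business details:

```
# Example & Co Solicitors

> Manchester family law firm offering fixed-fee divorce, child
> arrangement, and financial settlement services across the North West.

## Services

- [Divorce](https://example.co.uk/services/divorce/): fixed-fee packages
- [Child arrangements](https://example.co.uk/services/children/): court and mediation support

## Contact

- [Enquiries](https://example.co.uk/contact/): phone and email details
```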

Think of it as giving AI what it wants through the front door, so it doesn't try the back door. If you're curious about why ChatGPT can't find your website in the first place, that article covers the broader visibility problem.

We've deployed AI discovery files across 60+ client sites over six months. The correlation between file deployment and reduced crawler violations has been consistent, though correlation isn't causation and the sample is limited to our own client base.

The free AI Visibility Checker can show you what AI systems currently understand about your business, and where your files need work. It's a good starting point regardless of which protection strategy you choose.

Frequently Asked Questions

Is robots.txt enough to protect my content from AI crawlers?

No. Our testing showed 72% of UK sites had robots.txt violations from AI crawlers. You need robots.txt combined with WAF rules for enforcement on sensitive paths, plus AI discovery files to give crawlers a legitimate alternative information source.

Which AI crawlers are the worst offenders?

Our testing found three specific crawlers accounted for 67% of all violations, with an average of 156 requests per site over three weeks. Cloudflare's Robots.txt tab shows this data per site, so you can see exactly which crawlers are violating your own rules.

How does AI Site Identity reduce crawler violations?

Sites with proper AI discovery files saw 43% fewer crawler violations in our testing. AI systems that find structured, accessible information in dedicated files have less reason to crawl restricted areas. You're giving AI what it needs through the front door.

What is the HTTP 402 "Payment Required" response for AI crawlers?

Cloudflare introduced this for paid plan customers in August 2025. Instead of blocking AI crawlers with 403 Forbidden, you can return 402 with your licensing contact information. This creates a direct communication channel with crawler operators about content licensing terms.

How can I check if AI crawlers are violating my robots.txt?

If you're on Cloudflare, go to AI Crawl Control, then the Robots.txt tab. It shows which crawlers are requesting blocked paths, how many times, and which directives they're ignoring. Without Cloudflare, check your server logs for user agents containing "bot", "crawler", or "GPT" requesting paths blocked in your robots.txt.
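Outside Cloudflare, that log check can be scripted. Below is a rough sketch in Python: it assumes combined-format access logs and a hand-maintained list of your Disallow prefixes (both placeholders here); a fuller version would parse your actual robots.txt rather than hard-coding paths.

```python
import re

# Placeholder inputs: point these at your own log file and Disallow paths
LOG_FILE = "access.log"
BLOCKED_PREFIXES = ["/client-portal/", "/internal-docs/"]
AI_AGENT_HINTS = ("bot", "crawler", "gpt", "claude")

# Combined log format: ... "METHOD /path HTTP/x" status size "referrer" "user-agent"
LINE_RE = re.compile(r'"\w+ (?P<path>\S+) HTTP/[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"')

violations = {}
with open(LOG_FILE, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        match = LINE_RE.search(line)
        if not match:
            continue
        path, agent = match.group("path"), match.group("agent")
        # Count requests from AI-looking user agents to blocked paths
        if any(hint in agent.lower() for hint in AI_AGENT_HINTS) and any(
            path.startswith(prefix) for prefix in BLOCKED_PREFIXES
        ):
            violations[agent] = violations.get(agent, 0) + 1

# Most persistent offenders first
for agent, count in sorted(violations.items(), key=lambda kv: -kv[1]):
    print(f"{count:6d}  {agent}")
```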

What's the difference between WAF rules and robots.txt?

robots.txt is a polite request that crawlers can ignore (and many do). WAF rules are actual enforcement: the request gets blocked at the server level before any content is accessed. Use WAF rules for sensitive content that absolutely cannot be crawled, and robots.txt plus AI discovery files for general guidance.

Does AI crawler access to blocked content create GDPR issues?

Potentially, yes. GDPR requires "appropriate technical and organisational measures" to protect personal data. If AI crawlers are accessing customer data despite your robots.txt blocks, relying only on robots.txt may not count as appropriate measures. Implement WAF rules, access controls, and AI discovery files to demonstrate reasonable protection efforts.

Protect Your WordPress Site from Rogue Crawlers

365i's managed WordPress hosting includes Cloudflare integration, WAF protection, and infrastructure built for both human visitors and AI crawlers. Pair it with proper AI discovery files for the best protection.

Explore Secure Hosting

Sources