Key take‑away: Sustainable defense begins with understanding why automated AI scrapers bombard sites, what they can do with what they take, and which layered counter‑measures best fit your risk profile, budget, and values. WAF mitigation is one option; it is not the only tool in your arsenal.
| Driver | What It Means for Site Owners |
|---|---|
| Data‑hungry model training | Frontier‑scale models still demand trillions of fresh tokens. Open‑source and closed labs race to widen their corpora faster than licensing can keep up. |
| Shadow‑copy competitors | Smaller players purchase or rent scraping APIs to clone niche knowledge bases, price catalogues, or feature descriptions. |
| Search‑engine disruption | Generative answers divert traffic away from origin sites, raising the incentive to ingest entire verticals for RAG pipelines. |
| Weak legal deterrents | Copyright doctrines around text‑and‑data mining remain unsettled in many jurisdictions, lowering the perceived cost of infringement. |
Scraping is no longer a random botnet pastime. It is now backed by venture money, GPU budgets, and service providers who sell scraping-as‑a‑service. Blocking requires more than rate‑limiting; it requires a multifaceted governance strategy.
On 12 May 2025 we noticed an unusual pattern in our alerts and logs for one of our customer websites:
Why we believe it was an AI scraper, not a classical DDoS:
We could tell this was a targeted content scrape, likely feeding AI model training or a competitive aggregator. If the site’s content later surfaces in an LLM’s answers with no attribution, our client’s brand voice dilutes and their SEO moat narrows, to say nothing of the data‑mining implications.
We’re software developers, but our mission doesn’t end with deliverables. We worry about our clients, many of whom make a living creating original content.
And we’re not alone: across the web, firms just like ours are being called in to rescue sites of every kind that have buckled under aggressive AI scraping.
This incident is just a small reflection of a global shift.
According to Cloudflare’s Q1 2025 DDoS Threat Report, automated abuse is rising across the Internet, much of it powered by AI-enhanced botnets. Here are the highlights:
An in-depth discussion of the report is in Episode 92 of This Week in NET by Cloudflare Engineering:
🎥 Watch: Cloudflare DDoS Report: Episode 92 of This Week in NET
While our incident wasn’t a DDoS, the scraping behavior we saw mirrored many of the same tactics. The automation no longer feels scrappy and brute-forced; it is optimized, trained, and harder to detect. These increasingly sophisticated attacks coincide with the dramatic advances in AI.
This highlights a broader pattern: Automation and AI are transforming even the threat landscape.
Now that AI vendors are routinely trawling web content to feed ever-growing large language models, a full spectrum of responses is taking shape.
Publishers, independent creators, and readers are pushing back to stop their words, images, and raw data from being used as training fuel for AI.
On the standards side, the Internet Engineering Task Force (IETF) has stepped in to calm the chaos. At last September’s workshop, the message was clear: the web needs a single, trustworthy signal that tells crawlers either “you may train on this document” or “hands off”, moving away from today’s messy patchwork of robots.txt conventions.
The AI Preferences (AIPREF) working group, newly chartered by the IETF, was formed with three deliverables:
While much of that standardization work is still in progress, we can put strategies in place now to push back against scraping:
1. Set Visibility & Baseline
2. Signal Our Policy & Ground Rules (see the robots.txt sketch after this list)
3. Add Technical Friction (Pick‑and‑Mix that suits your stack)
4. Use the Legal & Commercial Levers
5. Keep Multi-Vendor Resilience
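To make step 2 concrete, here is a minimal robots.txt sketch. The user-agent tokens below (GPTBot, CCBot, Google-Extended, anthropic-ai) are the ones these operators have published, but verify each vendor’s current documentation before relying on them:

```
# Opt major AI training crawlers out of the whole site.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Everyone else, including ordinary search indexing, stays welcome.
User-agent: *
Allow: /
```

As the checklist later in this piece notes, this only stops crawlers that choose to honor it; treat it as a policy signal, not a lock.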
| Question to Ask | Why It Matters |
|---|---|
| What PII is logged, and for how long? | GDPR/PDPA exposure and partner risk. |
| Can I self‑host my WAF rules? | Prevents lock‑in and lets the security team own its playbooks. |
| How are false positives surfaced? | Editorial and content‑creation sites cannot afford to punish real readers. |
| What is the company’s stance on data‑for‑training deals? | Confirms alignment with your own policy. |
Cloudflare, one of the most widely used bot-mitigation vendors, is a common choice. It offers Bot Fight Mode, which lures bots into wasting time on dead-end paths, plus a setting that blocks known AI bots and crawlers outright. It scores high on ease of use and integrated DDoS protection, but critics raise two recurring concerns:
A mature security posture acknowledges these trade‑offs and, where necessary, mitigates them.
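If you do adopt Cloudflare (or a similar edge WAF), here is a sketch of a custom rule in Cloudflare’s rule-expression syntax that blocks self-identified AI crawlers. The user-agent strings are examples drawn from commonly reported crawlers; substitute what your own logs show. A header match is trivially spoofable, so treat this as one layer, not the whole defense:

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "CCBot")
or (http.user_agent contains "Bytespider")
or (http.user_agent contains "anthropic-ai")
```

Pair the expression with a Block or Managed Challenge action in the dashboard.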
To get started, use the checklist below to audit your protection against bot attacks; code sketches for several of the rows follow the table.
| Action | Why It Matters |
|---|---|
| Use robots.txt to block AI crawlers | Only deters ethical crawlers that respect the protocol |
| Write WAF rules | Blocks stealthy or disguised bots |
| Deploy honeypot traps to reveal bad actors | Catches aggressive scrapers that ignore your signals |
| Monitor logs regularly | Surfaces abusive patterns so offending IPs can be blocklisted |
| Apply tooling at the edge (WAF, bot‑mitigation services) | Stops abusive traffic before it reaches your origin |
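For the honeypot row, here is a minimal Flask sketch (all paths, names, and thresholds are hypothetical placeholders). The trap URL is linked invisibly in the page and should be disallowed in robots.txt, so no human and no polite crawler ever requests it; anything that does gets blocklisted:

```python
import time
from flask import Flask, abort, request

app = Flask(__name__)
BLOCKLIST: dict[str, float] = {}   # ip -> unix time when flagged
BLOCK_TTL = 24 * 3600              # forget offenders after 24 h; tune to taste

@app.before_request
def drop_flagged_clients():
    flagged_at = BLOCKLIST.get(request.remote_addr)
    if flagged_at and time.time() - flagged_at < BLOCK_TTL:
        abort(403)

@app.route("/trap/full-archive-export/")   # hypothetical trap path
def honeypot():
    # Only clients that ignore robots.txt and follow hidden links land here.
    BLOCKLIST[request.remote_addr] = time.time()
    app.logger.warning("honeypot hit: %s UA=%s", request.remote_addr,
                       request.headers.get("User-Agent", "-"))
    abort(403)

@app.route("/")
def index():
    # Invisible to humans, but present in the DOM for crawlers to follow.
    return ('<a href="/trap/full-archive-export/" rel="nofollow" '
            'style="display:none" aria-hidden="true">archive</a>'
            'Regular page content here.')
```

In production you would persist the blocklist (e.g. in Redis) and derive the client IP from a trusted X-Forwarded-For header if you sit behind a proxy.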
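And for the log-monitoring row, a small script sketch (assuming the standard combined access-log format; the filename and threshold are placeholders) that surfaces the heaviest clients and the user agents they present:

```python
import re
import sys
from collections import Counter, defaultdict

# Combined log format: ip ident user [date] "request" status bytes "ref" "ua"
LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<req>[^"]*)" '
                  r'(?P<status>\d{3}) \S+ "(?P<ref>[^"]*)" "(?P<ua>[^"]*)"')

def scan(path: str, threshold: int = 1000) -> None:
    hits = Counter()                 # requests per source IP
    agents = defaultdict(Counter)    # user agents seen per IP
    with open(path, errors="replace") as fh:
        for line in fh:
            m = LINE.match(line)
            if not m:
                continue
            hits[m["ip"]] += 1
            agents[m["ip"]][m["ua"]] += 1
    # Print the loudest clients above the threshold as blocklist candidates.
    for ip, count in hits.most_common(20):
        if count < threshold:
            break
        top_ua, _ = agents[ip].most_common(1)[0]
        print(f"{ip}\t{count} requests\tUA: {top_ua[:80]}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "access.log")
```

A scraper that fans its requests across a residential-proxy pool will not trip a per-IP counter, which is exactly why the edge-tooling row still matters.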
Creators deserve a say in how their work trains tomorrow’s models. When scraping runs wild and unchecked, it erodes the very incentives that keep the open web alive, spiraling toward what Cory Doctorow famously dubbed the “enshittification” of the Internet.
The upside? You don’t have to wait for a standards committee to finish negotiating how scrapers may consume content. Engineering solutions are already at hand, and your defenses can start today.
Our security engineering group has helped media, healthcare, fintech, and civic‑tech platforms build multi‑layered guardrails against scraping without locking themselves into a single vendor. We can:
Let’s talk about a defense that respects your users’ privacy and your IP.