Common Crawl
CCBot
The backbone of modern AI: CCBot builds the vast, openly available web corpus used to train many major LLMs.
Purpose: Open web indexing for public datasets
📊 Popularity & Traffic
#1 ranking among AI crawlers
Over 80% of public AI datasets and benchmarks incorporate Common Crawl data.
🤖 User Agent Strings
Use these patterns to identify CCBot in your server logs or configure your robots.txt file.
CCBot
Respects robots.txt · Common Crawl's foundational bot
CCBot/2.0

🌐 IP Ranges
Source: Official Common Crawl JSON file
Official source file · Identified IP ranges: 4
18.97.9.168/29 (subnet with 8 addresses)
18.97.14.80/29 (subnet with 8 addresses)
18.97.14.88/30 (subnet with 4 addresses)
98.85.178.216/32 (subnet with 1 address)
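The four CIDR blocks above can be checked against incoming request IPs with the standard library alone. A minimal Python sketch (function and variable names are illustrative, not part of any Common Crawl tooling):

```python
import ipaddress

# The four CIDR blocks published by Common Crawl, as listed above.
CCBOT_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "18.97.9.168/29",
        "18.97.14.80/29",
        "18.97.14.88/30",
        "98.85.178.216/32",
    )
]

def is_ccbot_ip(ip: str) -> bool:
    """Return True if the address falls inside a published CCBot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CCBOT_RANGES)

print(is_ccbot_ip("18.97.9.170"))   # True: inside 18.97.9.168/29 (.168-.175)
print(is_ccbot_ip("203.0.113.5"))   # False: documentation address, not CCBot
```

Since the official JSON file is the source of truth, refresh the range list from it periodically rather than hard-coding it in production.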
How to read CIDR notation:
The suffix after the slash tells you how many addresses a block covers: a /29 spans 8 addresses, a /30 spans 4, and a /32 is a single address. For example, 18.97.9.168/29 covers every address from 18.97.9.168 through 18.97.9.175. Adding these ranges to your firewall will block all of the addresses used by CCBot.

📝 Robots.txt Configuration
Add the following to your robots.txt file to block CCBot:
User-agent: CCBot
Disallow: /

💡 Important Notes
- Fully respects robots.txt and honors crawl-delay directives
- Mission is to create a free, publicly accessible snapshot of the web
- Verify CCBot via reverse DNS: legitimate IPs resolve to *.crawl.commoncrawl.org
- Content disallowed via robots.txt is not crawled, so it never enters the public corpus
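The reverse-DNS check in the notes above can be sketched with Python's standard library. This is a minimal illustration, not an official verification tool; it assumes the *.crawl.commoncrawl.org hostname pattern mentioned above, and it forward-confirms the hostname so a spoofed PTR record alone does not pass:

```python
import socket

def verify_ccbot(ip: str) -> bool:
    """Two-step check: reverse DNS, then forward-confirm the hostname.

    A legitimate CCBot address should have a PTR record under
    crawl.commoncrawl.org, and that hostname should resolve back
    to the same IP.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:           # no PTR record / lookup failure
        return False
    if not hostname.endswith(".crawl.commoncrawl.org"):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, ip_list)
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

Combine this with the user-agent match: treat a request claiming to be CCBot but failing this check as an impersonator.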
Beyond blocking crawlers
See what AI is saying about your brand
Understanding crawlers is step one. With Aiso, you can see the actual conversations happening about your brand inside ChatGPT, Claude, and Perplexity.