Common Crawl

CCBot

A backbone of modern AI: CCBot builds Common Crawl's vast, openly available web corpus, which serves as a training-data source for many major LLMs.

Purpose: Open web indexing for public datasets

Quick Facts

Respects robots.txt: Yes
Last Updated: 2025-05
Official Documentation

📊 Popularity & Traffic

#1 Ranking among AI crawlers

Over 80% of public AI datasets and benchmarks incorporate Common Crawl data.

🤖 User Agent Strings

Use these patterns to identify CCBot in your server logs or configure your robots.txt file.

CCBot

Respects robots.txt

Common Crawl's foundational bot

CCBot/2.0
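To act on log lines rather than eyeball them, the pattern above can be matched programmatically. A minimal sketch in Python, assuming the `User-Agent` header value is already extracted from the log line (the parenthesized URL portion shown in the example is a typical but not guaranteed form):

```python
import re

# Match the CCBot product token followed by a version, e.g. "CCBot/2.0".
# Real headers may carry extra detail after the token.
CCBOT_RE = re.compile(r"\bCCBot/\d+(?:\.\d+)*")

def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent string identifies CCBot."""
    return bool(CCBOT_RE.search(user_agent))
```

Note that user-agent strings are trivially spoofable; pair this check with the IP-range or reverse-DNS verification described below.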

🌐 IP Ranges

Source: Official Common Crawl JSON file

Identified IP Ranges (4 ranges)

18.97.9.168/29
Subnet with 8 addresses
18.97.14.80/29
Subnet with 8 addresses
18.97.14.88/30
Subnet with 4 addresses
98.85.178.216/32
Subnet with 1 address

How to read CIDR notation:

The suffix gives the number of fixed network bits: a /29 leaves 3 host bits, so it covers 2^3 = 8 addresses. For example, 18.97.9.168/29 covers all addresses from .168 through .175. Adding these four ranges to your firewall's blocklist will block the entire address space used by CCBot.
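The arithmetic above can be checked, and incoming IPs tested against the listed ranges, with Python's standard-library `ipaddress` module. A minimal sketch using the four ranges from this page:

```python
import ipaddress

# The four CCBot ranges listed above.
CCBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "18.97.9.168/29",
    "18.97.14.80/29",
    "18.97.14.88/30",
    "98.85.178.216/32",
)]

def in_ccbot_range(ip: str) -> bool:
    """Return True if the given IP falls inside any published CCBot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CCBOT_RANGES)

# A /29 has 3 host bits, hence 2**3 = 8 addresses.
assert ipaddress.ip_network("18.97.9.168/29").num_addresses == 8
```

Since the published ranges can change, re-fetch the official source file periodically rather than hard-coding them long-term.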

📝 Robots.txt Configuration

Add the following to your robots.txt file to block CCBot:

User-agent: CCBot
Disallow: /
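If you would rather limit CCBot than block it outright, a partial policy also works, since CCBot honors crawl-delay directives. A sketch with placeholder paths (substitute your own):

```
# Keep CCBot out of a private area only, and slow it down elsewhere
User-agent: CCBot
Disallow: /private/
Crawl-delay: 2
```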

💡 Important Notes

  • Fully respects robots.txt and honors crawl-delay directives
  • Mission is to create a free, publicly accessible snapshot of the web
  • Verify CCBot via reverse DNS - legitimate IPs resolve to *.crawl.commoncrawl.org
  • Does not store content from paths disallowed in robots.txt
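The reverse-DNS check in the notes above can be scripted. A minimal sketch using Python's standard library; note that for full confidence you should also forward-resolve the returned hostname and confirm it maps back to the original IP, which this sketch omits:

```python
import socket

def is_commoncrawl_host(hostname: str) -> bool:
    """True if the hostname is under crawl.commoncrawl.org."""
    return hostname.rstrip(".").endswith(".crawl.commoncrawl.org")

def verify_ccbot_ip(ip: str) -> bool:
    """Reverse-resolve an IP and check it is a legitimate CCBot host.

    Returns False if the lookup fails or the hostname does not match.
    """
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return False
    return is_commoncrawl_host(hostname)
```

The suffix check guards against lookalike names such as `crawl.commoncrawl.org.evil.com`, which would pass a naive substring test.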