Common Crawl

CCBot

A backbone of modern AI: CCBot builds Common Crawl's vast, openly available web corpus, which serves as a training-data source for many major LLMs.

Purpose: Open web indexing for public datasets

Quick Facts

Respects robots.txt: Yes
Last Updated: 2025-05
Official Documentation

📊 Popularity & Traffic

#1 Ranking among AI crawlers

Over 80% of public AI datasets and benchmarks incorporate Common Crawl data.

🤖 User Agent Strings

Use these patterns to identify CCBot in your server logs or configure your robots.txt file.

CCBot

Respects robots.txt

Common Crawl's foundational bot

CCBot/2.0
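To act on log lines rather than eyeball them, the pattern above can be matched programmatically. A minimal sketch in Python, assuming the `User-Agent` header value is already extracted from the log line (the parenthesized URL portion shown in the example is a typical but not guaranteed form):

```python
import re

# Match the CCBot product token followed by a version, e.g. "CCBot/2.0".
# Real headers may carry extra detail after the token.
CCBOT_RE = re.compile(r"\bCCBot/\d+(?:\.\d+)*")

def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent string identifies CCBot."""
    return bool(CCBOT_RE.search(user_agent))
```

Note that user-agent strings are trivially spoofable; pair this check with the IP-range or reverse-DNS verification described below.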

🌐 IP Ranges

Source: Official Common Crawl JSON file

Identified IP Ranges (4 ranges)

18.97.9.168/29
Subnet with 8 addresses
18.97.14.80/29
Subnet with 8 addresses
18.97.14.88/30
Subnet with 4 addresses
98.85.178.216/32
Subnet with 1 address

How to read CIDR notation:

The suffix gives the number of fixed network bits: a /29 leaves 3 host bits, so it covers 2^3 = 8 addresses. For example, 18.97.9.168/29 covers all addresses from .168 through .175. Adding these four ranges to your firewall's blocklist will block the entire address space used by CCBot.
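The arithmetic above can be checked, and incoming IPs tested against the listed ranges, with Python's standard-library `ipaddress` module. A minimal sketch using the four ranges from this page:

```python
import ipaddress

# The four CCBot ranges listed above.
CCBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "18.97.9.168/29",
    "18.97.14.80/29",
    "18.97.14.88/30",
    "98.85.178.216/32",
)]

def in_ccbot_range(ip: str) -> bool:
    """Return True if the given IP falls inside any published CCBot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CCBOT_RANGES)

# A /29 has 3 host bits, hence 2**3 = 8 addresses.
assert ipaddress.ip_network("18.97.9.168/29").num_addresses == 8
```

Since the published ranges can change, re-fetch the official source file periodically rather than hard-coding them long-term.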

📝 Robots.txt Configuration

Add the following to your robots.txt file to block CCBot:

User-agent: CCBot
Disallow: /
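If you would rather limit CCBot than block it outright, a partial policy also works, since CCBot honors crawl-delay directives. A sketch with placeholder paths (substitute your own):

```
# Keep CCBot out of a private area only, and slow it down elsewhere
User-agent: CCBot
Disallow: /private/
Crawl-delay: 2
```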

💡 Important Notes

  • Fully respects robots.txt and honors crawl-delay directives
  • Mission is to create a free, publicly accessible snapshot of the web
  • Verify CCBot via reverse DNS - legitimate IPs resolve to *.crawl.commoncrawl.org
  • Does not store content from paths disallowed in robots.txt
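The reverse-DNS check in the notes above can be scripted. A minimal sketch using Python's standard library; note that for full confidence you should also forward-resolve the returned hostname and confirm it maps back to the original IP, which this sketch omits:

```python
import socket

def is_commoncrawl_host(hostname: str) -> bool:
    """True if the hostname is under crawl.commoncrawl.org."""
    return hostname.rstrip(".").endswith(".crawl.commoncrawl.org")

def verify_ccbot_ip(ip: str) -> bool:
    """Reverse-resolve an IP and check it is a legitimate CCBot host.

    Returns False if the lookup fails or the hostname does not match.
    """
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return False
    return is_commoncrawl_host(hostname)
```

The suffix check guards against lookalike names such as `crawl.commoncrawl.org.evil.com`, which would pass a naive substring test.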