Common Crawl
Category: Protocols & Standards
Definition
Common Crawl is a massive, freely available web crawl corpus, maintained by a nonprofit of the same name, that serves as a standard data source for training large language models and analyzing web content.
How It Works
Common Crawl regularly archives billions of web pages in the standardized WARC format. Each crawl includes the raw HTTP responses (WARC files), extracted plain text (WET files), and per-page metadata (WAT files).
The data is hosted on AWS S3 as a public dataset, with new crawls released roughly monthly and historical crawls dating back to 2008.
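As a rough sketch of how this data can be accessed, the snippet below queries the public CDX index for a URL and then fetches the matching WARC record with an HTTP range request. The crawl ID, the example URL, and the `requests` and `warcio` dependencies are illustrative assumptions, not part of the original text.

```python
# Minimal sketch: look up one page in a Common Crawl index and read its WARC record.
# Assumes `requests` and `warcio` are installed and that the crawl ID below exists
# (substitute any current crawl ID from the Common Crawl site).
import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL_ID = "CC-MAIN-2024-10"  # assumed crawl ID for illustration
INDEX_API = f"https://index.commoncrawl.org/{CRAWL_ID}-index"
DATA_HOST = "https://data.commoncrawl.org"

# 1. Ask the CDX index where a URL was captured (newline-delimited JSON).
resp = requests.get(INDEX_API, params={"url": "example.com", "output": "json"})
record = json.loads(resp.text.splitlines()[0])  # first capture of the URL

# 2. Fetch just that record from its WARC file via an HTTP range request.
start = int(record["offset"])
end = start + int(record["length"]) - 1
warc_resp = requests.get(
    f"{DATA_HOST}/{record['filename']}",
    headers={"Range": f"bytes={start}-{end}"},
)

# 3. Parse the gzipped WARC record and print the start of the raw HTML payload.
for warc_record in ArchiveIterator(io.BytesIO(warc_resp.content)):
    if warc_record.rec_type == "response":
        html = warc_record.content_stream().read()
        print(html[:200])
```

Because the index stores the byte offset and length of every record, a single page can be retrieved without downloading the multi-gigabyte WARC file it lives in.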
Why It Matters
Common Crawl democratizes access to web-scale data, enabling researchers and startups to train competitive AI models. Most major language models use Common Crawl as a primary data source.
The standardized format and free availability have made it foundational infrastructure for NLP research.