Data Sources for Pre-Training LLMs

The largest sources of available training data include:

  1. Common Crawl comprises petabytes of raw web data extracted from billions of web pages. It is a significant source for training LLMs: models like GPT-3, LLaMA, OpenLLaMA, and T5 were trained on Common Crawl data, usually after heavy filtering (e.g., T5 used C4, a cleaned Common Crawl subset).
  2. RefinedWeb is a massive corpus of deduplicated and filtered tokens extracted from Common Crawl, containing over 5 trillion tokens of textual data, of which 600 billion are publicly available. It was developed to train the Falcon-40B model on a smaller but higher-quality dataset.
  3. The Pile is an 800 GB corpus that incorporates a wide variety of subjects, writing styles, and domains, including data from scientific articles, books, web pages, and other text sources.
  4. GitHub is a large cloud-based platform that manages and stores developers' code. It has been used as a source for training LLMs, given its vast amount of data.
  5. Wikipedia contains a massive amount of information across various domains and is used for training LLMs.
  6. Books Corpus (BookCorpus) comprises thousands of free, long-form books and has been used for training LLMs such as GPT and BERT.
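A recurring theme across these sources, and the core of RefinedWeb in particular, is filtering and deduplication of raw web text. As a minimal sketch (the function names and thresholds here are illustrative, not taken from any of these datasets' actual tooling), exact deduplication can be done by hashing a normalized form of each document:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase, so trivially different
    # copies of the same page hash to the same value.
    return " ".join(text.lower().split())

def quality_filter(text: str, min_words: int = 3) -> bool:
    # Toy length heuristic; real pipelines use many more signals.
    return len(text.split()) >= min_words

def dedup_and_filter(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if not quality_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)  # first occurrence wins
    return kept

docs = [
    "The  quick brown fox.",
    "the quick brown fox.",  # duplicate after normalization
    "ok",                    # too short, filtered out
    "A different document entirely.",
]
print(dedup_and_filter(docs))
```

Production pipelines such as RefinedWeb's go much further (fuzzy deduplication with MinHash, language identification, URL filtering), but the shape is the same: hash, check, keep the first copy.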
