Data Sources for Pre-Training LLMs

The largest sources of available training data include:

  1. Common Crawl comprises petabytes of raw web data extracted from billions of web pages. It is a significant source for training LLMs: models like GPT-3, LLaMA, OpenLLaMA, and T5 were trained on Common Crawl data, usually after heavy filtering (e.g., T5 used C4, a cleaned Common Crawl subset).
  2. RefinedWeb is a massive corpus of deduplicated and filtered tokens extracted from Common Crawl, containing over 5 trillion tokens of textual data, of which 600 billion are publicly available. It was developed to train the Falcon-40B model on a smaller but higher-quality dataset.
  3. The Pile is an 800 GB corpus that incorporates a wide variety of subjects, writing styles, and domains, including data from scientific articles, books, web pages, and other text sources.
  4. GitHub is a large cloud-based platform that manages and stores developers' code. It has been used as a source for training LLMs, given its vast amount of data.
  5. Wikipedia contains a massive amount of information across various domains and is used for training LLMs.
  6. Books Corpus (BookCorpus) comprises thousands of free, long-form books and has been used for training LLMs such as GPT and BERT.
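A recurring theme across these sources, and the core of RefinedWeb in particular, is filtering and deduplication of raw web text. As a minimal sketch (the function names and thresholds here are illustrative, not taken from any of these datasets' actual tooling), exact deduplication can be done by hashing a normalized form of each document:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase, so trivially different
    # copies of the same page hash to the same value.
    return " ".join(text.lower().split())

def quality_filter(text: str, min_words: int = 3) -> bool:
    # Toy length heuristic; real pipelines use many more signals.
    return len(text.split()) >= min_words

def dedup_and_filter(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if not quality_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)  # first occurrence wins
    return kept

docs = [
    "The  quick brown fox.",
    "the quick brown fox.",  # duplicate after normalization
    "ok",                    # too short, filtered out
    "A different document entirely.",
]
print(dedup_and_filter(docs))
```

Production pipelines such as RefinedWeb's go much further (fuzzy deduplication with MinHash, language identification, URL filtering), but the shape is the same: hash, check, keep the first copy.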
