The data for a RAG pipeline has to be accurate and comprehensive. But in contrast to fine-tuning methods, the retrieval data isn’t used to train the model, which means it doesn’t have to be supplied in such large quantities.
To optimise the retrieval process, the documents are first pre-processed, i.e. transformed into a data format that can be searched efficiently.
This typically involves extracting text from the documents, attaching metadata, tokenizing the text, and creating vectors (embeddings) from the tokens.
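The four pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the names `Chunk`, `preprocess`, `tokenize`, `vectorize` and `cosine` are hypothetical, the "extraction" step only normalises whitespace in text that has already been read, and the vectorizer is a simple hashed bag-of-words that stands in for a learned embedding model.

```python
import hashlib
import math
import re
from dataclasses import dataclass, field

DIM = 64  # illustrative vector size (assumption); real embedding models use far more dimensions


@dataclass
class Chunk:
    text: str
    metadata: dict
    tokens: list = field(default_factory=list)
    vector: list = field(default_factory=list)


def tokenize(text):
    # Lowercase word tokens; real pipelines usually use a subword tokenizer instead.
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, dim=DIM):
    # Hashed bag-of-words, scaled to unit length. A toy stand-in for an embedding model.
    vec = [0.0] * dim
    for tok in tokens:
        idx = int(hashlib.sha256(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def preprocess(raw_text, source):
    # 1. Extract text (here: just collapse runs of whitespace).
    text = " ".join(raw_text.split())
    # 2. Attach metadata describing where the text came from.
    chunk = Chunk(text=text, metadata={"source": source, "chars": len(text)})
    # 3. Tokenize the text.
    chunk.tokens = tokenize(chunk.text)
    # 4. Create a vector from the tokens.
    chunk.vector = vectorize(chunk.tokens)
    return chunk


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    doc = preprocess("  Retrieval-augmented   generation\nuses external documents. ", "intro.txt")
    query = vectorize(tokenize("external documents"))
    print(doc.metadata["source"], len(doc.tokens), round(cosine(doc.vector, query), 3))
```

At query time the same `tokenize` and `vectorize` functions are applied to the question, and the stored chunks are ranked by similarity to it, which is why the documents are converted to vectors up front.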