Evaluation Frameworks

‘How to measure intelligence’.

Evaluation frameworks provide a structured approach to assessing LLMs across a range of dimensions, helping to ensure that models meet the required standards for accuracy and safety.

Notable frameworks include:

  • The General Language Understanding Evaluation (GLUE) Benchmark is a collection of resources used to train, evaluate, and analyse natural language understanding systems. It provides a composite metric across nine tasks to assess a model's language capabilities.
  • SuperGLUE is an evolution of the GLUE benchmark, designed to be more difficult and sophisticated. It focuses on a set of more complex language understanding tasks and offers a public leaderboard for model evaluation.
  • The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is either a segment of text from the corresponding reading passage, or the question is unanswerable. To perform well, systems must not only answer questions when possible but also determine when no answer is supported by the passage and abstain from answering.
  • Hugging Face provides a vast library of NLP datasets and evaluation metrics for various tasks, including text classification, summarization, and translation; the first sketch after this list shows how its evaluate library can load and run such metrics.
  • BLEU (Bilingual Evaluation Understudy) is an algorithm used to evaluate the quality of machine-translated text by comparing it to human translations. BLEU offers a standardized and objective way to evaluate LLM performance on tasks that involve generating text outputs from input prompts or contexts.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to compare the quality of generated text against reference text. ROUGE calculates precision (overlapping n-grams divided by the total n-grams in the machine-generated text) and recall (overlapping n-grams divided by the total n-grams in the reference text). The F1 score, the harmonic mean of precision and recall, provides a single figure for the quality of the generated text; the second sketch after this list works through this calculation.
  • The LabelledRagDataset is designed to test RAG systems by predicting responses to queries and comparing them with reference answers and contexts. Each example in the dataset includes a query, a reference (ground-truth) answer, and the reference contexts used to generate that answer.
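
Several of the metrics above can be computed with the Hugging Face evaluate library. The snippet below is a minimal sketch, assuming the evaluate package (and the metric dependencies it pulls in at load time) is installed; the exact score keys and types can vary between versions.

```python
import evaluate  # pip install evaluate

# One model output and one or more human-written references per example.
predictions = ["the quick brown fox jumps over the lazy dog"]
references = [["the quick brown fox jumped over the lazy dog"]]

bleu = evaluate.load("bleu")    # n-gram precision with a brevity penalty
rouge = evaluate.load("rouge")  # recall-oriented n-gram overlap

bleu_scores = bleu.compute(predictions=predictions, references=references)
rouge_scores = rouge.compute(predictions=predictions, references=references)

print(bleu_scores["bleu"])     # corpus-level BLEU score in [0, 1]
print(rouge_scores["rougeL"])  # longest-common-subsequence ROUGE variant
```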

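To make the ROUGE arithmetic concrete, here is a small self-contained ROUGE-N style calculation. The function name rouge_n and the whitespace tokenisation are illustrative assumptions, not part of the official ROUGE tooling.

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1):
    """N-gram precision, recall, and F1 between a candidate and a reference text."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()  # naive whitespace tokenisation, for illustration only
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())              # clipped count of shared n-grams
    precision = overlap / max(sum(cand.values()), 1)  # shared n-grams / n-grams in candidate
    recall = overlap / max(sum(ref.values()), 1)      # shared n-grams / n-grams in reference
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 5 of the 6 candidate unigrams are matched in the reference and vice versa,
# so precision = recall = F1 = 5/6 ≈ 0.83.
print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))
```
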
Sources

  1. https://www.lakera.ai/blog/large-language-model-evaluation
  2. https://www.deeplearning.ai/resources/natural-language-processing/
  3. https://rajpurkar.github.io/SQuAD-explorer/
  4. https://huggingface.co/spaces/evaluate-metric/bleu