
Evaluation & Benchmarks

How we measure whether AI actually works: the science, and the real difficulty, of knowing whether a model is any good.

Evaluation & Benchmarks is one of the core areas in the AI University map of AI. Explore the diagram, then dive into each topic. Every subtopic grows into its own deep-dive over time.

flowchart LR
  M[Model] --> BENCH[Benchmarks] --> SCORE[Scores]
  M --> HUMAN[Human ratings] --> SCORE
  M --> JUDGE[LLM-as-judge] --> SCORE
  SCORE --> DECIDE{Ship or iterate?}

Key topics

  • Metrics
    Accuracy, precision/recall, F1, BLEU/ROUGE, perplexity, and when each misleads.

  • Benchmarks & leaderboards
    MMLU, GSM8K, HumanEval, MMMU and friends: standardized tests, and how they get gamed.

  • LLM-as-judge
    Using strong models to grade outputs, with their biases and calibration issues.

  • Human evaluation
    Preference ratings, head-to-head arenas (Elo), and inter-annotator agreement.

  • Red-teaming & safety evals
    Probing for harmful, jailbroken, or unsafe behavior before release.

  • Contamination & validity
    Test-set leakage, overfitting to benchmarks, and building evals you can trust.
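
As a small illustration of the Metrics topic and of how a metric can mislead, the sketch below computes accuracy, precision, recall, and F1 for binary labels and compares two predictors on an imbalanced dataset. Everything here is hypothetical: the `binary_metrics` helper, the made-up labels, and the convention of returning 0.0 when a ratio is undefined are assumptions for illustration, not part of any particular benchmark or library.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true: list[int], y_pred: list[int]) -> BinaryMetrics:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = positive)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Made-up, imbalanced data: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Predictor A never predicts the positive class.
always_negative = [0] * 100

# Predictor B raises 2 false alarms and finds 4 of the 5 positives.
finds_positives = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1

baseline = binary_metrics(y_true, always_negative)
model = binary_metrics(y_true, finds_positives)

print(f"always negative: acc={baseline.accuracy:.2f} f1={baseline.f1:.2f}")
print(f"finds positives: acc={model.accuracy:.2f} f1={model.f1:.2f}")
```

On this toy data the predictor that never flags a positive still reaches 95% accuracy while its recall and F1 are zero, which is the kind of gap that makes accuracy alone a poor summary when classes are imbalanced.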


