SWE-bench: Evaluating AI on Real-World Software Engineering
Tags: Benchmarking, Software engineering, Code generation, LLM, GenAI, Agents
SWE-bench is a benchmark designed to evaluate whether AI models can solve real-world software engineering tasks. Rather than testing code generation in isolation, SWE-bench…
2026-02-27