When Leaderboards Mislead: Annotation Errors in Text-to-SQL

Leaderboards drive Text-to-SQL progress—but what if the test sets are wrong? This study audits two popular benchmarks and finds widespread annotation errors that can flip who looks best.

  • Error rates: 52.8% in BIRD Mini-Dev, 62.8% in Spider 2.0-Snow.
  • After correcting a BIRD Dev subset, open-source agents saw relative performance shifts from -7% to +31% and rank changes of -9 to +9.
  • Rankings on the uncorrected subset correlate strongly with the full Dev set (Spearman r_s = 0.85) but only weakly after correction (r_s = 0.32), evidence that the errors distort results; a minimal sketch of this rank comparison follows the list.
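
For readers who want to run this kind of check themselves, here is a minimal Python sketch (not the paper's evaluation code; the system names and scores are hypothetical) that computes the Spearman rank correlation between two leaderboard orderings using scipy.stats.spearmanr:

```python
# Minimal sketch: compare two leaderboard orderings with Spearman rank
# correlation. All system names and accuracy values are made up for
# illustration; substitute your own leaderboard scores.
from scipy.stats import spearmanr

# Hypothetical accuracies for the same systems on the uncorrected and
# corrected versions of a benchmark subset.
uncorrected = {"sys_a": 0.61, "sys_b": 0.58, "sys_c": 0.55, "sys_d": 0.49}
corrected = {"sys_a": 0.52, "sys_b": 0.60, "sys_c": 0.57, "sys_d": 0.50}

systems = sorted(uncorrected)              # fixed system order
x = [uncorrected[s] for s in systems]
y = [corrected[s] for s in systems]

r_s, p_value = spearmanr(x, y)             # rank correlation of the two orderings
print(f"Spearman r_s = {r_s:.2f} (p = {p_value:.3f})")
```

A low r_s between the uncorrected and corrected orderings means the error corrections are reshuffling ranks rather than shifting every system equally.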

Why it matters: Companies and researchers use these leaderboards to pick systems for real deployments. If annotations are wrong, we may reward shortcuts and overlook robust methods.

Takeaway: Treat Text-to-SQL leaderboards with caution. Audit and correct annotations, report uncertainty, and prefer evaluations with verified labels.
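
As a concrete starting point for such an audit, here is a hedged Python sketch, assuming a local SQLite copy of a benchmark database and a list of (question, gold_sql) pairs. It is illustrative only and is not the paper's correction procedure: it merely flags gold queries that fail to execute or return nothing, cheap signals worth manual review.

```python
# Hedged sketch of a first-pass annotation audit. Assumes a SQLite copy of
# the benchmark database; flagged items are candidates for manual review,
# not proof of an annotation error.
import sqlite3

def audit_gold_sql(db_path, examples):
    """Return examples whose gold SQL errors out or yields an empty result."""
    suspicious = []
    conn = sqlite3.connect(db_path)
    try:
        for question, gold_sql in examples:
            try:
                rows = conn.execute(gold_sql).fetchall()
                if not rows:
                    suspicious.append((question, gold_sql, "empty result"))
            except sqlite3.Error as exc:
                suspicious.append((question, gold_sql, f"execution error: {exc}"))
    finally:
        conn.close()
    return suspicious

# Example usage with hypothetical inputs:
# flagged = audit_gold_sql(
#     "california_schools.sqlite",
#     [("How many schools are charter schools?",
#       "SELECT COUNT(*) FROM schools WHERE charter = 1")],
# )
# for question, sql, reason in flagged:
#     print(reason, "|", question)
```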

Paper: https://arxiv.org/abs/2601.08778v1 • Code/data: https://github.com/uiuc-kang-lab/text_to_sql_benchmarks

Register: https://www.AiFeta.com

#AI #NLP #TextToSQL #Databases #MachineLearning #Benchmarking #Reproducibility #Leaderboards