When Leaderboards Mislead: Annotation Errors in Text-to-SQL
Leaderboards drive Text-to-SQL progress, but what if the test sets are wrong? This study audits two popular benchmarks and finds widespread annotation errors that can flip who looks best.
- Annotation error rates: 52.8% of examples in BIRD Mini-Dev, 62.8% in Spider 2.0-Snow.
- After correcting a BIRD Dev subset, open-source agents saw relative performance shifts from -7% to +31% and rank changes of -9 to +9.
- Rankings on the uncorrected subset correlate strongly with rankings on the full Dev set (Spearman r_s = 0.85) but only weakly after correction (r_s = 0.32), evidence that the errors distort results.
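To make the ranking comparison concrete, here is a minimal sketch (not the paper's code) of how Spearman's rank correlation quantifies agreement between two leaderboards; the agent names and scores are hypothetical placeholders.

```python
# Sketch: compare two leaderboards with Spearman's rank correlation.
# Agent names and accuracy numbers below are illustrative, not from the paper.
from scipy.stats import spearmanr

uncorrected = {"agent_a": 61.2, "agent_b": 58.7, "agent_c": 55.1, "agent_d": 49.9}
corrected   = {"agent_a": 54.0, "agent_b": 60.3, "agent_c": 48.2, "agent_d": 57.5}

agents = sorted(uncorrected)
r_s, p_value = spearmanr([uncorrected[a] for a in agents],
                         [corrected[a] for a in agents])
print(f"Spearman r_s = {r_s:.2f} (p = {p_value:.3f})")
# A low r_s (like the 0.32 reported after correction) means the corrected
# labels rank systems very differently from the original benchmark.
```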
Why it matters: Companies and researchers use these leaderboards to pick systems for real deployments. If annotations are wrong, we may reward shortcuts and overlook robust methods.
Takeaway: Treat Text-to-SQL leaderboards with caution. Audit and correct annotations, report uncertainty, and prefer evaluations with verified labels.
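On the "report uncertainty" point, one simple option (an assumption, not the paper's procedure) is a bootstrap confidence interval over per-example execution-match outcomes, as in this sketch with hypothetical 0/1 results:

```python
# Sketch: bootstrap a 95% CI for Text-to-SQL execution accuracy.
# per_example_correct is a hypothetical vector of 0/1 outcomes
# (1 = predicted SQL matched the gold result, 0 = it did not).
import numpy as np

rng = np.random.default_rng(0)
per_example_correct = rng.integers(0, 2, size=500)  # placeholder data

boot_means = [
    rng.choice(per_example_correct, size=per_example_correct.size, replace=True).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"execution accuracy = {per_example_correct.mean():.1%} "
      f"(95% bootstrap CI: {lo:.1%} to {hi:.1%})")
```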
Paper: https://arxiv.org/abs/2601.08778v1 • Code/data: https://github.com/uiuc-kang-lab/text_to_sql_benchmarks
Register: https://www.AiFeta.com
#AI #NLP #TextToSQL #Databases #MachineLearning #Benchmarking #Reproducibility #Leaderboards