When Leaderboards Mislead: Annotation Errors in Text-to-SQL
Leaderboards drive Text-to-SQL progress, but what if the test sets are wrong? This study audits two popular benchmarks and finds widespread annotation errors that can flip who looks best.

* Error rates: 52.8% in BIRD Mini-Dev, 62.8% in Spider 2.0-Snow.
* After correcting a BIRD Dev subset, open-source agents