Scaling Generalist Data-Analytic Agents
DataMind: a scalable recipe and dataset to train open, code-capable data-analytic agents.
Open-source data-analytic agents lag behind proprietary systems due to data scarcity, brittle training, and unstable multi-turn code execution. DataMind addresses these gaps with a full-stack recipe (data synthesis, curriculum, filtering, training objectives, and rollout) for training generalist agents that parse diverse data formats and reason over long horizons.
Key ingredients:
- Task synthesis at scale: A fine-grained taxonomy with recursive easy-to-hard composition expands diversity and difficulty of analytical queries.
- Quality trajectories: Knowledge-augmented sampling followed by model- and rule-based filtering yields clean, instructive demonstrations.
- Balanced objectives: A dynamically adjustable blend of SFT and RL stabilizes learning while pushing capability.
- Stable multi-turn code rollout: Memory-frugal execution improves reliability in code-based tool use.
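The dynamically adjustable SFT/RL blend above can be sketched as a step-scheduled weighted sum of the two losses. This is a minimal illustration under assumed choices: the linear schedule and the names `anneal_weight` and `blended_loss` are hypothetical, not the paper's exact objective.

```python
# Illustrative sketch of a dynamically adjusted SFT + RL objective.
# The linear schedule and function names are assumptions for exposition,
# not DataMind's actual training code.

def anneal_weight(step: int, total_steps: int,
                  start: float = 0.9, end: float = 0.1) -> float:
    """Linearly decay the SFT weight from `start` to `end` over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def blended_loss(sft_loss: float, rl_loss: float,
                 step: int, total_steps: int) -> float:
    """Combine supervised and RL losses with a step-dependent weight,
    leaning on demonstrations early and on the RL signal later."""
    alpha = anneal_weight(step, total_steps)
    return alpha * sft_loss + (1.0 - alpha) * rl_loss
```

Early in training, `alpha` near 0.9 keeps updates anchored to clean demonstrations; as it decays, the RL term dominates and pushes capability beyond the demonstration distribution, which is one plausible way to realize the "stabilize while pushing" trade-off the bullet describes.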
Deliverables include DataMind-12K, a high-quality trajectory set spanning domains, task types, and file formats. Models trained on it achieve standout results: DataMind-14B reaches 71.16% average across data-analysis benchmarks, outperforming strong proprietary baselines (e.g., DeepSeek‑V3.1 and GPT‑5), while DataMind-7B leads among open models at 68.10%. The authors will release DataMind-12K and model checkpoints to accelerate community progress.
Why it matters: Real-world analytics demands long-horizon reasoning, tool use, and robustness across messy data, which is precisely what DataMind targets. With an open, reproducible pipeline, it offers a practical path to scalable, capable, and transparent analytic agents.
Paper: arXiv: DataMind
Register: https://www.AiFeta.com
#Agents #DataAnalytics #CodeAgents #SFT #ReinforcementLearning #OpenSource #ToolUse #LLM