Scaling Generalist Data-Analytic Agents

DataMind: a scalable recipe and dataset to train open, code-capable data-analytic agents.

Open-source data-analytic agents lag behind proprietary systems due to data scarcity, brittle training, and unstable multi-turn code execution. DataMind addresses these gaps with a full-stack recipe—data synthesis, curriculum, filtering, objectives, and rollout—to train generalist agents that parse diverse data formats and reason across long horizons.

Key ingredients (illustrative sketches follow the list):

  • Task synthesis at scale: A fine-grained task taxonomy with recursive easy-to-hard composition expands the diversity and difficulty of analytical queries.
  • Quality trajectories: Knowledge-augmented sampling followed by model- and rule-based filtering yields clean, instructive demonstrations.
  • Balanced objectives: A dynamically adjustable blend of SFT and RL stabilizes learning while pushing capability.
  • Stable multi-turn code rollout: Memory-frugal execution improves reliability in code-based tool use.

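To make recursive easy-to-hard composition concrete, here is a minimal Python sketch: atomic query templates from a hypothetical taxonomy are chained so that each extra level of recursion nests a simpler sub-task inside a harder one. The taxonomy labels, templates, and composition rule are assumptions for illustration, not the paper's actual pipeline.

```python
import random

# Hypothetical atomic query templates, keyed by a coarse taxonomy label.
# (Illustrative only; not the paper's actual taxonomy or templates.)
ATOMIC_TASKS = {
    "filter":    "select rows of {table} where {condition}",
    "aggregate": "compute the {stat} of {column} in {table}",
    "join":      "join {table} with {other_table} on {key}",
    "trend":     "describe how {column} changes over {time_column}",
}

def compose_task(depth: int, rng: random.Random) -> str:
    """Depth 0 yields an atomic task; each extra level chains another
    sub-task, so difficulty grows with recursion depth."""
    kind, template = rng.choice(list(ATOMIC_TASKS.items()))
    if depth == 0:
        return f"[{kind}] {template}"
    sub_task = compose_task(depth - 1, rng)  # easier sub-question first
    return f"{sub_task}; then, using that result, [{kind}] {template}"

if __name__ == "__main__":
    rng = random.Random(0)
    for depth in range(3):  # easy -> hard
        print(f"depth {depth}: {compose_task(depth, rng)}")
```
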
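For the filtering step, a rough sketch of the rule-based half (before any model-based judging): drop sampled trajectories with empty answers, execution errors, or runaway turn counts. The field names and thresholds below are invented for illustration.

```python
def keep_trajectory(traj: dict) -> bool:
    """Rule-based pass over a sampled trajectory; survivors would then be
    scored by a model-based judge. Field names/thresholds are illustrative."""
    steps = traj.get("steps", [])
    if not traj.get("final_answer", "").strip():
        return False                      # no usable answer
    if any(step.get("execution_error") for step in steps):
        return False                      # broken code along the way
    if not steps or len(steps) > 20:
        return False                      # empty or runaway rollout
    return True

if __name__ == "__main__":
    good = {"final_answer": "42", "steps": [{"code": "print(42)"}]}
    bad = {"final_answer": "", "steps": []}
    print(keep_trajectory(good), keep_trajectory(bad))  # True False
```
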
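The dynamically adjustable blend of SFT and RL can be pictured as a single scalar weight on the two losses that shifts over training. The linear schedule and start/end weights here are assumptions for illustration, not the paper's exact objective.

```python
def blended_loss(sft_loss: float, rl_loss: float, step: int, total_steps: int,
                 alpha_start: float = 0.9, alpha_end: float = 0.1) -> float:
    """Weight the SFT and RL losses with a coefficient that moves from
    imitation-heavy to reinforcement-heavy as training progresses."""
    progress = min(step / max(total_steps, 1), 1.0)
    alpha = alpha_start + (alpha_end - alpha_start) * progress  # linear schedule
    return alpha * sft_loss + (1.0 - alpha) * rl_loss

if __name__ == "__main__":
    # Toy values: the blend drifts from SFT-dominated to RL-dominated.
    for step in (0, 500, 1000):
        print(step, round(blended_loss(2.0, 0.5, step, 1000), 3))
```
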
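One way to realize memory-frugal multi-turn code rollout is to run each turn's code in a short-lived subprocess, so interpreter memory is released between turns, and to feed only a truncated observation back into the context. The cap and timeout values in this sketch are arbitrary, and the paper's rollout machinery may differ.

```python
import subprocess
import sys

MAX_OBS_CHARS = 2000   # cap on how much output re-enters the agent's context

def run_code_turn(code: str, timeout_s: int = 30) -> str:
    """Run one turn's code in a fresh subprocess so interpreter memory is
    released after the turn, and truncate the observation fed back."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        output = proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        output = "[execution timed out]"
    return output[:MAX_OBS_CHARS]

if __name__ == "__main__":
    print(run_code_turn("print(sum(range(10)))"))  # prints 45
```
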
Deliverables include DataMind-12K, a high-quality trajectory set spanning domains, task types, and file formats. Models trained on it achieve standout results: DataMind-14B reaches 71.16% average across data-analysis benchmarks, outperforming strong proprietary baselines (e.g., DeepSeek‑V3.1 and GPT‑5), while DataMind-7B leads among open models at 68.10%. The authors will release DataMind-12K and model checkpoints to accelerate community progress.

Why it matters: Real-world analytics demands long-horizon reasoning, tool use, and robustness across messy data—precisely what DataMind targets. With an open, reproducible pipeline, it provides a practical path to scalable, capable, and transparent analytic agents.

Paper: arXiv: DataMind
Register: https://www.AiFeta.com

#Agents #DataAnalytics #CodeAgents #SFT #ReinforcementLearning #OpenSource #ToolUse #LLM