Scaling Synthetic Task Generation for Agents via Exploration

AutoPlay explores apps first—then synthesizes diverse, verifiable tasks at scale.

Training capable UI agents is bottlenecked by the scarcity of high-quality, grounded downstream tasks. AutoPlay tackles this by explicitly exploring interactive environments to discover capabilities and states before generating tasks. The result: diverse, executable, and verifiable tasks that reflect what the environment can actually do—without heavy human annotation.

The pipeline has two stages:

  • Exploration: An MLLM explorer systematically uncovers novel app states and functionalities.
  • Task Generation: A task generator uses exploration trajectories plus guideline prompts to synthesize tasks with built-in verifiability.
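The two stages above can be sketched in miniature. Everything below is a hypothetical illustration: a toy BFS stands in for the MLLM explorer, and a string template stands in for the guideline-prompted task generator; the app graph, function names, and task schema are all invented for the sketch.

```python
from collections import deque

def explore(app_graph, start):
    """Stage 1: systematically uncover novel states.
    A toy BFS stands in for the MLLM explorer; each step records
    (state, action, next_state), forming an exploration trajectory."""
    seen, frontier, trajectory = {start}, deque([start]), []
    while frontier:
        state = frontier.popleft()
        for action, nxt in app_graph.get(state, []):
            trajectory.append((state, action, nxt))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return trajectory

def generate_tasks(trajectory):
    """Stage 2: turn exploration steps into tasks with built-in verifiability.
    A template stands in for the MLLM task generator; each task carries a
    verifier that checks the agent's final state."""
    tasks = []
    for state, action, nxt in trajectory:
        tasks.append({
            "instruction": f"From '{state}', {action} to reach '{nxt}'.",
            # Bind nxt now so each task verifies its own target state.
            "verify": (lambda target: lambda final_state: final_state == target)(nxt),
        })
    return tasks

# Toy app: states mapped to (action, next_state) pairs.
app = {
    "home": [("open settings", "settings"), ("open mail", "inbox")],
    "settings": [("toggle wifi", "settings")],
}
tasks = generate_tasks(explore(app, "home"))
```

Because tasks are generated only from transitions actually traversed during exploration, every instruction is feasible by construction.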

At scale, AutoPlay produces 20k tasks over 20 Android apps and 10k tasks over 13 Ubuntu apps. These tasks enable automated demonstration synthesis via an MLLM executor and verifier. Training on this data boosts agent success rates by up to 20.0% (mobile-use) and 10.9% (computer-use). Adding RL with verifier-based rewards yields a further +5.7% gain—evidence that exploration-grounded tasks are a foundation for scalable post-training.
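The demonstration-synthesis loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the executor stub, task schema, and function names are assumptions; the point is that the task's verifier both filters trajectories into demonstrations and doubles as a binary RL reward.

```python
def synthesize_demonstrations(tasks, executor, max_steps=10):
    """Run an executor policy on each task; keep only trajectories the task's
    verifier accepts. The 1/0 verifier outcome is also usable as an RL reward."""
    demos, rewards = [], []
    for task in tasks:
        trajectory = executor(task["instruction"], max_steps)
        reward = 1.0 if task["verify"](trajectory[-1]) else 0.0
        rewards.append(reward)
        if reward:  # verified successes become training demonstrations
            demos.append((task["instruction"], trajectory))
    return demos, rewards

# Toy tasks with final-state verifiers (hypothetical schema).
tasks = [
    {"instruction": "Open settings", "verify": lambda s: s == "settings"},
    {"instruction": "Archive all mail", "verify": lambda s: s == "inbox_empty"},
]

def stub_executor(instruction, max_steps):
    # Stand-in for an MLLM executor; succeeds only on the settings task.
    return ["home", "settings"] if "settings" in instruction else ["home", "home"]

demos, rewards = synthesize_demonstrations(tasks, stub_executor)
```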

Why it matters: Grounding tasks in actually reachable states solves a chronic problem in synthetic data generation—plausible but infeasible instructions. AutoPlay’s exploration-first design improves diversity, feasibility, and coverage, making it a powerful engine for building practical UI agents.

Paper: arXiv: AutoPlay
Register: https://www.AiFeta.com

#Agents #TaskGeneration #UIAutomation #MLLM #ReinforcementLearning #Scaling #DataGeneration #HCI
