Cogito, Ergo Ludo: An Agent that Learns to Play by Reasoning and Planning
From tabula rasa to transparent mastery: an LLM agent that induces rules and writes its own playbook
Most deep RL agents master games by amassing opaque experience. CEL—Cogito, ergo ludo—takes a different path: it learns to play by reasoning and planning. Powered by a Large Language Model, CEL explicitly infers environment rules and synthesizes a strategic playbook from raw episodes, starting with no prior knowledge beyond the action set.
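To make the idea concrete, here is a minimal sketch of what "reasoning and planning over explicit knowledge" could look like in code. It assumes a generic text-completion callable `llm`, and the names (`AgentKnowledge`, `choose_action`) are illustrative, not taken from the paper: the agent's entire world model is plain text, and each move is planned by prompting over that text plus the current observation and legal actions.

```python
# Illustrative sketch only; names and prompt wording are hypothetical, not the paper's.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentKnowledge:
    rules: str = "No rules known yet."        # language-based model of environment dynamics
    playbook: str = "No strategies yet."      # distilled strategic guidance

def choose_action(llm: Callable[[str], str],
                  knowledge: AgentKnowledge,
                  observation: str,
                  actions: List[str]) -> str:
    """Plan the next move by reasoning in language over the agent's current beliefs."""
    prompt = (
        f"Known rules of the environment:\n{knowledge.rules}\n\n"
        f"Strategic playbook:\n{knowledge.playbook}\n\n"
        f"Current observation:\n{observation}\n\n"
        f"Legal actions: {', '.join(actions)}\n"
        "Reason step by step, then answer with exactly one legal action."
    )
    reply = llm(prompt)
    # Fall back to the first legal action if the reply does not name one.
    return next((a for a in actions if a in reply), actions[0])
```

Because the rules and playbook are just strings, the agent starts from a genuinely blank slate: nothing beyond the action set needs to be hard-coded.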
After each episode, CEL reflects on the entire trajectory via two concurrent processes. Rule Induction refines a language‑based model of environment dynamics. Strategy and Playbook Summarization distills experiences into actionable guidance for future decisions. This cycle of interaction and reflection yields an agent that not only improves, but can also explain what it believes about the world and why its policy should work.
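Continuing the sketch above, the post-episode reflection could look roughly like this. The `reflect` function, the trajectory record fields (`step`, `obs`, `action`, `reward`), and the prompt wording are assumptions for illustration; the paper describes Rule Induction and Strategy/Playbook Summarization as concurrent processes, shown here sequentially for simplicity.

```python
def reflect(llm: Callable[[str], str],
            knowledge: AgentKnowledge,
            trajectory: List[dict]) -> AgentKnowledge:
    """Post-episode reflection: refine rules and distill strategy from the full trajectory."""
    transcript = "\n".join(
        f"step {t['step']}: obs={t['obs']} action={t['action']} reward={t['reward']}"
        for t in trajectory
    )
    # Rule Induction: revise the language-based model of environment dynamics.
    rules = llm(
        f"Current rule hypotheses:\n{knowledge.rules}\n\n"
        f"Episode transcript:\n{transcript}\n\n"
        "Revise the rules so they are consistent with everything observed."
    )
    # Strategy and Playbook Summarization: turn experience into actionable guidance.
    playbook = llm(
        f"Current playbook:\n{knowledge.playbook}\n\n"
        f"Episode transcript:\n{transcript}\n\n"
        "Update the playbook with concrete, reusable advice for future episodes."
    )
    return AgentKnowledge(rules=rules, playbook=playbook)
```

Alternating `choose_action` during episodes with `reflect` afterward is the interaction-reflection cycle described above, and the updated rules and playbook remain human-readable at every step.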
Evaluated on Minesweeper, Frozen Lake, and Sokoban—diverse grid‑world tasks with sparse rewards—the agent autonomously discovers fundamental mechanics and develops effective policies. Ablations confirm the necessity of iterative reflection: the loop is what sustains learning, bridging the gap between raw experience and explicit, reusable knowledge.
Why it matters: CEL points toward interpretable agents that learn generalizable abstractions, not just policies. By externalizing knowledge in language, teams can inspect, debug, and adapt strategies—crucial for safety and transfer.
Who should care: RL researchers exploring symbolic‑neural hybrids, builders of transparent decision‑making systems, and practitioners who need verifiable reasoning in safety‑critical environments.
Paper: "Cogito, Ergo Ludo" (arXiv)
#RL #LLMAgents #Planning #Explainability #InterpretableAI #GameAI #Reasoning