Cogito, Ergo Ludo: An Agent that Learns to Play by Reasoning and Planning
CEL learns rules and strategies from scratch by reflecting on its own trajectories.
Instead of memorizing behaviors from vast experience, CEL (Cogito, Ergo Ludo) learns to play by reasoning and planning in language. Starting tabula rasa (knowing only the action set), the agent interacts with an environment, then reflects on the full episode to update two explicit, human-readable artifacts: (1) a rule model of environment dynamics and (2) a strategic playbook distilled from experience.
This interaction–reflection cycle powers two concurrent processes, sketched in code below:
- Rule Induction: Discover and refine the environment’s mechanics from observed trajectories.
- Strategy and Playbook Summarization: Extract reusable, actionable tactics for future episodes.
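In code, the cycle might look like the minimal sketch below. It assumes an `llm(prompt) -> str` completion function and a gym-style `env` with `reset()` and `step()`; all class, prompt, and helper names here are illustrative, not taken from the paper.

```python
# A minimal sketch of CEL's interaction-reflection loop (illustrative, not
# the paper's implementation). Assumes `llm(prompt) -> str` and a gym-like env.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class CELAgent:
    llm: Callable[[str], str]          # language model used for planning and reflection
    actions: List[str]                 # the only prior knowledge: the action set
    rules: str = "No rules known yet." # artifact 1: human-readable environment dynamics
    playbook: str = "No tactics yet."  # artifact 2: distilled strategic advice

    def act(self, observation: str) -> str:
        """Plan in language: pick an action using the current rules and playbook."""
        prompt = (
            f"Known rules:\n{self.rules}\n\nPlaybook:\n{self.playbook}\n\n"
            f"Observation:\n{observation}\n\n"
            f"Choose one action from {self.actions}. Reply with the action only."
        )
        choice = self.llm(prompt).strip()
        return choice if choice in self.actions else self.actions[0]  # fall back if malformed

    def reflect(self, trajectory: List[Tuple[str, str, float]]) -> None:
        """Post-episode reflection: refine both artifacts from the full trajectory."""
        transcript = "\n".join(f"obs={o} act={a} reward={r}" for o, a, r in trajectory)
        # Rule induction: revise the dynamics model against the observed evidence.
        self.rules = self.llm(
            f"Current rules:\n{self.rules}\n\nEpisode:\n{transcript}\n\n"
            "Rewrite the rules so they are consistent with this episode."
        )
        # Strategy summarization: distill reusable tactics for future episodes.
        self.playbook = self.llm(
            f"Current playbook:\n{self.playbook}\n\nEpisode:\n{transcript}\n\n"
            "Update the playbook with reusable, actionable tactics."
        )

def run_episode(agent: CELAgent, env) -> None:
    """One interaction-reflection cycle: act for a full episode, then reflect."""
    trajectory, obs, done = [], env.reset(), False
    while not done:
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)  # assumed gym-like step signature
        trajectory.append((obs, action, reward))
        obs = next_obs
    agent.reflect(trajectory)
```

Note that both artifacts live entirely in text: reflection is a prompt that rewrites them, which is what makes the learned knowledge inspectable and editable.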
Evaluated on grid-world tasks such as Minesweeper, Frozen Lake, and Sokoban, CEL autonomously uncovers the rules and learns effective policies from sparse rewards, with no external annotations or prior task knowledge required. Ablations confirm that iterative reflection is essential for sustained improvement, suggesting a promising route to interpretable, general agents that explain not just what they do, but why.
Why it matters: By encoding knowledge in language, CEL offers transparency, editability, and transferability. Designers can inspect the learned rules and strategies, adapt them to new settings, or deliberately scaffold learning, bridging the gap between black-box performance and human-understandable intelligence.
Paper: Cogito, Ergo Ludo (arXiv)
Register: https://www.AiFeta.com
#Agents #Planning #Reasoning #GameAI #Interpretability #LLM #SelfReflection #Explainability