Teaching LLMs to Decide Better, One Regret at a Time

LLMs are starting to act as “agents,” making choices in dynamic settings—but they often struggle with exploration vs. exploitation and rack up high regret (missed reward vs. the best strategy in hindsight).
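
For concreteness, in a standard bandit-style framing (an illustrative assumption; the paper spans broader settings), cumulative regret over T rounds is

  R(T) = \sum_{t=1}^{T} ( \mu^{*} - \mu_{a_t} ),

i.e., the gap between the expected reward \mu^{*} of the best fixed action in hindsight and that of each chosen action a_t.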

A new approach, Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), trains models to decide better by learning from their own best trials (a toy sketch of the loop follows the list):

  • Roll out many decision trajectories.
  • Pick the k lowest‑regret ones.
  • Fine-tune the model on those trajectories and their natural‑language reasoning.
  • Repeat.

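To make the loop concrete, here is a minimal, self-contained toy in Python: a probability vector over bandit arms stands in for the LLM, and a simple re-weighting toward low-regret rollouts stands in for fine-tuning. All names and numbers are illustrative, not the paper's implementation.

import random

# Toy stand-in for the Iterative RMFT loop on a 3-armed bandit.
# The "model" is a probability vector over arms; "fine-tuning" is a
# re-weighting toward actions seen in the lowest-regret rollouts.

ARM_MEANS = [0.2, 0.5, 0.8]   # hidden Bernoulli reward means; arm 2 is best
HORIZON = 20                  # decisions per trajectory

def rollout(policy):
    """Sample one trajectory: a list of (action, reward) pairs."""
    traj = []
    for _ in range(HORIZON):
        a = random.choices(range(len(policy)), weights=policy)[0]
        r = 1.0 if random.random() < ARM_MEANS[a] else 0.0
        traj.append((a, r))
    return traj

def regret(traj):
    """Cumulative regret vs. always playing the best arm in hindsight."""
    return HORIZON * max(ARM_MEANS) - sum(r for _, r in traj)

def finetune(policy, best_trajs, lr=0.5):
    """Shift probability mass toward actions taken in low-regret trajectories."""
    counts = [1e-9] * len(policy)
    for traj in best_trajs:
        for a, _ in traj:
            counts[a] += 1
    total = sum(counts)
    target = [c / total for c in counts]
    mixed = [(1 - lr) * p + lr * t for p, t in zip(policy, target)]
    norm = sum(mixed)
    return [m / norm for m in mixed]

def iterative_rmft(num_iters=10, num_rollouts=64, k=8):
    policy = [1 / 3] * 3                      # start from a uniform policy
    for _ in range(num_iters):
        trajs = [rollout(policy) for _ in range(num_rollouts)]
        trajs.sort(key=regret)                # roll out, then rank by regret
        policy = finetune(policy, trajs[:k])  # "fine-tune" on the k best
    return policy

print(iterative_rmft())   # probability mass should concentrate on arm 2

In the paper itself, the rollouts are natural-language decision trajectories with reasoning, and the re-weighting step is supervised fine-tuning of the LLM on the k lowest-regret trajectories.
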
Unlike methods that copy fixed algorithms or rely on rigid chain‑of‑thought templates, Iterative RMFT uses regret as the signal—letting the model surface its own rationales without heavy output engineering.

Results: improved decision-making across different model types (from numeric I/O Transformers to open‑weight LLMs and even GPT‑4o mini) and across tasks with varied horizons, action spaces, rewards, and language contexts. Theory also shows a single‑layer Transformer can be a no‑regret learner in a simplified setting.

Why it matters: fewer brittle prompts and more adaptable, lower‑regret AI agents. Paper by Park, Chen, Ozdaglar, Zhang.

Paper: http://arxiv.org/abs/2511.04393v1

Register: https://www.AiFeta.com

#AI #LLM #Agents #DecisionMaking #ReinforcementLearning #MachineLearning #Research