The Era of Real-World Human Interaction: RL from User Conversations
RL from Human Interaction links long‑term personas to turn‑level preferences for scalable, personalized alignment
Most alignment today leans on expert‑crafted datasets and static preferences. This paper proposes Reinforcement Learning from Human Interaction (RLHI): learning directly from in‑the‑wild user conversations. Instead of treating chats as fixed supervision, RLHI converts the stream of organic user feedback into a training signal for policy improvement, yielding better personalization and instruction following without manual, one‑off annotation.
The authors introduce two complementary methods. First, User‑Guided Rewrites: the model revises unsatisfactory responses by interpreting the user’s natural‑language follow‑ups as corrective signals. Second, User‑Based Rewards: a reward model is conditioned on a user’s long‑term interaction history (persona), linking durable preferences to immediate decisions via persona‑conditioned preference optimization.
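To make the two signals concrete, here is a minimal Python sketch of how such data might be assembled, assuming a generic `llm.generate` text interface and a DPO‑style objective. The function names, the corrective‑cue heuristic, and the choice to prepend the persona summary to the prompt are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): mining corrective follow-ups into
# preference pairs and scoring them with a persona-conditioned, DPO-style loss.
# `llm.generate`, the cue heuristic, and the persona handling are assumptions.
from dataclasses import dataclass

import torch.nn.functional as F


@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str


@dataclass
class PreferencePair:
    prompt: str     # conversation context shown to the policy
    chosen: str     # preferred response (here: the user-guided rewrite)
    rejected: str   # dispreferred response (the original draft)
    persona: str    # summary of the user's long-term interaction history


def is_corrective(follow_up: Turn) -> bool:
    """Crude stand-in for detecting dissatisfaction in the next user turn."""
    cues = ("no,", "that's not", "actually", "i meant", "not what i asked")
    return follow_up.role == "user" and follow_up.text.lower().startswith(cues)


def user_guided_rewrite(llm, context: str, draft: str, feedback: str) -> str:
    """Ask the model to revise its own answer, treating the user's
    natural-language follow-up as the corrective signal."""
    prompt = (f"{context}\n\nAssistant draft:\n{draft}\n\n"
              f"User feedback:\n{feedback}\n\n"
              "Rewrite the draft so that it satisfies the feedback.")
    return llm.generate(prompt)  # hypothetical text-generation interface


def build_pair(llm, persona: str, history: list[Turn], i: int) -> PreferencePair | None:
    """If turn i is an assistant reply followed by a corrective user message,
    emit (rewrite > original) as a persona-tagged preference pair."""
    if history[i].role != "assistant" or i + 1 >= len(history):
        return None
    follow_up = history[i + 1]
    if not is_corrective(follow_up):
        return None
    context = "\n".join(f"{t.role}: {t.text}" for t in history[:i])
    rewrite = user_guided_rewrite(llm, context, history[i].text, follow_up.text)
    return PreferencePair(prompt=context, chosen=rewrite,
                          rejected=history[i].text, persona=persona)


def persona_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective over sequence log-probs; the persona conditioning
    comes from prepending PreferencePair.persona to the prompt before the
    log-probs are computed (an assumption about how the reward is conditioned)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```

The paper's actual pipeline may detect dissatisfaction and summarize personas differently; the point of the sketch is the flow it describes: mine corrective follow‑ups into (rewrite > original) preference pairs, then condition the preference objective on the user's long‑term persona.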
Trained on WildChat‑derived conversations, both variants outperform strong baselines on personalization and instruction following. Notably, the same style of feedback also yields gains on reasoning benchmarks, evidence that natural, longitudinal human signals can scale both alignment and capability.
Why it matters: alignment that adapts with each user’s evolving needs is essential for trustworthy assistants. RLHI reframes messy real‑world conversations as a rich, continuous supervision source—one that scales naturally as user pools grow.
Who should care: teams building personalized assistants, platforms running persistent chat experiences, and researchers exploring long‑horizon preference learning that goes beyond one‑shot human feedback.
Paper: arXiv: RL from User Conversations
Register: AiFeta
#RLHF #Personalization #Alignment #ConversationalAI #RL #UserModeling #LLM