The Era of Real-World Human Interaction: RL from User Conversations
RL from Human Interaction links long‑term personas to turn‑level preferences for scalable, personalized alignment
Most alignment today leans on expert‑crafted datasets and static preferences. This paper proposes Reinforcement Learning from Human Interaction (RLHI): learning directly from in‑the‑wild user conversations. Instead of treating chats as fixed supervision, RLHI converts the stream of organic user feedback into a training signal for policy improvement, yielding better personalization and instruction following without manual, one‑off annotation.
The authors introduce two complementary methods. First, User‑Guided Rewrites: the model revises unsatisfactory responses by interpreting the user’s natural‑language follow‑ups as corrective signals. Second, User‑Based Rewards: a reward model is conditioned on a user’s long‑term interaction history (persona), linking durable preferences to immediate decisions via persona‑conditioned preference optimization.
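To make the two signals concrete, here is a minimal Python sketch of how such data might be assembled, assuming a generic `llm.generate` text interface and a DPO‑style objective. The function names, the corrective‑cue heuristic, and the choice to prepend the persona summary to the prompt are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): mining corrective follow-ups into
# preference pairs and scoring them with a persona-conditioned, DPO-style loss.
# `llm.generate`, the cue heuristic, and the persona handling are assumptions.
from dataclasses import dataclass

import torch.nn.functional as F


@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str


@dataclass
class PreferencePair:
    prompt: str     # conversation context shown to the policy
    chosen: str     # preferred response (here: the user-guided rewrite)
    rejected: str   # dispreferred response (the original draft)
    persona: str    # summary of the user's long-term interaction history


def is_corrective(follow_up: Turn) -> bool:
    """Crude stand-in for detecting dissatisfaction in the next user turn."""
    cues = ("no,", "that's not", "actually", "i meant", "not what i asked")
    return follow_up.role == "user" and follow_up.text.lower().startswith(cues)


def user_guided_rewrite(llm, context: str, draft: str, feedback: str) -> str:
    """Ask the model to revise its own answer, treating the user's
    natural-language follow-up as the corrective signal."""
    prompt = (f"{context}\n\nAssistant draft:\n{draft}\n\n"
              f"User feedback:\n{feedback}\n\n"
              "Rewrite the draft so that it satisfies the feedback.")
    return llm.generate(prompt)  # hypothetical text-generation interface


def build_pair(llm, persona: str, history: list[Turn], i: int) -> PreferencePair | None:
    """If turn i is an assistant reply followed by a corrective user message,
    emit (rewrite > original) as a persona-tagged preference pair."""
    if history[i].role != "assistant" or i + 1 >= len(history):
        return None
    follow_up = history[i + 1]
    if not is_corrective(follow_up):
        return None
    context = "\n".join(f"{t.role}: {t.text}" for t in history[:i])
    rewrite = user_guided_rewrite(llm, context, history[i].text, follow_up.text)
    return PreferencePair(prompt=context, chosen=rewrite,
                          rejected=history[i].text, persona=persona)


def persona_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective over sequence log-probs; the persona conditioning
    comes from prepending PreferencePair.persona to the prompt before the
    log-probs are computed (an assumption about how the reward is conditioned)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```

The paper's actual pipeline may detect dissatisfaction and summarize personas differently; the point of the sketch is the flow it describes: mine corrective follow‑ups into (rewrite > original) preference pairs, then condition the preference objective on the user's long‑term persona.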
Trained on WildChat‑derived conversations, both variants outperform strong baselines on personalization and instruction following. Notably, the same style of feedback also yields gains on reasoning benchmarks, evidence that natural, longitudinal human signals can scale both alignment and capability.
Why it matters: alignment that adapts with each user’s evolving needs is essential for trustworthy assistants. RLHI reframes messy real‑world conversations as a rich, continuous supervision source—one that scales naturally as user pools grow.
Who should care: teams building personalized assistants, platforms running persistent chat experiences, and researchers exploring long‑horizon preference learning that goes beyond one‑shot human feedback.
Paper: arXiv: RL from User Conversations
Register: AiFeta
#RLHF #Personalization #Alignment #ConversationalAI #RL #UserModeling #LLM