LLM-as-a-Judge: Can AI pick the best slate for you?
Can an LLM judge the best playlist, not just the next song?
Recommender systems often serve slates: ordered lists of items, such as a home feed or a playlist. Modeling what a person prefers over whole lists, and across domains, is hard.
This study tests Large Language Models as a 'world model' of user preferences: the LLM compares two slates and reasons which one a user would like more. The authors benchmark several LLMs on three tasks and datasets, then link performance to properties of the underlying preference function.
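To make the pairwise setup concrete, here is a minimal sketch of how such a comparison could be posed and scored. It is not the authors' protocol: the prompt wording, the function names (`build_pairwise_prompt`, `parse_choice`, `pairwise_accuracy`), and the `judge` callable standing in for an LLM call are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Slate = Sequence[str]
# (user history, slate A, slate B, ground-truth preferred slate: "A" or "B")
Example = Tuple[Sequence[str], Slate, Slate, str]


def build_pairwise_prompt(history: Sequence[str], slate_a: Slate, slate_b: Slate) -> str:
    """Format a prompt asking which of two slates the user would prefer.

    The wording is hypothetical, not taken from the paper.
    """
    lines = ["A user recently engaged with these items:"]
    lines += [f"- {item}" for item in history]
    lines += [
        "",
        "Slate A: " + ", ".join(slate_a),
        "Slate B: " + ", ".join(slate_b),
        "",
        "Reason briefly about which slate this user would like more.",
        "Then give your final answer as a single letter, A or B, on the last line.",
    ]
    return "\n".join(lines)


def parse_choice(response: str) -> str:
    """Read the verdict from the last non-empty line; it must be exactly A or B."""
    non_empty = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not non_empty or non_empty[-1].upper() not in ("A", "B"):
        raise ValueError("response does not end with a lone 'A' or 'B'")
    return non_empty[-1].upper()


def pairwise_accuracy(judge: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of comparisons where the judge picks the labeled slate.

    `judge` is any function mapping a prompt to a text response; in practice
    it would wrap an LLM call, which is left out here on purpose.
    """
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for history, slate_a, slate_b, label in examples:
        prompt = build_pairwise_prompt(history, slate_a, slate_b)
        if parse_choice(judge(prompt)) == label:
            correct += 1
    return correct / len(examples)
```

Any callable that takes a prompt and returns text can be plugged in as `judge`, so the same scoring loop can be reused across models.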
- LLMs capture useful structure in preferences via pairwise reasoning.
- Performance rises and falls with how consistent and expressive the preference signals are.
- Results point to clear improvement paths for prompts, training, and evaluation.
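The summary does not define what "consistent" means for a preference signal. One common reading, assumed here purely for illustration, is transitivity: if A beats B and B beats C, then A should beat C. The sketch below measures how often a set of pairwise verdicts breaks that rule. The function name `cyclic_triple_rate` and the `winner` mapping are hypothetical, not from the paper.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Sequence


def cyclic_triple_rate(slate_ids: Sequence[str], winner: Dict[FrozenSet[str], str]) -> float:
    """Share of slate triples whose pairwise verdicts form a cycle.

    `winner[frozenset({x, y})]` holds the id of the preferred slate for that
    pair. A triple is cyclic (intransitive) when each of its three slates
    wins exactly one of the three comparisons.
    """
    triples = list(combinations(slate_ids, 3))
    if not triples:
        return 0.0
    cyclic = 0
    for a, b, c in triples:
        wins = {a: 0, b: 0, c: 0}
        for x, y in ((a, b), (b, c), (a, c)):
            wins[winner[frozenset((x, y))]] += 1
        if all(count == 1 for count in wins.values()):
            cyclic += 1
    return cyclic / len(triples)
```

A rate of 0.0 means the verdicts can be arranged into a single ranking; higher values mean more cycles among the judged pairs.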
Why it matters: LLM 'judges' could make slate recommenders more robust, handle cold starts, and generalize beyond a single domain.
Paper: http://arxiv.org/abs/2511.04541v1 — by Baptiste Bonin, Maxime Heuillet, and Audrey Durand.
LLM RecommenderSystems SlateRecommendation Personalization AI MachineLearning WorldModels IR Research