Smarter LLM Pretraining: Beyond URLs
What’s new
Adding the right metadata can make LLM pretraining faster and more effective—and it’s not just about URLs.
- Fine‑grained signals work: Prepending detailed quality indicators helps models learn more quickly.
- Append-and-predict: Appending metadata and training the model to predict it as an auxiliary task boosts efficiency.
- Learnable meta‑tokens: Special tokens trained with masked loss recover part of the speedup by inducing quality‑aware structure.
- Common thread: The most useful metadata encodes information at a finer granularity than coarse source labels.
- Why it works: Probing shows that metadata guides the model's internal representations during learning.
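The prepend and append‑and‑predict strategies above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function and token names are hypothetical. The key difference is which positions contribute to the training loss:

```python
# Hypothetical sketch of two metadata-injection strategies for LLM
# pretraining examples. Token values and names are illustrative only.

def build_example(doc_tokens, meta_tokens, mode):
    """Return (input_tokens, loss_mask) for one pretraining example.

    mode="prepend": metadata comes first as conditioning context; the
    loss is computed only on document tokens (mask = 0 on metadata).
    mode="append_predict": metadata comes last and the model must
    predict it as an auxiliary target, so it also receives loss.
    """
    if mode == "prepend":
        tokens = meta_tokens + doc_tokens
        mask = [0] * len(meta_tokens) + [1] * len(doc_tokens)
    elif mode == "append_predict":
        tokens = doc_tokens + meta_tokens
        mask = [1] * len(doc_tokens) + [1] * len(meta_tokens)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return tokens, mask


# Example: a 3-token document with 2 metadata tokens.
doc = [101, 102, 103]
meta = [7, 8]

prepend_tokens, prepend_mask = build_example(doc, meta, "prepend")
append_tokens, append_mask = build_example(doc, meta, "append_predict")
```

Under this framing, append‑and‑predict turns metadata prediction into an auxiliary task, while prepending treats metadata purely as context the model conditions on.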
Takeaways
- Prefer fine‑grained, trustworthy metadata over coarse labels.
- Either prepend metadata or use append‑and‑predict—both can help.
- Keep metadata compact to avoid token bloat.
Paper by Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, and Martin Jaggi. Read: https://arxiv.org/abs/2511.21613v1
Register: https://www.AiFeta.com
#LLM #AIResearch #MachineLearning #NLP #Pretraining #Metadata #Efficiency