Small, smart, and synthetic: distilling data for pre-trained vision models

Large vision models are now trained once and reused by fitting simple "linear probes" on their frozen features. This paper asks: can a tiny set of synthetic images replace massive real datasets for training those probes?

Enter Linear Gradient Matching: it learns a handful of synthetic images such that, when passed through a frozen feature extractor (e.g., DINO, CLIP), they induce nearly the same gradients in the linear classifier as real data does.
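
For intuition, here is a minimal PyTorch-style sketch of that objective. It assumes a frozen feature extractor, a cross-entropy linear probe, and a cosine gradient-matching loss; the names (backbone, syn_images, probe_w) and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the gradient-matching idea, assuming a frozen backbone
# and a cross-entropy linear probe. All names and choices here (backbone,
# syn_images, the cosine matching loss) are illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def probe_gradient(features, labels, weight):
    """Gradient of the linear probe's cross-entropy loss w.r.t. its weight."""
    logits = features @ weight.t()
    loss = F.cross_entropy(logits, labels)
    # create_graph=True so we can later backprop through this gradient
    return torch.autograd.grad(loss, weight, create_graph=True)[0]

feat_dim, num_classes = 384, 10
backbone = lambda x: x.flatten(1)[:, :feat_dim]  # stand-in for a frozen DINO/CLIP encoder

real_images = torch.randn(64, 3, 32, 32)          # placeholder "real" batch
real_labels = torch.randint(0, num_classes, (64,))
syn_images = torch.randn(num_classes, 3, 32, 32, requires_grad=True)  # learnable synthetic set
syn_labels = torch.arange(num_classes)            # one distilled image per class

probe_w = torch.randn(num_classes, feat_dim, requires_grad=True)
opt = torch.optim.Adam([syn_images], lr=1e-2)     # only the synthetic pixels are optimized

for step in range(200):
    g_real = probe_gradient(backbone(real_images), real_labels, probe_w)
    g_syn = probe_gradient(backbone(syn_images), syn_labels, probe_w)

    # Distillation objective: synthetic data should induce the same probe
    # gradients as real data (cosine distance is one common matching loss).
    match_loss = 1 - F.cosine_similarity(g_real.flatten(), g_syn.flatten(), dim=0)

    opt.zero_grad()
    match_loss.backward()
    opt.step()
```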

  • Outperforms real-image baselines for linear probing in the authors' tests.
  • Generalizes across models: a set distilled with a DINO backbone can train a competitive CLIP probe.
  • Excels on fine-grained categories.
  • Doubles as an interpretability tool—revealing similarity between models’ embedding spaces and flagging spurious correlations on adversarial datasets.

Why it matters: faster prototyping, lower storage and compute, and safer data sharing—without starting from scratch.

Paper: https://arxiv.org/abs/2511.16674v1

Register: https://www.AiFeta.com

#AI #ComputerVision #DatasetDistillation #SelfSupervisedLearning #ML #CLIP #DINO #Interpretability
