Atlas-Alignment: Making Interpretability Transferable Across Language Models
Interpreting what large language models “think” is slow and expensive. Each new model often needs custom tools and lots of manual labeling. Atlas-Alignment offers a shortcut: instead of rebuilding everything, it lines up a new model’s hidden activity with a shared, human-labeled Concept Atlas using only overlapping inputs and lightweight representational alignment. Once aligned, the labels and tools already built for the atlas carry over to the new model without fresh annotation.
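To make the idea concrete, here is a minimal sketch of one way such an alignment could work, not the authors' implementation: fit a ridge-regularized linear map between paired activations from the two models on the same inputs, then read off the nearest labeled concept in atlas space. All names (`new_acts`, `atlas_acts`, `concept_vectors`) and the synthetic data are illustrative assumptions.

```python
# Hedged sketch: linear alignment of a new model's activations into a
# labeled "Concept Atlas" space, using only activations on shared inputs.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, d_new, d_atlas, n_concepts = 1000, 512, 256, 10

# Paired activations, one row per overlapping input (synthetic stand-ins).
new_acts = rng.normal(size=(n_inputs, d_new))      # new model's hidden states
atlas_acts = rng.normal(size=(n_inputs, d_atlas))  # atlas model's labeled space

# Fit W so that new_acts @ W ~= atlas_acts via ridge-regularized least
# squares; this is the "lightweight alignment" step.
lam = 1e-2
W = np.linalg.solve(
    new_acts.T @ new_acts + lam * np.eye(d_new),
    new_acts.T @ atlas_acts,
)

# Human labels attached to atlas directions now transfer: project a fresh
# activation and pick the concept vector with the highest cosine similarity.
concept_vectors = rng.normal(size=(n_concepts, d_atlas))  # one per labeled concept
x = rng.normal(size=d_new)                                # a fresh activation
aligned = x @ W
scores = concept_vectors @ aligned / (
    np.linalg.norm(concept_vectors, axis=1) * np.linalg.norm(aligned)
)
print("best-matching concept:", int(scores.argmax()))
```

The key point the sketch illustrates is that no new labeling happens for the new model; only paired activations on shared inputs are needed, and the human-interpretable structure lives entirely in the atlas.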