MiClip: Learning to Interpret Representation in Vision Models

International Conference on Learning Representations (ICLR) 2026

ShanghaiTech University
*Equal contribution, Corresponding author
Figure 1. Framework of MiClip: (a) the target models (frozen); (b) contrastive learning of the shared embedding space; (c) mechanistic feature localization and description; (d) model steering; (e) feature disentanglement with k-SAE (Optional).

Abstract

Vision models have demonstrated remarkable capabilities, yet their decision-making processes remain largely opaque. Mechanistic interpretability (MI) offers a promising avenue to decode these internal workings. However, existing interpretation methods suffer from two key limitations. First, they rely on the flawed activation-magnitude assumption: that a neuron's importance is directly reflected by the magnitude of its activation, which ignores more nuanced causal roles. Second, they are predominantly input-centric, failing to capture the causal mechanisms that drive a model's output. These shortcomings lead to inaccurate and unreliable interpretations of internal representations, especially in cases of incorrect predictions. We propose MiClip (Mechanism-Interpretability via Contrastive Learning), a novel framework that extends CLIP's contrastive learning to align the internal mechanisms of vision models with general semantic concepts, enabling interpretable and controllable representations. Our approach circumvents previous limitations by performing multimodal alignment between a model's internal representations and both its input concepts and output semantics via contrastive learning. We demonstrate that MiClip is a general framework applicable to diverse representation unit types, including individual neurons and sparse autoencoder (SAE) features. By enabling precise, causal-aware interpretation, MiClip not only reveals the semantic properties of a model's internals but also paves the way for effective and targeted manipulation of model behaviors.

Motivation

A range of interpretability methodologies have been proposed to decode the internal representations of vision models into human-understandable semantics. Despite these advances, existing approaches face notable limitations as follows (Figure 2).

  • Relying on the activation-magnitude assumption: for any representation unit, a larger activation value is interpreted as indicating a stronger presence of the unit’s associated concept in the model’s information-processing pipeline.
  • Remaining input-centric: many methods focus only on aligning internal representations with concepts present in the input, missing how the model processes those concepts toward its output.
Figure 2. Comparison between MiClip and previously proposed methods with mentioned limitations.

To overcome these issues, we propose MiClip, a representation-based automated mechanistic interpretability framework that bridges the gap between the internal mechanisms of vision models and human-understandable concepts, through aligning them in a shared semantic embedding space. It eliminates the reliance on the activation-magnitude assumption and jointly leverages semantic signals from both the inputs and outputs of vision models, thereby capturing a more comprehensive and faithful view of the model’s reasoning process.

Methodology

MiClip is a two-phase framework that aligns model internals with human-understandable concepts:

  1. Create a shared embedding space for model mechanisms and human semantics through contrastive learning;
  2. Describe model internals with text, or localize human-understandable concepts in the model.

Together, these two procedures enable MiClip to support fine-grained model steering through unit-level interventions.

Mechanism-Concept Alignment via Contrastive Learning

With pretrained CLIP models, MiClip avoids relying on manual heuristics and directly maps the internal representations of vision models into the human-understandable visual-language semantic space of CLIP.

Specifically, we project the input image $x_i \in \mathcal{X}$ and the model's predicted label $\hat{c}_i$ with CLIP's vision encoder $\mathrm{E}_{\text{i}}(\cdot)$ and text encoder $\mathrm{E}_{\text{c}} (\cdot)$ into the embedding space. We then learn a neuron encoder $\mathrm{E}_{\text{n}}(\cdot; \theta_{\text{n}})$ that maps the model's hidden representations $\mathbf{a}_i \in \mathcal{A}$ to the same space. Finally, we optimize the symmetric InfoNCE loss to align the model's internal space with the visual-language semantic space.

$$ \mathcal{L}_{\mathrm{alignment}} = \underbrace{\mathcal{L}_{\mathrm{CLIP}}^{\text{out}} \left(\mathrm{E_n}(\mathcal{A}; \theta_{\text{n}}), \mathrm{E_c}(\{\hat{c}_i\}_{i=1}^N) \right)}_{\text{neuron-concept loss}} + \underbrace{\mathcal{L}_{\mathrm{CLIP}}^{\text{in}} \left(\mathrm{E_n}(\mathcal{A}; \theta_{\text{n}}), \mathrm{E_i}(\mathcal{X}) \right)}_{\text{neuron-image loss}} $$
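The alignment objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the three encoders have already produced batched embedding matrices whose matching rows form positive pairs, and the temperature value is an assumption.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE (CLIP-style) loss between two embedding batches;
    row i of z_a and row i of z_b are treated as a positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(z_a))

    def xent(l):
        # cross-entropy of the diagonal (positive) entries over each row
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average both retrieval directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))

def alignment_loss(neuron_emb, concept_emb, image_emb):
    """L_alignment = neuron-concept loss + neuron-image loss."""
    return info_nce(neuron_emb, concept_emb) + info_nce(neuron_emb, image_emb)
```

With well-aligned embeddings the diagonal dominates each softmax and the loss approaches zero; mismatched embeddings push it toward log N per term.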

Mechanism Localization and Description

Once trained, MiClip enables both Concept-to-Mechanism Localization with open-vocabulary concepts, and Mechanism-to-Concept Description for either activation neurons or learned Sparse Autoencoder (SAE) features. This is achieved by comparing the cosine similarities $\operatorname{sim}(\cdot,\cdot)$ between embeddings.

Concept-to-Mechanism Localization

Given a concept $c$, we want to locate the neurons or features that closely align with it from the set $\mathcal{U}$ of all possible representation units. We identify the top $\tau$ representation units related to $c$ via $\text{SelectTop-}\tau$ and record their indices, ranked by similarity score, in the set $\mathbf{L}_c$:

$$ \mathbf{L}_c = \underset{i}{\text{SelectTop-}\tau} \left( \{\mathbf{sim}(u_i, c)\}_{u_i \in \mathcal{U}} \right). $$

Mechanism-to-Concept Description

Given a representation unit $u$ (either a neuron or an SAE feature), we find the concepts from a set $\mathcal{C}$ that best describe its mechanism. We identify the top $\tau$ concepts related to $u$ and record them, ranked by relevance score, in the set $\mathbf{D}_u$:

$$ \mathbf{D}_u = \underset{j}{\text{SelectTop-}\tau} \left( \{\mathbf{sim}(u, c_j)\}_{c_j \in \mathcal{C}} \right). $$
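Both retrieval directions reduce to a top-$\tau$ selection over cosine similarities in the shared space. The sketch below assumes units and concepts are given as precomputed embedding matrices; the function names are illustrative, not from the paper.

```python
import numpy as np

def _cosine_rows(M, v):
    """Cosine similarity between each row of M and the vector v."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    v = v / np.linalg.norm(v)
    return M @ v

def select_top_tau(sims, tau):
    """Indices of the tau highest scores, in descending order."""
    return np.argsort(sims)[::-1][:tau].tolist()

def localize_concept(unit_embs, concept_emb, tau=5):
    """Concept-to-Mechanism: L_c = top-tau units most similar to concept c."""
    return select_top_tau(_cosine_rows(unit_embs, concept_emb), tau)

def describe_unit(unit_emb, concept_embs, tau=5):
    """Mechanism-to-Concept: D_u = top-tau concepts most similar to unit u."""
    return select_top_tau(_cosine_rows(concept_embs, unit_emb), tau)
```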

Model Control with Mechanism Intervention

For a given concept $c$, we collect its corresponding units indexed by $\mathbf{L}_c$ (neurons or SAE features), and adjust their activations to suppress or amplify the concept’s influence on the model.

For each representation unit $u_i$ ($u_i = a_i \cdot e^{(i)}$ for neurons or $u_i = f_i$ for SAE features) indexed in $\mathbf{L}_{c}$, we apply either a multiplicative scaling or an additive bias:

$$ \tilde{u}_i = \beta u_i \;\text{(Scaling)} \quad \text{or} \quad \tilde{u}_i = u_i + \beta \;\text{(Adding)}, \quad \forall i \in \mathbf{L}_{c},\ \beta \in \mathbb{R}. $$

By choosing an appropriate value of $\beta$, we can suppress or amplify the target feature to adjust the model's behavior.
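A minimal sketch of the intervention step, assuming activations arrive as an array whose last axis indexes representation units. The `mode` names mirror the Scaling/Adding variants above; this is illustrative, not the authors' code.

```python
import numpy as np

def intervene(acts, loc_indices, beta, mode="scale"):
    """Adjust the activations of the units indexed by L_c.

    mode="scale": u_i' = beta * u_i  (beta=0 removes, beta=2 enhances)
    mode="add":   u_i' = u_i + beta
    """
    out = np.array(acts, dtype=float, copy=True)
    if mode == "scale":
        out[..., loc_indices] *= beta
    elif mode == "add":
        out[..., loc_indices] += beta
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```

In practice such an adjustment would be applied via a forward hook at the chosen layer, leaving all other units untouched.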

Results

Intervention on Discriminative Vision Models

We verify our localization by intervening on the top-5 neurons or features for each ImageNet-1k concept. We measure the change in classification accuracy $\Delta Acc$ after enhancement ($\times 2$ scaling) or removal ($\times 0$ scaling) interventions on the activations of the following layers: the 10th layer for ViT-B/16 and CLIP/ViT-B-16, and the stages.3.layers.1.shortcut layer for ResNet-50.

Results show that MiClip enables precise localization of the mechanisms that govern model classification. Baselines such as Act-Values even exhibit inconsistent responses, as shown in red.

(a) Intervention on neurons

| Method | ResNet-50 (Enh. ↑) | ViT-B/16 (Enh. ↑) | CLIP (Enh. ↑) | ResNet-50 (Rem. ↓) | ViT-B/16 (Rem. ↓) | CLIP (Rem. ↓) |
|---|---|---|---|---|---|---|
| Act-Values | 2.27 (± 0.03) | -0.19 (± 0.01) | -8.05 (± 0.02) | -8.98 (± 0.08) | **-1.43** (± 0.01) | **-23.30** (± 0.00) |
| Network Dissection | 0.78 (± 0.02) | **0.35** (± 0.01) | 0.23 (± 0.02) | -2.95 (± 0.15) | -0.37 (± 0.03) | -0.88 (± 0.05) |
| CLIP-dissect | 3.05 (± 0.18) | 0.19 (± 0.03) | -0.04 (± 0.03) | -12.31 (± 0.67) | -0.04 (± 0.02) | -1.16 (± 0.14) |
| V-Interp | 1.71 (± 0.22) | -0.04 (± 0.02) | -0.29 (± 0.10) | -8.04 (± 0.71) | -0.04 (± 0.00) | -0.14 (± 0.05) |
| MiClip (Ours) | **5.32** (± 0.03) | 0.18 (± 0.01) | **1.10** (± 0.05) | **-17.24** (± 0.05) | -0.04 (± 0.02) | -1.50 (± 0.08) |

(b) Intervention on SAE features

| Method | ResNet-50 (Enh. ↑) | ViT-B/16 (Enh. ↑) | CLIP (Enh. ↑) | ResNet-50 (Rem. ↓) | ViT-B/16 (Rem. ↓) | CLIP (Rem. ↓) |
|---|---|---|---|---|---|---|
| Act-Values | **4.34** (± 0.00) | 3.68 (± 0.02) | 0.43 (± 0.08) | **-11.98** (± 0.00) | -22.77 (± 0.11) | -15.94 (± 0.06) |
| Network Dissection | 0.02 (± 0.04) | 1.12 (± 0.05) | 0.50 (± 0.06) | -0.08 (± 0.05) | -4.03 (± 0.13) | -1.99 (± 0.05) |
| CLIP-dissect | 2.27 (± 0.09) | 5.04 (± 0.05) | 4.85 (± 0.03) | -7.30 (± 0.03) | -27.78 (± 0.09) | -11.05 (± 0.12) |
| V-Interp | 0.91 (± 0.02) | 1.90 (± 0.01) | 1.33 (± 0.00) | -2.88 (± 0.09) | -7.55 (± 0.06) | -2.83 (± 0.00) |
| MiClip (Ours) | 3.89 (± 0.03) | **5.57** (± 0.02) | **5.88** (± 0.03) | -10.99 (± 0.02) | **-32.04** (± 0.20) | **-17.70** (± 0.02) |
Table 2: Accuracy deviations of enhancement and removal interventions on neurons and features. Best performing methods are highlighted in bold. Values that contradict the expected outcome (e.g., enhancement leading to a decrease in accuracy) are marked in red.

Attention-Map Visualization of Localized Features

We investigate the spatial grounding of the learned features by examining their activations within the model's attention maps. To show the generalization of MiClip, we test both high-level concepts like "kit fox" and low-level concepts like colors and textures.

Experimental results highlight that MiClip identifies a feature that consistently activates around the ears of the "kit fox", a key visual identifier for this class. Likewise, the visualization for the concept "green" confirms that our method can ground foundational visual concepts, such as specific colors, to their corresponding spatial locations within an image.

Figure 3. We visualize the spatial grounding of localized SAE features. The attention map highlights the precise locations of both a high-level, specific concept ("kit fox") and a low-level color concept ("green"). Images are from ImageNet-1k.

Discover Flawed Visual Reasoning in Model Prediction

We leverage MiClip to diagnose failures in visual reasoning in the CLIP/ViT-B-16 model by tracing the semantic trajectory of an image's internal representations across layers.

Figure panels: diagnosis on the "sea" and "lawn" examples.
Figure 4. The plot shows cosine similarities between layer-wise representation embeddings and the text embeddings of the ground-truth (GT; blue) and misclassified (Mis.; orange) labels. One marker indicates the layer where GT dominates most, while another marks where Mis. overtakes GT, revealing the failure point.

Contributions

  1. Challenging the "Activation-Magnitude" Assumption.
  2. An early exploration of interpreting models with a dual-anchored input-output grounding paradigm.
  3. An alignment framework able to discover different granularities of concepts across various representation units and layers in different models.
  4. Thorough and multi-faceted validations.

Citation

@inproceedings{
  shi2026miclip,
  title={{MICLIP}: Learning to Interpret Representation in Vision Models},
  author={Yingdong Shi and Zhiyu Yang and Changming Li and Jingyi Yu and Kan Ren},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=28Hfz8RLcD}
}