Patch Explorer:
Interpreting Diffusion Models through Interaction

Imke Grabe1,2, Jaden Fiotto-Kaufman2, Rohit Gandikota2, David Bau2
1IT University of Copenhagen, 2Northeastern University

Links: MIV @ CVPR Workshop Paper · Source Code (Github) · MIV @ CVPR Workshop Poster · CVPR Art Gallery · Demo

Does Interaction via Diffusion Models' Internals support Interpretability?

Patch Explorer is an interactive interface for visualizing and manipulating patches as they are processed by the cross-attention heads of a diffusion model. Built on interventions via NNsight, the interface lets users inspect and manipulate individual attention heads across layers and timesteps. Interaction through the interface reveals that attention heads independently capture semantics in diffusion models, like a unicorn's horn. Beyond offering a way to analyze model behavior, Patch Explorer also lets users intervene to edit semantic associations within diffusion models, like adding a unicorn horn to a horse. Precise interventions further help clarify the role of individual diffusion timesteps. By providing a visualization tool with interactivity grounded in attention heads, we aim to shed light on their role in the generative process.



How do Diffusion Models Encode Semantic Concepts?

Latent diffusion models operate in a compressed latent space rather than directly in pixel space. The latent space is organized into patches, which are spatial units that correspond to regions in the output image. At every layer of the model's U-Net, here laid out horizontally, multiple attention heads, ordered vertically, perform the attention mechanism in parallel, and this repeats over several timesteps. Patch Explorer lets users intervene in the generation process by applying interventions to the patches as they are processed.
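To make the patch-grid idea concrete, here is a minimal sketch in plain Python (with hypothetical grid dimensions, not the model's actual sizes) of how a flattened patch index maps back to a cell in the spatial grid:

```python
# Hypothetical latent grid size for illustration only.
H, W = 64, 64

# Attention layers flatten the spatial latent into a sequence of H*W patches;
# each sequence index maps back to a (row, col) cell in the patch grid.
def patch_to_cell(idx, width=W):
    return divmod(idx, width)  # (row, col)

def cell_to_patch(row, col, width=W):
    return row * width + col

# Round trip: a grid cell and its flattened index identify the same patch.
assert patch_to_cell(cell_to_patch(10, 3)) == (10, 3)
```

This index-to-cell correspondence is what lets a patch grid in the interface target specific spatial regions of the output image.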

Interface components of Patch Explorer: After (1) providing a text prompt and a seed to generate (2) an image, users can inspect the (3) visualized patch grids of the individual attention heads. Magenta represents positive and cyan represents negative activation. A slider (4) lets users inspect timesteps for a selected range. For the selected timestep range, users can choose (5) an intervention to apply to selected patch grids.

Direct Manipulation of Cross-Attention Heads

At the core of diffusion models is the attention mechanism, which enables content-based interactions between different spatial locations. In the cross-attention layers of diffusion models, the K and V matrices are derived from text encodings, while Q comes from the image representation. We propose to target the input and output of cross-attention heads through direct manipulation.
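As an illustration of this split, the NumPy sketch below computes a single cross-attention head with toy, made-up dimensions: Q has one row per image patch, while K and V have one row per text token. This is a generic sketch of the mechanism, not the model's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, n_tokens, d_head = 16, 8, 4  # toy sizes for illustration

# Q comes from the image representation; K and V come from the text encoding.
Q = rng.standard_normal((n_patches, d_head))
K = rng.standard_normal((n_tokens, d_head))
V = rng.standard_normal((n_tokens, d_head))

def cross_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (n_patches, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # one output row per patch

out = cross_attention(Q, K, V)  # this per-patch output is what the patch grid visualizes
```

Because the head's output has one row per latent patch, it can be rendered directly as a spatial grid, which is the representation Patch Explorer exposes for interaction.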

Direct Manipulation of the cross-attention mechanism: A (1) Patch Grid offers a representation to spatially target the outputs of attention heads with interventions. The intervention (2) Scaling multiplies the output of the attention head by a given scalar, while (3) Encoding replaces the output for targeted patches with the output for an alternative text encoding provided by the user.
In that way, patch grids become a new interaction modality, letting users interfere with the internal states of the model:
Interacting with patch grids: The user chooses a patch grid by clicking on it, after which holding the shift key lets them ''draw'' by moving the mouse over patches to select them, marking them green. Clicking again allows users to quickly select many attention heads.
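The two interventions described above can be sketched as simple array operations on a head's per-patch output; the sizes, values, and selected patches below are illustrative, not the actual implementation:

```python
import numpy as np

# Toy head output: one d_head-dim vector per latent patch (hypothetical sizes).
n_patches, d_head = 16, 4
head_out = np.ones((n_patches, d_head))
alt_out = np.full((n_patches, d_head), 5.0)  # head output under an alternative prompt

selected = np.zeros(n_patches, dtype=bool)
selected[[2, 3, 6]] = True  # patches the user ''drew'' on the grid

def scale(head_out, selected, factor):
    """Scaling: multiply the head's output at selected patches by a scalar."""
    out = head_out.copy()
    out[selected] *= factor  # a factor of 0 ablates the head's contribution
    return out

def encode(head_out, alt_out, selected):
    """Encoding: replace selected patches with the alternative prompt's output."""
    out = head_out.copy()
    out[selected] = alt_out[selected]
    return out
```

Both interventions leave unselected patches untouched, which is what makes them spatially precise.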

Can we find Specific Visual Concepts through Interaction?

The interface lets users explore the role of cross-attention heads in the generation process. For example, we find that two attention heads are responsible for generating the horn on the head of a unicorn.

Inspecting attention heads: By adjusting the timestep slider, the user can inspect how the horn feature evolves over time at Layer 9, Heads 3 and 4.
By interacting with these attention heads, images can be altered, e.g. through scaling.
Applying interventions with Patch Explorer: In the (1) dropdown menu, the user selects an intervention, like Scaling, and applies it to the selected (2) timestep range by (3) selecting the desired patches.
Scaling the attention heads' output to 0 ablates their effect on the residual stream, making the unicorn's horn disappear. Increasing their effect, on the other hand, amplifies the feature, and not only for unicorn horns:
Scaling the two attention heads amplifies or removes the horn not only for unicorns, but also for other horned animals, confirming these heads' general role in generating horns.

The attention heads can be used to transfer the visual feature to other horse-like concepts. For example, for the prompt ''Pegasus'', a unicorn horn can be added by encoding ''unicorn'' into relevant patches. Additionally, we find that the Pegasus turns into a regular horse when scaling down the influence of patches at Layer 8, Head 7, which seems to be responsible for generating its wing.

Restricting interventions to specific timestep ranges shows how features are formed throughout the generation process, like the unicorn horn on the horse's head, or the Pegasus' wings.
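Such timestep gating can be sketched in a few lines; the denoising loop and step count below are hypothetical, standing in for wherever the hook is actually called during generation:

```python
# Sketch: apply an intervention only within a user-selected timestep range,
# as with Patch Explorer's timestep slider (hypothetical hook, not the real API).
def maybe_intervene(head_out, t, t_range, intervention):
    t_lo, t_hi = t_range
    if t_lo <= t <= t_hi:
        return intervention(head_out)
    return head_out

ablate = lambda x: 0.0 * x  # e.g. scale the head's contribution to zero

# Ablate the head only during the first 10 of, say, 50 denoising steps.
assert maybe_intervene(1.0, 5, (0, 9), ablate) == 0.0   # inside the range
assert maybe_intervene(1.0, 20, (0, 9), ablate) == 1.0  # outside: untouched
```

Sweeping `t_hi` upward, as in the examples above, reveals at which steps a feature like the horn or the wings is actually formed.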

Evolution of horn over timesteps: By encoding the prompt ''unicorn'' at the relevant attention heads for a growing number of timesteps, we can observe how a horn is added to a horse.
Evolution of wings over timesteps: By gradually increasing the contribution of the attention head that causes the Pegasus' wings, we can inspect how they are formed over timesteps.

For a detailed usage scenario with more examples, take a look at our paper linked above.

How to cite

The paper can be cited as follows.

bibliography

Imke Grabe, Jaden Fiotto-Kaufman, Rohit Gandikota, David Bau. "Patch Explorer: Interpreting Diffusion Models through Interaction." Mechanistic Interpretability for Vision at CVPR 2025 (Non-proceedings Track).

bibtex

@inproceedings{grabe2025patch,
  title={Patch Explorer: Interpreting Diffusion Models through Interaction},
  author={Imke Grabe and Jaden Fiotto-Kaufman and Rohit Gandikota and David Bau},
  booktitle={Mechanistic Interpretability for Vision at CVPR 2025 (Non-proceedings Track)},
  year={2025},
  url={https://openreview.net/forum?id=0n9wqVyHas}
}