Splicing ViT Features for Semantic Appearance Transfer

Supplementary Material

ViT Features Visualization

We show feature inversion and PCA self-similarity visualizations for supervised ViT [8], using the same examples shown in the paper (Fig. 3-5) for DINO-ViT [4]. As can be seen, the features contain detailed information about the input image; however, the signal is noisier in supervised ViT than in DINO-ViT, as discussed in Sec. 3.1.
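To make the PCA self-similarity visualization concrete, here is a minimal sketch of one way such a map could be computed. It assumes the keys are given as an (n_tokens, d) array, that self-similarity means pairwise cosine similarity, and that each token's row of the similarity matrix is projected onto the 3 leading principal components and shown as RGB. The random `keys` array, the 14x14 token grid, the dimension 64, and the function names are placeholders for illustration, not values or code from the paper.

```python
import numpy as np

def self_similarity(keys):
    """Cosine similarity between every pair of token keys: (n, d) -> (n, n)."""
    unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return unit @ unit.T

def pca_rgb(sim, grid_hw):
    """Project each row of `sim` onto its 3 leading principal components
    and rescale each component to [0, 1] so it can be displayed as RGB."""
    centered = sim - sim.mean(axis=0, keepdims=True)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                      # (n, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-8)
    h, w = grid_hw
    return proj.reshape(h, w, 3)

# Placeholder input: random values standing in for real layer-11 keys.
rng = np.random.default_rng(0)
keys = rng.normal(size=(14 * 14, 64))
sim = self_similarity(keys)
rgb = pca_rgb(sim, (14, 14))
```

With real keys, `rgb` can be passed to an image viewer; tokens with similar self-similarity rows then receive similar colors.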

Supervised ViT Features Visualization

Input Images

Key Inversions (layer 11)

PCA of keys’ self-similarity (3 leading components, layer 11)

[CLS] Token Inversions (layer 11)

DINO-ViT Feature Inversion w/o Image Prior

As discussed in Sec. 3.2, feature inversion without any regularization, i.e., directly optimizing the image pixels, does not converge to a meaningful result.
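The mechanics of "directly optimizing the image pixels" can be sketched with a toy example. The feature extractor below is a small random `tanh` layer, which is an assumption made only to keep the example self-contained; it is not a ViT, and the sizes, learning rate, and step count are arbitrary. The sketch only shows that unregularized gradient descent on the pixels can match the target features while staying far from the original image; it does not establish that this is what happens with DINO-ViT features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_features = 256, 32
W = rng.normal(size=(n_features, n_pixels)) / np.sqrt(n_pixels)

def features(x):
    """Toy stand-in for a frozen feature extractor (NOT a ViT)."""
    return np.tanh(W @ x)

def invert(target, steps=1000, lr=0.1):
    """Gradient descent on the pixels themselves, with no regularizer."""
    x = rng.normal(scale=0.1, size=n_pixels)   # random initial image
    losses = []
    for _ in range(steps):
        f = features(x)
        residual = f - target
        losses.append(float(np.sum(residual ** 2)))
        # d/dx ||tanh(Wx) - t||^2 = 2 W^T [ residual * (1 - f^2) ]
        grad = 2.0 * W.T @ (residual * (1.0 - f ** 2))
        x -= lr * grad
    return x, losses

image = rng.uniform(-1.0, 1.0, size=n_pixels)  # flattened placeholder "image"
target = features(image)
recon, losses = invert(target)
pixel_error = float(np.mean((recon - image) ** 2))
print(f"feature loss: {losses[0]:.3f} -> {losses[-1]:.5f}")
print(f"pixel MSE vs. original: {pixel_error:.3f}")
```

In this toy setting there are fewer features than pixels, so many images share the same features and nothing in the objective favors the original one. An image prior is one way to restrict the search to more plausible solutions.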

Key Inversions (layer 11), w/o Deep Image Prior (DIP)

[CLS] Token Inversions (layer 11), w/o DIP