Splicing ViT Features for Semantic Appearance Transfer

Supplementary Material

 

ViT Features Visualization



We show feature inversion and PCA self-similarity visualizations for supervised ViT [8] on the same examples included in the paper (Fig. 3-5) for DINO-ViT [4]. As can be seen, the features contain detailed information of the input image, however the singal is noisier in supervised ViT compared to DINO-ViT, as discussed in Sec.3.1.

 

Supervised ViT Features Visualization

Input Images
Key Inversions (layer 11)
PCA of keys’ self-similarity (3 leading components, layer 11)
[CLS] token inversions (layer 11)
 

DINO-ViT Feature Inversion w/o Image Prior

As discussed in Sec. 3.2, feature inversion w/o any regularization, i.e., optimizing directly the image pixels, is insufficient for converging into a meaningful result.

 

Key Inversions (layer 11), w/o Deep Image Prior (DIP)
[CLS] Token Inversions (layer 11), w/o DIP