Disentangling Structure and Appearance in ViT Feature Space

Supplementary Material

SpliceNet Ablations

Inlier / Outlier Examples

We present examples of inliers and outliers acquired using our pairing method (Sec. 3.5 in the paper).
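To make the inlier/outlier terminology concrete, the following is a minimal sketch of one generic way to split candidate images around a source image: threshold the cosine similarity between global descriptors. It is an illustration only and is not taken from the paper. The actual pairing criterion is the one defined in Sec. 3.5, and the function names, the use of cosine similarity, and the `threshold` value here are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def split_inliers_outliers(source, candidates, threshold=0.5):
    """Split candidates into inliers and outliers relative to `source`.

    `source` is a descriptor vector; `candidates` maps an image name to its
    descriptor. The threshold is a hypothetical value chosen for illustration,
    not a setting reported in the paper.
    """
    inliers, outliers = [], []
    for name, descriptor in candidates.items():
        score = cosine_similarity(source, descriptor)
        if score >= threshold:
            inliers.append((name, score))
        else:
            outliers.append((name, score))
    # Most similar first in both lists.
    inliers.sort(key=lambda pair: pair[1], reverse=True)
    outliers.sort(key=lambda pair: pair[1], reverse=True)
    return inliers, outliers
```

In practice the descriptors would come from a pretrained feature extractor; the toy vectors used to exercise this sketch stand in for them.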

Dogs

Source Image	Inliers				Outliers

Rejected images:

Horses

Source Image	Inliers				Outliers

Rejected images:

Pairing Ablation

We compare results generated by SpliceNet when trained (i) with dataset distillation ("w/ pairing") and (ii) without dataset distillation ("w/o pairing"). The model trained with our distillation method transfers appearance across semantic regions more coherently.

Appearance	Structure	SpliceNet w/ pairing	SpliceNet w/o pairing

CNN Baselines

We compare results generated by SpliceNet when it receives (i) the [CLS] token as input and (ii) the appearance image as input (i.e., the CNN baseline). The model conditioned on the [CLS] token transfers more complex textures (e.g., fur, different colors in different parts) than the CNN baseline.

Appearance	Structure	SpliceNet	SpliceNet CNN Baseline
