In this section, we showcase videos generated by Fairy. We process all frames of the source video, without temporal downsampling or frame interpolation. We resize each output video so that its longer side is 512 pixels, keeping the aspect ratio unchanged.
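This resizing step amounts to a short preprocessing routine; the sketch below is our own illustration (not the released code), and `resize_longer_side` is a hypothetical helper name.

```python
# Hypothetical preprocessing helper (not the authors' code): resize a frame so
# that its longer side becomes 512 pixels while preserving the aspect ratio.
from PIL import Image

def resize_longer_side(frame: Image.Image, target: int = 512) -> Image.Image:
    w, h = frame.size
    scale = target / max(w, h)                       # scale factor for the longer side
    new_size = (round(w * scale), round(h * scale))  # shorter side keeps the same ratio
    return frame.resize(new_size, Image.BICUBIC)
```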
Please scroll left or right to see more results. The first row contains the source videos.
Fairy scales to arbitrarily long videos without memory issues, thanks to the proposed anchor-based attention.
In particular, a 27-second video can be generated within 71.89 seconds on 6 A100 GPUs.
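To make the memory behavior concrete, the snippet below is a minimal sketch of anchor-based cross-frame attention (our own illustration, not the released implementation): keys and values are computed once from a fixed set of anchor frames and reused by every edited frame, so the attention cost per frame stays constant regardless of video length.

```python
# Minimal sketch of anchor-based cross-frame attention (illustration only, not
# the official Fairy code). Keys/values are built once from the anchor frames
# and shared across all frames, so per-frame memory does not grow with length.
import torch
import torch.nn as nn

def anchor_based_attention(frame_tokens: torch.Tensor,
                           anchor_tokens: torch.Tensor,
                           to_q: nn.Linear, to_k: nn.Linear, to_v: nn.Linear):
    """frame_tokens:  (B, N, C) tokens of the frames currently being edited.
    anchor_tokens: (A * N, C) cached tokens from the A anchor frames."""
    q = to_q(frame_tokens)                          # (B, N, C)
    k = to_k(anchor_tokens).unsqueeze(0)            # (1, A*N, C), broadcast over B
    v = to_v(anchor_tokens).unsqueeze(0)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v        # (B, N, C)
```

Because the anchor keys and values are fixed, each frame's cost is independent of the total number of frames, which is why the video can be split across multiple GPUs (e.g., the 6 A100s above) and edited in parallel.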
In this section, we provide video comparisons of our method against the baselines mentioned in Figure 10.
The first column shows the source videos, and the remaining columns show the results of the different methods.
As shown below, Fairy consistently outperforms the baselines in terms of consistency and instruction faithfulness.
In this section, we provide the ablation study results of our method, as mentioned in the paper.
Please refer to Sec. 6.3 in the paper for more details.
"Turn into a baby lion cub" | ||||
---|---|---|---|---|
[ Input Video ] | +Anchor +Equivariant Finetune | +Anchor -Equivariant Finetune | -Anchor -Equivariant Finetune | |
"In Van Gogh style" | ||||
[ Input Video ] | +Anchor +Equivariant Finetune | +Anchor -Equivariant Finetune | -Anchor -Equivariant Finetune | |
"Turn into a fox" | ||||
[ Input Video ] | +Anchor +Equivariant Finetune | +Anchor -Equivariant Finetune | -Anchor -Equivariant Finetune | |
In this section, we provide an ablation study on the number of anchor frames. When the number of anchor frames is 1, the global features the model can leverage are too restricted, which leads to suboptimal edits. Conversely, when the number of anchor frames exceeds 7, quality also gradually degrades and some visual details are lost.
In this section, we provide an ablation study on the number of diffusion steps. The model performs reasonably well when the number of diffusion steps is 10 or above. We therefore set the number of diffusion steps to 10 in all of our experiments to optimize latency.
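As a purely hypothetical illustration of that setting, the snippet below caps the denoising steps at 10 using a generic instruction-based image-editing pipeline from diffusers; the model ID, file path, and pipeline are stand-ins rather than Fairy's actual per-frame editing setup.

```python
# Hypothetical illustration only: capping the denoising steps at 10 with a generic
# instruction-based image-editing pipeline from diffusers. The model ID, file path,
# and pipeline are stand-ins, not Fairy's actual per-frame editing model.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

input_frame = Image.open("frame_0001.png").convert("RGB")  # placeholder video frame
edited = pipe(
    "Turn into a fox",        # editing instruction, as in the ablation above
    image=input_frame,
    num_inference_steps=10,   # 10 steps keep quality reasonable while reducing latency
).images[0]
```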
In this section, we show the limitations of our method.
Our model cannot accurately render dynamic visual effects, such as lightning, flames, or rain.
Inheriting the limitations of the underlying image-editing model, our model also cannot follow instructions that require camera motion,
e.g., zooming out.
"Add lightning" | "Add flames" | "Make it rain" | "Zoom out" |
---|---|---|---|