In this section, we showcase videos generated by Fairy. We process all frames of the source video, without temporal downsampling or frame interpolation. We resize each output video so that its longer side is 512 pixels, keeping the aspect ratio unchanged.
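This resizing step amounts to a short preprocessing routine; the sketch below is our own illustration (not the released code), and `resize_longer_side` is a hypothetical helper name.

```python
# Hypothetical preprocessing helper (not the authors' code): resize a frame so
# that its longer side becomes 512 pixels while preserving the aspect ratio.
from PIL import Image

def resize_longer_side(frame: Image.Image, target: int = 512) -> Image.Image:
    w, h = frame.size
    scale = target / max(w, h)                       # scale factor for the longer side
    new_size = (round(w * scale), round(h * scale))  # shorter side keeps the same ratio
    return frame.resize(new_size, Image.BICUBIC)
```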
Please scroll left or right to see more results. The first row contains the source videos.
Fairy scales to arbitrarily long videos without memory issues, thanks to the proposed anchor-based attention.
In particular, a 27-second video can be generated within 71.89 seconds on 6 A100 GPUs.
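To make the memory behavior concrete, the snippet below is a minimal sketch of anchor-based cross-frame attention (our own illustration, not the released implementation): keys and values are computed once from a fixed set of anchor frames and reused by every edited frame, so the attention cost per frame stays constant regardless of video length.

```python
# Minimal sketch of anchor-based cross-frame attention (illustration only, not
# the official Fairy code). Keys/values are built once from the anchor frames
# and shared across all frames, so per-frame memory does not grow with length.
import torch
import torch.nn as nn

def anchor_based_attention(frame_tokens: torch.Tensor,
                           anchor_tokens: torch.Tensor,
                           to_q: nn.Linear, to_k: nn.Linear, to_v: nn.Linear):
    """frame_tokens:  (B, N, C) tokens of the frames currently being edited.
    anchor_tokens: (A * N, C) cached tokens from the A anchor frames."""
    q = to_q(frame_tokens)                          # (B, N, C)
    k = to_k(anchor_tokens).unsqueeze(0)            # (1, A*N, C), broadcast over B
    v = to_v(anchor_tokens).unsqueeze(0)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v        # (B, N, C)
```

Because the anchor keys and values are fixed, each frame's cost is independent of the total number of frames, which is why the video can be split across multiple GPUs (e.g., the 6 A100s above) and edited in parallel.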
In this section, we provide video comparisons of our method against the baselines mentioned in Figure 10.
The first column shows the source videos, and the remaining columns show the results of the different methods.
As shown below, Fairy consistently outperforms the baselines in terms of consistency and instruction faithfulness.
In this section, we provide the ablation study results of our method, as mentioned in the paper.
Please refer to Sec. 6.3 in the paper for more details.
"Turn into a baby lion cub" | ||||
---|---|---|---|---|
[ Input Video ] | +Anchor +Equivariant Finetune | +Anchor -Equivariant Finetune | -Anchor -Equivariant Finetune | |
"In Van Gogh style" | ||||
[ Input Video ] | +Anchor +Equivariant Finetune | +Anchor -Equivariant Finetune | -Anchor -Equivariant Finetune | |
"Turn into a fox" | ||||
[ Input Video ] | +Anchor +Equivariant Finetune | +Anchor -Equivariant Finetune | -Anchor -Equivariant Finetune | |
In this section, we provide an ablation study on the number of anchor frames. When the number of anchor frames is 1, the global features the model can leverage are too restricted, which leads to suboptimal edits. Conversely, when the number of anchor frames exceeds 7, quality also gradually degrades and some visual details are lost.
In this section, we provide an ablation study on the number of diffusion steps. The model performs reasonably well when the number of diffusion steps is 10 or above. We therefore set the number of diffusion steps to 10 in all of our experiments to optimize latency.
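As a purely hypothetical illustration of that setting, the snippet below caps the denoising steps at 10 using a generic instruction-based image-editing pipeline from diffusers; the model ID, file path, and pipeline are stand-ins rather than Fairy's actual per-frame editing setup.

```python
# Hypothetical illustration only: capping the denoising steps at 10 with a generic
# instruction-based image-editing pipeline from diffusers. The model ID, file path,
# and pipeline are stand-ins, not Fairy's actual per-frame editing model.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

input_frame = Image.open("frame_0001.png").convert("RGB")  # placeholder video frame
edited = pipe(
    "Turn into a fox",        # editing instruction, as in the ablation above
    image=input_frame,
    num_inference_steps=10,   # 10 steps keep quality reasonable while reducing latency
).images[0]
```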
In this section, we show the limitations of our method.
Our model cannot accurately render dynamic visual effects, such as lightning, flames, or rain.
Inheriting the limitations of the underlying image-editing model, our model also cannot follow instructions that require camera motion,
e.g., zooming out.
"Add lightning" | "Add flames" | "Make it rain" | "Zoom out" |
---|---|---|---|