Abstract

In this paper, we introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models that enhances them for video editing. Our approach centers on anchor-based cross-frame attention, a mechanism that implicitly propagates diffusion features across frames, ensuring superior temporal coherence and high-fidelity synthesis. Fairy not only addresses limitations of previous models, including memory and processing speed, but also improves temporal consistency through a unique data augmentation strategy that renders the model equivariant to affine transformations of both source and target images. Remarkably efficient, Fairy generates 120-frame 512x384 videos (4 seconds at 30 FPS) in just 14 seconds, outpacing prior work by at least 44x. A comprehensive user study on 1000 generated samples confirms that our approach delivers superior quality, decisively outperforming established methods.

Method

Fairy re-examines the tracking-and-propagation paradigm in the context of diffusion model features. In particular, we bridge cross-frame attention with correspondence estimation, showing that it temporally tracks and propagates intermediate features inside a diffusion model. The cross-frame attention map can be interpreted as a similarity metric assessing the correspondence between tokens across frames: features from one semantic region assign higher attention to similar semantic regions in other frames, as shown in Fig. 3. Consequently, the current feature representations are refined and propagated through an attention-weighted sum of similar regions across frames, which minimizes feature disparity between frames and translates to improved temporal consistency.

Figure 3: Visualization of Attention Score. The left image shows the query point $p$ within the current frame, and the right image is the target frame. Cross-frame attention performs accurate temporal correspondence estimation without any finetuning.
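To make the correspondence interpretation above concrete, the following minimal sketch (our own illustration, not the paper's code) computes scaled dot-product attention between projected query tokens of the current frame and the keys/values of another frame: each softmax row acts as a soft correspondence map, and the attended output is the propagated feature. Tensor shapes and the helper name are assumptions for illustration.

import torch

def cross_frame_correspondence(q_cur, k_other, v_other):
    """
    q_cur:   (N, d) projected query tokens of the current frame
    k_other: (M, d) projected key tokens of another frame
    v_other: (M, d) projected value tokens of another frame
    Returns the attention map (a soft correspondence) and the propagated features.
    """
    d = q_cur.shape[-1]
    scores = q_cur @ k_other.t() / d ** 0.5  # (N, M) token-to-token similarity
    attn = scores.softmax(dim=-1)            # each row: correspondence distribution over the other frame
    propagated = attn @ v_other              # attention-weighted sum of matched features
    return attn, propagated

# Toy usage: 64-dim tokens from two frames of the same video
q = torch.randn(1024, 64); k = torch.randn(1024, 64); v = torch.randn(1024, 64)
attn_map, feats = cross_frame_correspondence(q, k, v)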

This analysis gives rise to our anchor-based model, the central component of Fairy. To ensure temporal consistency, we sample K anchor frames from which we extract diffusion features; the extracted features define a set of global features to be propagated to successive frames. When generating each new frame, we replace the self-attention layers with cross-frame attention with respect to the cached features of the anchor frames. Through cross-frame attention, the tokens in each frame attend to the anchor-frame features that exhibit analogous semantic content, thereby enhancing consistency.

Figure 4: Illustration of Attention Blocks. (a) Given a set of anchor frames, we extract and cache the attention features \(K_{anc}\) and \(V_{anc}\). (b) Given an input frame, we perform cross-frame attention with respect to the cached features of anchor frames.
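A minimal PyTorch sketch of the two steps in Fig. 4 is given below, assuming standard linear Q/K/V projections as in a diffusion U-Net's self-attention layer; the class and method names (e.g., AnchorCrossFrameAttention, cache_anchors) are illustrative and do not correspond to the released implementation.

import torch
import torch.nn as nn

class AnchorCrossFrameAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)
        self.k_anc = None  # cached anchor keys   (1, H, K*N, d)
        self.v_anc = None  # cached anchor values (1, H, K*N, d)

    def _split_heads(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.heads, -1).transpose(1, 2)  # (b, H, n, d)

    @torch.no_grad()
    def cache_anchors(self, anchor_tokens):
        # (a) extract and cache K_anc / V_anc from the K anchor frames
        # anchor_tokens: (K, N, dim) tokens of the K anchor frames
        k = self._split_heads(self.to_k(anchor_tokens))  # (K, H, N, d)
        v = self._split_heads(self.to_v(anchor_tokens))
        # concatenate anchors along the token axis so every frame attends to all of them
        self.k_anc = k.permute(1, 0, 2, 3).reshape(1, self.heads, -1, k.shape[-1])
        self.v_anc = v.permute(1, 0, 2, 3).reshape(1, self.heads, -1, v.shape[-1])

    def forward(self, frame_tokens):
        # (b) cross-frame attention of the input frames against the cached anchor features
        # frame_tokens: (B, N, dim) tokens of the frames currently being edited
        assert self.k_anc is not None, "call cache_anchors() first"
        q = self._split_heads(self.to_q(frame_tokens))             # (B, H, N, d)
        attn = (q @ self.k_anc.transpose(-2, -1)) * self.scale     # (B, H, N, K*N)
        out = attn.softmax(dim=-1) @ self.v_anc                    # (B, H, N, d)
        out = out.transpose(1, 2).reshape(frame_tokens.shape)      # (B, N, dim)
        return self.to_out(out)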

Qualitative Results

In this section, we showcase videos generated by Fairy. We process all frames of the source video, without temporal downsampling or frame interpolation. We resize the output video so that its longer side is 512 pixels, preserving the aspect ratio.
Please scroll left or right to see more results. The first row contains source videos.
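As a minimal sketch of the resizing convention stated above (the helper name and the use of Pillow are our own choices, not the paper's pipeline):

from PIL import Image

def resize_longer_side(img: Image.Image, target: int = 512) -> Image.Image:
    # Scale so the longer side equals `target`, keeping the aspect ratio unchanged
    w, h = img.size
    scale = target / max(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)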

1. Character/Object Swap

"Turn into a metal knight sculpture" "Turn into a yeti" "Turn into a wood sculpture" "Turn into a marble Roman sculpture"
"Turn into a bronze statue" "Turn into a baby lion cub" "Turn into lion" "Turn into a vintage car"

2. Stylization

"In Van Gogh style" "In low poly art style"
"In Monet style" "In cubism style"
"As pencil sketch" "In Ukiyo-E style"
"Make it Minecraft" "Make it Tokyo"

3. Long Video Generation Results

Thanks to the proposed anchor-based attention, Fairy scales to arbitrarily long videos without memory issues.
In particular, a 27-second video can be generated in 71.89 seconds using 6 A100 GPUs.
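The sketch below illustrates why memory stays bounded with video length, reusing the AnchorCrossFrameAttention sketch from the Method section: the anchor keys/values are cached once, and the remaining frames are processed in fixed-size chunks (which can likewise be dispatched to different GPUs), each attending only to the cached anchors. The chunk size and function name are illustrative assumptions, not values from the paper.

import torch

def process_long_video(frame_tokens, anchor_tokens, attn_block, chunk_size=16):
    """frame_tokens: (T, N, dim) tokens of all frames; anchor_tokens: (K, N, dim)."""
    attn_block.cache_anchors(anchor_tokens)  # anchor K/V computed once, reused for every chunk
    outputs = []
    for start in range(0, frame_tokens.shape[0], chunk_size):
        chunk = frame_tokens[start:start + chunk_size]
        outputs.append(attn_block(chunk))    # each chunk attends only to the cached anchors
    return torch.cat(outputs, dim=0)         # peak memory is set by the chunk size, not by T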

"In low poly art style" "Turn into a knight"
"In Van Gogh style" "As pencil sketch"

User study

We conduct a large-scale user study on an evaluation set consisting of 1000 video-instruction samples. To the best of our knowledge, this is the largest evaluation in the video-to-video generation literature to date.
We compare extensively with existing methods: TokenFlow, Rerender, and Gen-1.
Please see the supplementary videos for qualitative comparisons!

Figure 7. A/B Comparison with Baselines. Fairy significantly surpasses the baseline models, demonstrating its effectiveness.

BibTeX


@article{wu2023fairy,
    title={Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis},
    author={Wu, Bichen and Chuang, Ching-Yao and Wang, Xiaoyan and Jia, Yichen and Krishnakumar, Kapil and Xiao, Tong and Liang, Feng and Yu, Licheng and Vajda, Peter},
    journal={arXiv preprint arXiv:2312.13834},
    year={2023}
}