SAVE: Protagonist Diversification with Structure Agnostic Video Editing

Yeji Song1, Wonsik Shin1, Junsoo Lee2, Jeesoo Kim2, Nojun Kwak1
1Seoul National University, 2NAVER Webtoon AI

Abstract

Previous video editing works usually perform well on trivial and consistent shapes, but easily collapse on a difficult target whose body shape differs largely from that of the original protagonist. In this paper, we spot the bias problem in existing video editing methods that restricts the range of choices for the new protagonist, and we attempt to address this issue using a conventional image-level personalization method. We adopt motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly. To deal with the natural discrepancy between images and videos, we propose a motion word with an inflated textual embedding to properly represent the motion in a source video. We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow, efficiently computed from the pre-calculated attention maps. Finally, we decouple the motion from the appearance of the source video with an additional pseudo word. Extensive experiments demonstrate the editing capability of our method, taking a step toward more diverse and extensive video editing.


Method


Inflated Text Embedding

We expand the textual embedding space of a motion word to represent the flow of time in a video rather than a frozen moment in an image: we add a temporal axis to the embedding space of our new motion word Smot and let Smot inject its information into the proper region of each frame.
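Concretely, the inflated embedding can be thought of as one learnable vector per frame for the same token. The PyTorch sketch below illustrates this idea; the class name, tensor shapes, and injection interface are our own illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class InflatedMotionToken(nn.Module):
    """Illustrative sketch of an inflated textual embedding.

    Instead of a single d-dimensional vector (a frozen moment), the
    motion word Smot holds one learnable embedding per frame, adding a
    temporal axis so every frame receives its own conditioning.
    """
    def __init__(self, num_frames: int, embed_dim: int = 768):
        super().__init__()
        # (T, d): one embedding vector per frame for the same token Smot
        self.embedding = nn.Parameter(torch.randn(num_frames, embed_dim) * 0.02)

    def inject(self, text_embeds: torch.Tensor, token_pos: int) -> torch.Tensor:
        """Splice the per-frame Smot vectors into the text embeddings.

        text_embeds: (T, L, d) text-encoder output repeated per frame,
                     with T equal to num_frames.
        token_pos:   index of the Smot token in the tokenized prompt.
        """
        out = text_embeds.clone()
        out[:, token_pos] = self.embedding  # frame t gets its own Smot vector
        return out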

Pre-registration Strategy

The motion and the protagonist easily get entangled. To resolve this problem, we propose a two-stage training strategy that untangles the two properties. We first define a new pseudo-word Spro that represents the appearance and texture features of the protagonist. Once the protagonist and its appearance are registered in the text encoder, Smot can be effectively learned from the disentangled motion information of the video.
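A minimal sketch of the two-stage schedule follows. The diffusion_loss callable, step counts, and learning rate are hypothetical placeholders standing in for the frozen text-to-video backbone and its training loop.

import torch

def two_stage_personalization(diffusion_loss, frames, video,
                              s_pro_embed, s_mot_embed,
                              steps_pro=500, steps_mot=1000, lr=1e-3):
    """Sketch of pre-registration, then motion learning (names hypothetical).

    diffusion_loss(batch, trainable_tokens) is assumed to run the frozen
    diffusion model with the given token embeddings spliced into the
    prompt and return the denoising loss.
    """
    # Stage 1: register the appearance word Spro on still frames only,
    # so it absorbs the protagonist's appearance and texture.
    opt = torch.optim.AdamW([s_pro_embed], lr=lr)
    for _ in range(steps_pro):
        loss = diffusion_loss(frames, [s_pro_embed])
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: with Spro frozen, only the motion word Smot is updated on
    # the full clip, so it captures the appearance-free motion.
    s_pro_embed.requires_grad_(False)
    opt = torch.optim.AdamW([s_mot_embed], lr=lr)
    for _ in range(steps_mot):
        loss = diffusion_loss(video, [s_pro_embed, s_mot_embed])
        opt.zero_grad()
        loss.backward()
        opt.step()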




Pseudo Optical Flow

Our intuition is that if the k-th pixel of the i-th frame and the l-th pixel of the j-th frame have a high spatio-temporal attention score, they tend to correspond to the same semantic point in different frames. Therefore, by tracking the spatial locations of these similar points across frames, we can estimate the temporal flow of each pixel in the video. Building on this, we introduce a novel pseudo optical flow that better represents the moving area without resorting to costly optical flow models, enabling Smot to focus on the movement.
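Under this intuition, the flow between two frames can be read directly off an attention map by matching each source pixel to its highest-scoring target pixel. The sketch below assumes a pre-computed (h*w, h*w) attention map between frame i and frame j; the function name and shapes are illustrative.

import torch

def pseudo_optical_flow(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Illustrative pseudo optical flow from one spatio-temporal attention
    map (rows: pixels of frame i, columns: pixels of frame j).

    The highest-scoring pixel pair is treated as the same semantic point
    in the two frames, so its positional offset approximates the flow.
    """
    match = attn.argmax(dim=1)  # best-matching target pixel per source pixel

    # Convert flat indices to (x, y) coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    src = torch.stack([xs.flatten(), ys.flatten()], dim=1).float()  # (h*w, 2)
    dst = torch.stack([match % w, match // w], dim=1).float()       # (h*w, 2)

    # Displacement of each pixel between the two frames, shape (h, w, 2).
    return (dst - src).view(h, w, 2)

For instance, pseudo_optical_flow(attn_map, 64, 64) yields a 64x64 flow field whose magnitude highlights the moving area that Smot is regulated to attend to, at no extra cost beyond the attention maps already computed during denoising.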

Comparison

Motion reproduction in edited videos

As our method effectively learns the motion of the original protagonist, it generates a new protagonist that seamlessly reproduces the motion of the source video despite having a significantly different structure from the original one. Meanwhile, the other baselines commonly confine the new protagonist to the silhouette of the original protagonist in the source video and miss the fine-grained movements.

[Video comparison of Ours, Tune-A-Video, Video-P2P, and FateZero: "A cat → dog is roaring"]


[Video comparison of Ours, Tune-A-Video, Video-P2P, and FateZero: "A child → monkey is riding a bike on the road"]


Natural appearance of edited protagonists

Our method also reflects the editing prompts more faithfully than the baselines. Baseline methods cannot overcome the structural discrepancy between the original and the new protagonist: they generate edited videos in which certain segments retain the original protagonist's appearance, or in which flickering appears because the new protagonist's body structure is unstable across frames. In contrast, our method disentangles the appearance and the motion with the separate Spro and Smot, and renders the new protagonist performing Smot from the text encoder from the start, successfully applying the motion features to the new protagonist.

[Video comparison of Ours, Tune-A-Video, Video-P2P, and FateZero: "A cat → Pikachu is sleeping in the grass in the sun"]


[Video comparison of Ours, Tune-A-Video, Video-P2P, and FateZero: "A man → tiger is skiing"]

BibTeX

@misc{song2023save,
      title={SAVE: Protagonist Diversification with Structure Agnostic Video Editing}, 
      author={Yeji Song and Wonsik Shin and Junsoo Lee and Jeesoo Kim and Nojun Kwak},
      year={2023},
      eprint={2312.02503},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}