SAVE: Protagonist Diversification with Structure Agnostic Video Editing

Yeji Song1, Wonsik Shin1, Junsoo Lee2, Jeesoo Kim2, Nojun Kwak1
1Seoul National University, 2NAVER Webtoon AI

Abstract

Previous video editing works usually perform well on trivial and consistent shapes, but easily collapse on a difficult target whose body shape differs largely from that of the original protagonist. In this paper, we spot the bias problem in existing video editing methods that restricts the range of choices for the new protagonist, and we attempt to address this issue using a conventional image-level personalization method. We adopt motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly. To deal with the natural discrepancy between images and videos, we propose a motion word with an inflated textual embedding to properly represent the motion in a source video. We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow, efficiently computed from the pre-calculated attention maps. Finally, we decouple the motion from the appearance of the source video with an additional pseudo word. Extensive experiments demonstrate the editing capability of our method, taking a step toward more diverse and extensive video editing.


Method


Inflated Text Embedding

We expand the textual embedding space of a motion word to represent the flow of time in a video rather than a frozen moment in an image: we add a temporal axis to the embedding space of our new motion word Smot and let Smot inject its information into the proper region of each frame.
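Concretely, the inflated embedding can be thought of as one learnable vector per frame for the same token. The PyTorch sketch below illustrates this idea; the class name, tensor shapes, and injection interface are our own illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class InflatedMotionToken(nn.Module):
    """Illustrative sketch of an inflated textual embedding.

    Instead of a single d-dimensional vector (a frozen moment), the
    motion word Smot holds one learnable embedding per frame, adding a
    temporal axis so every frame receives its own conditioning.
    """
    def __init__(self, num_frames: int, embed_dim: int = 768):
        super().__init__()
        # (T, d): one embedding vector per frame for the same token Smot
        self.embedding = nn.Parameter(torch.randn(num_frames, embed_dim) * 0.02)

    def inject(self, text_embeds: torch.Tensor, token_pos: int) -> torch.Tensor:
        """Splice the per-frame Smot vectors into the text embeddings.

        text_embeds: (T, L, d) text-encoder output repeated per frame,
                     with T equal to num_frames.
        token_pos:   index of the Smot token in the tokenized prompt.
        """
        out = text_embeds.clone()
        out[:, token_pos] = self.embedding  # frame t gets its own Smot vector
        return out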

Pre-registration Strategy

The motion and the protagonist easily get entangled. To resolve this problem, we propose a two-stage training strategy that untangles the two properties. We first define a new pseudo-word Spro that represents the appearance and texture features of the protagonist. Once the protagonist and its appearance are registered in the text encoder, Smot can be effectively learned from the disentangled motion information of the video.
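A minimal sketch of the two-stage schedule follows. The diffusion_loss callable, step counts, and learning rate are hypothetical placeholders standing in for the frozen text-to-video backbone and its training loop.

import torch

def two_stage_personalization(diffusion_loss, frames, video,
                              s_pro_embed, s_mot_embed,
                              steps_pro=500, steps_mot=1000, lr=1e-3):
    """Sketch of pre-registration, then motion learning (names hypothetical).

    diffusion_loss(batch, trainable_tokens) is assumed to run the frozen
    diffusion model with the given token embeddings spliced into the
    prompt and return the denoising loss.
    """
    # Stage 1: register the appearance word Spro on still frames only,
    # so it absorbs the protagonist's appearance and texture.
    opt = torch.optim.AdamW([s_pro_embed], lr=lr)
    for _ in range(steps_pro):
        loss = diffusion_loss(frames, [s_pro_embed])
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: with Spro frozen, only the motion word Smot is updated on
    # the full clip, so it captures the appearance-free motion.
    s_pro_embed.requires_grad_(False)
    opt = torch.optim.AdamW([s_mot_embed], lr=lr)
    for _ in range(steps_mot):
        loss = diffusion_loss(video, [s_pro_embed, s_mot_embed])
        opt.zero_grad()
        loss.backward()
        opt.step()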




Pseudo Optical Flow

Our intuition is that if the k-th pixel of the i-th frame and the l-th pixel of the j-th frame have a high spatio-temporal attention score, they tend to correspond to the same semantic point in different frames. Therefore, by tracking the spatial locations of these similar points across frames, we can estimate the temporal flow of each pixel in the video. Building on this, we introduce a novel pseudo optical flow that better represents the moving area without resorting to costly optical flow models, enabling Smot to focus on the movement.
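Under this intuition, the flow between two frames can be read directly off an attention map by matching each source pixel to its highest-scoring target pixel. The sketch below assumes a pre-computed (h*w, h*w) attention map between frame i and frame j; the function name and shapes are illustrative.

import torch

def pseudo_optical_flow(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Illustrative pseudo optical flow from one spatio-temporal attention
    map (rows: pixels of frame i, columns: pixels of frame j).

    The highest-scoring pixel pair is treated as the same semantic point
    in the two frames, so its positional offset approximates the flow.
    """
    match = attn.argmax(dim=1)  # best-matching target pixel per source pixel

    # Convert flat indices to (x, y) coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    src = torch.stack([xs.flatten(), ys.flatten()], dim=1).float()  # (h*w, 2)
    dst = torch.stack([match % w, match // w], dim=1).float()       # (h*w, 2)

    # Displacement of each pixel between the two frames, shape (h, w, 2).
    return (dst - src).view(h, w, 2)

For instance, pseudo_optical_flow(attn_map, 64, 64) yields a 64x64 flow field whose magnitude highlights the moving area that Smot is regulated to attend to, at no extra cost beyond the attention maps already computed during denoising.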

Comparison

Motion reproduction in edited videos

As our method effectively learns the motion of the original protagonist, it generates a new protagonist that seamlessly reproduces the motion of the source video despite having a significantly different structure from the original one. Meanwhile, the other baselines commonly confine the new protagonist to the silhouette of the original protagonist in the source video and miss the fine-grained movements.

[Video comparison of Ours, Tune-A-Video, Video-P2P, and FateZero: "A cat → dog is roaring"]


[Video comparison of Ours, Tune-A-Video, Video-P2P, and FateZero: "A child → monkey is riding a bike on the road"]


Natural appearance of edited protagonists

Our method also reflects the editing prompts more faithfully than the baselines. Baseline methods cannot overcome the structural discrepancy between the original and the new protagonist: they generate edited videos in which certain segments retain the original protagonist's appearance, or in which flickering appears because the new protagonist's body structure is unstable across frames. In contrast, our method disentangles the appearance and the motion with the separate Spro and Smot, and renders the new protagonist performing Smot from the text encoder from the start, successfully applying the motion features to the new protagonist.

[Video comparison of Ours, Tune-A-Video, Video-P2P, and FateZero: "A cat → Pikachu is sleeping in the grass in the sun"]


[Video comparison of Ours, Tune-A-Video, Video-P2P, and FateZero: "A man → tiger is skiing"]

BibTeX

@misc{song2023save,
      title={SAVE: Protagonist Diversification with Structure Agnostic Video Editing}, 
      author={Yeji Song and Wonsik Shin and Junsoo Lee and Jeesoo Kim and Nojun Kwak},
      year={2023},
      eprint={2312.02503},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}