Point-to-Point: Sparse Motion Guidance for Controllable Video Editing

Yeji Song, Jaehyun Lee, Mijin Koo, JunHoo Lee, Nojun Kwak
Seoul National University

Abstract

Accurately preserving motion while editing a subject remains a core challenge in video editing. Existing methods often face a trade-off between edit and motion fidelity, as they rely on motion representations that are either overfitted to the source layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representation. However, identifying meaningful points without human input remains challenging, especially across diverse video scenarios. To address this, we propose a novel motion representation, anchor tokens, which capture the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior edit and motion fidelity.


Motivation

Signal-based methods extract and leverage an explicit motion signal from the source video, whereas adaptation-based methods embed motion implicitly in the model or latent space by optimizing those representations on the source video. Point-based methods guide the motion using semantic points and their trajectories. Despite their respective strengths, all three approaches struggle to achieve both edit and motion fidelity. Signal-based methods overfit to the spatial layout of the source video, often compromising edit fidelity. Adaptation-based methods often fail to capture precise motion dynamics, leading to low motion fidelity. Point-based methods require manual annotations, and inaccurate points degrade both edit and motion fidelity.


Method

Point-to-Point represents the motion of the source video with anchor tokens: a small set of informative point trajectories selected by leveraging the rich prior of a pretrained video diffusion model, so no manual point annotation is required. Because anchor tokens are sparse and decoupled from the source layout, they can be flexibly relocated to align with the edited subject. The relocated anchor tokens then serve as sparse motion guidance during generation, letting the edited video follow the source motion while remaining faithful to the target edit.
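The sketch below illustrates the general idea of anchor tokens as a small set of representative point trajectories that can be relocated to a new subject. It is not the paper's implementation: here the "informative" trajectories are picked by clustering dense point tracks with k-means, whereas the paper selects them using the video diffusion model's prior. The array shapes, function names (select_anchor_trajectories, relocate_anchors), and the translation-based relocation step are illustrative assumptions.

```python
# Minimal sketch of sparse anchor trajectories, assuming dense point tracks
# are available (e.g., from an off-the-shelf point tracker). This is an
# illustration of the concept, not the selection procedure used in the paper.
import numpy as np
from sklearn.cluster import KMeans


def select_anchor_trajectories(tracks: np.ndarray, num_anchors: int = 8) -> np.ndarray:
    """Pick a small set of representative point trajectories.

    tracks: (P, T, 2) array of P dense point tracks over T frames (x, y).
    Returns indices of the chosen anchor trajectories, shape (num_anchors,).
    """
    # Describe each trajectory by its per-frame displacements so that points
    # moving the same way cluster together regardless of absolute position.
    motion = np.diff(tracks, axis=1).reshape(tracks.shape[0], -1)  # (P, (T-1)*2)

    kmeans = KMeans(n_clusters=num_anchors, n_init=10, random_state=0).fit(motion)

    # Take the trajectory closest to each cluster centre as that cluster's anchor.
    anchors = []
    for c in range(num_anchors):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(motion[members] - kmeans.cluster_centers_[c], axis=1)
        anchors.append(members[np.argmin(dists)])
    return np.array(anchors)


def relocate_anchors(anchor_tracks: np.ndarray, new_center: np.ndarray) -> np.ndarray:
    """Translate anchor trajectories so they align with an edited subject.

    anchor_tracks: (K, T, 2) anchor trajectories on the source subject.
    new_center:    (2,) centre of the new subject in the first frame.
    """
    offset = new_center - anchor_tracks[:, 0].mean(axis=0)  # shift of subject centre
    return anchor_tracks + offset  # same motion pattern, new location


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dense_tracks = rng.normal(size=(500, 16, 2)).cumsum(axis=1)  # toy dense tracks
    idx = select_anchor_trajectories(dense_tracks, num_anchors=8)
    anchors = relocate_anchors(dense_tracks[idx], new_center=np.array([128.0, 96.0]))
    print(anchors.shape)  # (8, 16, 2): sparse trajectories used as motion guidance
```

The point of the sketch is the data flow: dense tracks are compressed into a handful of trajectories that summarize the motion, and those trajectories, not the full layout, are what gets transferred to the edited subject.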

