Accurately preserving motion while editing a subject remains a core challenge in video editing. Existing methods often face a trade-off between edit fidelity and motion fidelity, as they rely on motion representations that are either overfitted to the source layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representations. However, identifying meaningful points without human input remains challenging, especially across diverse video scenarios. To address this, we propose anchor tokens, a novel motion representation that captures the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior edit and motion fidelity.
Signal-based methods extract and leverage an explicit motion signal from the source video, whereas adaptation-based methods embed motion implicitly in the model or latent space by optimizing those representations on the source video. Point-based methods guide motion using semantic points and their trajectories. Despite their respective strengths, all three approaches struggle to satisfy both edit and motion fidelity. Signal-based methods overfit to the spatial layout of the source video, often compromising edit fidelity. Adaptation-based methods often fail to capture precise motion dynamics, leading to low motion fidelity. Point-based methods require manual annotations, and inaccurate points degrade both edit and motion fidelity. To address these challenges, we propose novel, fully automated, motion-aligned points.
First, we define motion tokens, a set of features extracted with a video diffusion model. In contrast to prior works (Geyer et al. 2024; Wang et al. 2024a), where features (often referred to as "tokens") are extracted independently from each latent pixel in every frame, we track these tokens across frames to obtain distinct motion trajectories, yielding features that better represent the motion dynamics. From this set, we select a subset of anchor tokens that capture the most informative motion patterns in the video. Each anchor token captures a distinct local motion, and together they form a comprehensive summary of the overall motion dynamics.
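To make this concrete, the sketch below shows one plausible way to choose anchor tokens from tracked motion tokens: each trajectory is summarized by its time-averaged diffusion feature together with its frame-to-frame displacements, and a small, diverse subset is then selected by farthest-point sampling. The tensor shapes, descriptor design, and sampling criterion are illustrative assumptions, not the paper's exact selection rule.

```python
import torch
import torch.nn.functional as F

def select_anchor_tokens(features, tracks, num_anchors=16):
    """Illustrative anchor-token selection (assumed procedure, not the paper's exact rule).

    features: (T, N, D) diffusion features of N tracked motion tokens over T frames.
    tracks:   (T, N, 2) positions of the same tokens in each frame.
    Returns a LongTensor of `num_anchors` trajectory indices chosen as anchors.
    """
    T, N, _ = tracks.shape
    # Describe each trajectory by its averaged feature and its frame-to-frame displacements.
    feat_desc = F.normalize(features.mean(dim=0), dim=-1)                      # (N, D)
    disp = (tracks[1:] - tracks[:-1]).float().permute(1, 0, 2).reshape(N, -1)  # (N, (T-1)*2)
    desc = torch.cat([feat_desc, F.normalize(disp, dim=-1)], dim=-1)

    # Greedy farthest-point sampling: each new anchor is the trajectory farthest
    # (in descriptor space) from the anchors selected so far, giving a compact
    # yet diverse summary of the video's motion patterns.
    selected = [0]
    min_dist = torch.full((N,), float("inf"))
    for _ in range(num_anchors - 1):
        d = torch.cdist(desc, desc[selected[-1]].unsqueeze(0)).squeeze(-1)     # (N,)
        min_dist = torch.minimum(min_dist, d)
        selected.append(int(min_dist.argmax()))
    return torch.tensor(selected)
```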
Meanwhile, the new subject in the edited video may not align well with the anchor tokens extracted from the source video. To address this, we adjust the anchor tokens to match the new layout: each anchor token is matched to the motion token in the edited video with the most similar feature, and then relocated to that token's position.
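A minimal sketch of this relocation step, assuming a simple nearest-neighbor match under cosine similarity between the source anchors' features and the candidate motion-token features of the edited video; the tensor names and the matching rule are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def relocate_anchor_tokens(anchor_feats, edited_feats, edited_positions):
    """anchor_feats:     (A, D) features of anchor tokens from the source video.
    edited_feats:     (M, D) features of candidate motion tokens in the edited video.
    edited_positions: (M, 2) spatial positions of those candidates.
    Returns (A, 2) new anchor positions, each taken from the most feature-similar candidate.
    """
    # Cosine similarity between every anchor and every candidate motion token.
    sim = F.normalize(anchor_feats, dim=-1) @ F.normalize(edited_feats, dim=-1).T  # (A, M)
    best = sim.argmax(dim=-1)            # index of the most similar candidate per anchor
    return edited_positions[best]        # relocate each anchor to its match's position
```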
[1] Yang, Xiangpeng, et al. "Videograin: Modulating space-time attention for multi-grained video editing." ICLR 2025.
[2] Yatim, Danah, et al. "Space-time diffusion features for zero-shot text-driven motion transfer." CVPR 2024.
[3] Cong, Yuren, et al. "Flatten: optical flow-guided attention for consistent text-to-video editing." ICLR 2024.
[4] Zhang, Yabo, et al. "ControlVideo: Training-free Controllable Text-to-Video Generation." ICLR 2024.
[5] Geyer, Michal, et al. "Tokenflow: Consistent diffusion features for consistent video editing." ICLR 2024.
[6] Zhao, Rui, et al. "Motiondirector: Motion customization of text-to-video diffusion models." ECCV 2024.
@misc{song2025pointtopointsparsemotionguidance,
title={Point-to-Point: Sparse Motion Guidance for Controllable Video Editing},
author={Yeji Song and Jaehyun Lee and Mijin Koo and JunHoo Lee and Nojun Kwak},
year={2025},
eprint={2511.18277},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.18277},
}