Versatile Transition Generation with Image-to-Video Diffusion

1Nanyang Technological University, Singapore
2ByteDance, Singapore
*This work was done while Zuhao Yang was interning at ByteDance.

Shijian Lu is the corresponding author.

ICCV 2025

Abstract

Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high-quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive prompts remains largely underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation that covers two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.

Versatile Transition Generation (VTG)

Task Definition of VTG.

Versatile Transition Generation (VTG) performs four types of transition generation, namely (a) object morphing, (b) motion prediction, (c) concept blending, and (d) scene transition, within a single unified framework.

Training Framework of VTG

In Bidirectional Motion Prediction (BMP), the noisy latent is flipped along the temporal dimension, and self-attention maps are rotated 180 degrees to establish reversed motion–time correlations. Two U-Nets separately predict forward and backward motion, with the backward noise flipped again and fused into the forward noise to ensure a consistent motion path during iterative denoising. In Representation Alignment Regularization (RAR), each frame is spatially patchified independently, and the per-patch alignment losses are then aggregated across the temporal dimension.
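
The operations described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the equal fusion weight, the negative cosine similarity used as the per-patch alignment loss, and the mean over frames are all assumptions made for illustration, and plain arrays stand in for the U-Net outputs and encoder features.

```python
import numpy as np

def flip_time(latent, time_axis=0):
    """Reverse a latent video along its temporal axis."""
    return np.flip(latent, axis=time_axis)

def rotate_attention_180(attn):
    """Rotate a (T, T) temporal attention map by 180 degrees.

    Entry (i, j) moves to (T-1-i, T-1-j), which matches the attention
    pattern of the time-reversed sequence.
    """
    return np.rot90(attn, k=2, axes=(-2, -1))

def fuse_noise(eps_forward, eps_backward, time_axis=0, weight=0.5):
    """Flip the backward prediction back to forward order, then blend.

    The blend weight of 0.5 is an assumption, not a value from the source.
    """
    eps_backward_realigned = flip_time(eps_backward, time_axis)
    return (1.0 - weight) * eps_forward + weight * eps_backward_realigned

def patchify(frame, patch):
    """Split one (H, W, C) frame into (num_patches, patch*patch*C) vectors."""
    h, w, c = frame.shape
    gh, gw = h // patch, w // patch
    x = frame[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def alignment_loss(features, targets, patch=2):
    """Per-frame, per-patch alignment loss aggregated over time.

    Each frame is patchified independently; the per-patch loss is the
    negative cosine similarity (an assumed choice), averaged over patches
    and then over frames.
    """
    per_frame = []
    for f, t in zip(features, targets):
        a, b = patchify(f, patch), patchify(t, patch)
        norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
        cos = np.sum(a * b, axis=1) / norms
        per_frame.append(-cos.mean())
    return float(np.mean(per_frame))
```

In this sketch, a backward prediction that is exactly the time-reversed forward prediction fuses back to the forward prediction unchanged, which is the consistency the double flip is meant to provide.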

Inference Framework of VTG

Our interpolation-based initialization features three components: Interpolated Noise Injection, LoRA Interpolation, and Frame-aware Text Interpolation. VTG first converts the two encoded input frames into latent noises via DDIM inversion. Next, it interpolates between those two latent noises and concatenates the intermediate noises along the temporal dimension. To capture meaningful semantics and enable smooth transitions between conceptually different objects, we employ both LoRA interpolation and text interpolation.
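
The three components of this initialization can be sketched with NumPy as follows. This is a hedged illustration rather than the released code: the DDIM-inverted latents are replaced by random arrays, the use of spherical interpolation for the noise and linear interpolation for LoRA weights and text embeddings is an assumption, and all function and parameter names are hypothetical.

```python
import numpy as np

def slerp(z0, z1, alpha):
    """Spherical interpolation between two noise tensors (assumed scheme)."""
    a, b = z0.ravel(), z1.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1.0 - alpha) * z0 + alpha * z1
    w0 = np.sin((1.0 - alpha) * omega) / np.sin(omega)
    w1 = np.sin(alpha * omega) / np.sin(omega)
    return w0 * z0 + w1 * z1

def interpolated_noise(z_first, z_last, num_frames):
    """Interpolate two latent noises and stack them along a new time axis."""
    alphas = np.linspace(0.0, 1.0, num_frames)
    return np.stack([slerp(z_first, z_last, a) for a in alphas], axis=0)

def lerp_lora(delta_first, delta_last, alpha):
    """Linearly blend two sets of LoRA weight deltas keyed by layer name."""
    return {
        name: (1.0 - alpha) * delta_first[name] + alpha * delta_last[name]
        for name in delta_first
    }

def frame_text_embeddings(e_first, e_last, num_frames):
    """Give each frame its own blend of the two prompt embeddings."""
    alphas = np.linspace(0.0, 1.0, num_frames)
    alphas = alphas.reshape(-1, *([1] * e_first.ndim))
    return (1.0 - alphas) * e_first + alphas * e_last
```

With `num_frames` frames, the first and last entries of each interpolated sequence reproduce the two endpoints, so the generated clip starts and ends at the conditions derived from the input frames and prompts.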

Visual Comparisons on Four Representative Transition Tasks

Object Morphing

Columns: First Frame, Last Frame, Prompts, Ours, DiffMorpher, TVG, SEINE, DynamiCrafter, GI

Prompts (one per row):
a cat is sitting and facing to the right; a dog is sitting and facing forward
an alpaca is facing forward; an alpaca is facing to the right
a dog wearing sunglasses is facing forward; a dog wearing sunglasses is facing to the right
a sculpture; a sculpture has turned his head a bit to the right
a boy is facing forward; a boy is turning his head to the right and smiling

Motion Prediction

Columns: First Frame, Last Frame, Prompts, Ours, DiffMorpher, TVG, SEINE, DynamiCrafter, GI

Prompts (one per row):
a camel is walking slowly to the right
a woman is swinging on a swing
a woman is riding a horse
a man is surfing
a woman is walking

Concept Blending

Columns: First Frame, Last Frame, Prompts, Ours, DiffMorpher, TVG, SEINE, DynamiCrafter, GI

Prompts (one per row):
a lion; a truck
an airplane; a cruise
a bathtub; a bookshelf
a bench; a cup of tea
a merry-go-round; a mouse

Scene Transition

Columns: First Frame, Last Frame, Prompts, Ours, DiffMorpher, TVG, SEINE, DynamiCrafter, GI

Prompts (one per row):
wide angle shot of an erupting volcano; close-up shot of hot lava
close-up shot of a blooming cherry tree; wide angle shot of an alien cherry blossom forest
overhead view of a bustling city street; aerial view of a futuristic city
close-up shot of a candle lit; wide angle shot of a mystical land
a wooden house in the forest; a wooden house in the snow

More Challenging Examples

Challenging Motion Prediction

Columns: First Frame, Last Frame, Prompts, Ours, DiffMorpher, TVG, SEINE, DynamiCrafter, GI

Prompts (one per row):
a dog jumps out of the swimming pool with a tennis ball in its mouth
a little girl is blowing bubbles
a jet plane is flying over a forest
a little boy and a little girl are playing with mud on the beach
a brown bear is walking slowly

Challenging Scene Transition

Columns: First Frame, Last Frame, Prompts, Ours, DiffMorpher, TVG, SEINE, DynamiCrafter, GI

Prompts (one per row):
a comic-style road with a yellowish hue; a highway in real world
a cartoon-style robot factory; a futuristic, high-tech robot that is painting

Motion Pattern Summary

Successful Cases

Columns: First Frame, Last Frame, Prompts, Annotated Frame, Annotation Types, VTG, Motion Pattern

Prompt: a camel is walking slowly to the right
Annotation types: camel's moving trajectory (red) & camera movement (green)
Motion pattern: straight object trajectory with small camera movement

Prompt: a dog jumps out of the swimming pool with a tennis ball in its mouth
Annotation types: dog's moving trajectory (red) & camera movement (green)
Motion pattern: curved object trajectory with small camera movement

Prompt: a dandelion is shaking
Annotation types: dandelion's shaking trajectory (red) & camera movement (green)
Motion pattern: shaking object trajectory with small camera movement

Prompt: a jet plane is flying over a forest
Annotation types: jet's moving trajectory (red) & changing background (blue) & camera movement (green)
Motion pattern: relative movement with changing background

Failed Cases

Columns: First Frame, Last Frame, Prompts, Annotated Frame, Annotation Types, VTG, Motion Pattern

Prompt: a car is turning
Annotation types: car's moving trajectory (red) & camera movement (green)
Motion pattern: complex object trajectory with large camera movement

Prompt: a man is skateboarding uphill
Annotation types: man's two-stage moving trajectory (red) & camera movement (green)
Motion pattern: composite object trajectory with large camera movement

Prompt: a car is moving in high speed
Annotation types: car's moving trajectory (red) & almost unchanged background (blue)
Motion pattern: relative movement with almost unchanged background

BibTeX


@inproceedings{yang2025versatile,
  title={Versatile Transition Generation with Image-to-Video Diffusion},
  author={Yang, Zuhao and Zhang, Jiahui and Yu, Yingchen and Lu, Shijian and Bai, Song},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}