Versatile Transition Generation with Image-to-Video Diffusion

Zuhao Yang¹*, Jiahui Zhang¹, Yingchen Yu², Shijian Lu¹✝, Song Bai²

¹Nanyang Technological University, Singapore
²ByteDance, Singapore
*This work was done while Zuhao Yang was interning at ByteDance.
✝Shijian Lu is the corresponding author.
ICCV 2025


Abstract

Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high-quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive prompts remains largely underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation that covers two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.

Versatile Transition Generation (VTG)

Versatile Transition Generation (VTG) performs four types of transition generation, namely (a) object morphing, (b) motion prediction, (c) concept blending, and (d) scene transition, within a single unified framework.

Training Framework of VTG

In Bidirectional Motion Prediction (BMP), the noisy latent is flipped along the temporal dimension, and self-attention maps are rotated 180 degrees to establish reversed motion–time correlations. Two U-Nets separately predict forward and backward motion, with the backward noise flipped again and fused into the forward noise to ensure a consistent motion path during iterative denoising. In Representation Alignment Regularization (RAR), each frame is spatially patchified independently, and the per-patch alignment losses are then aggregated across the temporal dimension.
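The tensor manipulations described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the (T, C, H, W) layout, the equal-weight fusion, and the cosine-based loss on raw pixel patches are all hypothetical choices. The source specifies only the temporal flip, the 180-degree rotation of attention maps, the flip-and-fuse step, and per-frame patchification with aggregation over time.

```python
import numpy as np


def flip_time(latent):
    """Reverse a latent of shape (T, C, H, W) along the temporal axis."""
    return latent[::-1].copy()


def rotate_attention_180(attn):
    """Rotate a (T, T) temporal self-attention map by 180 degrees.

    Flipping both axes moves entry (i, j) to (T-1-i, T-1-j), which is the
    attention pattern obtained when the frame order is reversed.
    """
    return attn[::-1, ::-1].copy()


def fuse_noise(eps_forward, eps_backward, weight=0.5):
    """Flip the backward prediction back to forward order and blend it in.

    The weighted average (default 0.5) is an assumption; the source only
    says the backward noise is flipped again and fused into the forward one.
    """
    return (1.0 - weight) * eps_forward + weight * flip_time(eps_backward)


def patchify(frame, patch):
    """Split one (C, H, W) frame into (N, C*patch*patch) non-overlapping patches."""
    c, h, w = frame.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = frame.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (gh, gw, C, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)


def alignment_loss(pred, ref, patch, eps=1e-8):
    """Per-patch (1 - cosine similarity), averaged over patches, then frames.

    Raw patches stand in for whatever learned representations are aligned
    in practice; the cosine form of the loss is an assumption.
    """
    per_frame = []
    for f_pred, f_ref in zip(pred, ref):  # each frame is patchified independently
        p, r = patchify(f_pred, patch), patchify(f_ref, patch)
        norms = np.linalg.norm(p, axis=-1) * np.linalg.norm(r, axis=-1)
        cos = (p * r).sum(axis=-1) / (norms + eps)
        per_frame.append((1.0 - cos).mean())
    return float(np.mean(per_frame))  # aggregate across the temporal dimension
```

In this sketch, fusing a forward prediction with its own time-reversed copy returns the forward prediction unchanged, which is a convenient sanity check that the two flips cancel.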

Inference Framework of VTG

Our interpolation-based initialization features three components: ① Interpolated Noise Injection, ② LoRA Interpolation, and ③ Frame-aware Text Interpolation. VTG first converts the two encoded input frames into latent noises via DDIM inversion. Next, it interpolates between those two latent noises and concatenates the intermediate noises along the temporal dimension. To capture meaningful semantics and enable smooth transitions between conceptually different objects, we employ both LoRA interpolation and text interpolation.
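The three initialization components above can be illustrated with a short NumPy sketch. It assumes the two inverted latents are already available (DDIM inversion itself is not shown), uses spherical interpolation for the noises, and uses plain linear blending for the LoRA weights and text embeddings. The source says only that these quantities are interpolated, so the interpolation formulas, function names, shapes, and the evenly spaced per-frame coefficients are hypothetical.

```python
import numpy as np


def lerp(x0, x1, alpha):
    """Linear blend between two arrays of equal shape."""
    return (1.0 - alpha) * x0 + alpha * x1


def slerp(z0, z1, alpha, eps=1e-7):
    """Spherical interpolation between two noise tensors of equal shape."""
    a, b = z0.ravel(), z1.ravel()
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    s = np.sin(theta)
    if s < 1e-4:  # nearly (anti-)parallel: fall back to linear interpolation
        return lerp(z0, z1, alpha)
    return (np.sin((1.0 - alpha) * theta) / s) * z0 + (np.sin(alpha * theta) / s) * z1


def interpolated_noise(z_first, z_last, num_frames):
    """Interpolate between two latents and stack along a new temporal axis."""
    alphas = np.linspace(0.0, 1.0, num_frames)  # one coefficient per frame
    frames = [slerp(z_first, z_last, a) for a in alphas]
    return np.stack(frames, axis=0), alphas  # (T, C, H, W), (T,)


def frame_text_embeddings(e_first, e_last, alphas):
    """One blended text embedding per frame (assumed frame-aware scheme)."""
    return np.stack([lerp(e_first, e_last, a) for a in alphas], axis=0)


# Toy inputs standing in for inverted latents, text embeddings, and LoRA deltas.
rng = np.random.default_rng(0)
z_first, z_last = rng.standard_normal((2, 4, 8, 8))
e_first, e_last = rng.standard_normal((2, 5, 8))
lora_first, lora_last = rng.standard_normal((2, 8, 8))

noise, alphas = interpolated_noise(z_first, z_last, num_frames=16)
text = frame_text_embeddings(e_first, e_last, alphas)
lora_mid = lerp(lora_first, lora_last, 0.5)  # LoRA weights at the midpoint
```

Spherical interpolation is chosen here because it keeps the norm of interpolated Gaussian noise closer to that of the endpoints than a linear blend does; whether VTG uses it is not stated in the source.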

Visual Comparisons on Four Representative Transition Tasks

Object Morphing

First Frame | Last Frame | Prompts | Ours | DiffMorpher | TVG | SEINE | DynamiCrafter

a cat is sitting and facing to the right; a dog is sitting and facing forward

an alpaca is facing forward; an alpaca is facing to the right

a dog wearing sunglasses is facing forward; a dog wearing sunglasses is facing to the right

a sculpture; a sculpture has turned his head a bit to the right

a boy is facing forward; a boy is turning his head to the right and smiling

Motion Prediction

First Frame | Last Frame | Prompts | Ours | DiffMorpher | TVG | SEINE | DynamiCrafter

a camel is walking slowly to the right

a woman is swinging on a swing

a woman is riding a horse

a man is surfing

a woman is walking

Concept Blending

First Frame | Last Frame | Prompts | Ours | DiffMorpher | TVG | SEINE | DynamiCrafter

a lion; a truck

an airplane; a cruise

a bathtub; a bookshelf

a bench; a cup of tea

a merry-go-round; a mouse

Scene Transition

First Frame | Last Frame | Prompts | Ours | DiffMorpher | TVG | SEINE | DynamiCrafter

wide angle shot of an erupting volcano; close-up shot of hot lava

close-up shot of a blooming cherry tree; wide angle shot of an alien cherry blossom forest

overhead view of a bustling city street; aerial view of a futuristic city

close-up shot of a candle lit; wide angle shot of a mystical land

a wooden house in the forest; a wooden house in the snow

More Challenging Examples

Challenging Motion Prediction

First Frame | Last Frame | Prompts | Ours | DiffMorpher | TVG | SEINE | DynamiCrafter

a dog jumps out of the swimming pool with a tennis ball in its mouth

a little girl is blowing bubbles

a jet plane is flying over a forest

a little boy and a little girl are playing with mud on the beach

a brown bear is walking slowly

Challenging Scene Transition

First Frame | Last Frame | Prompts | Ours | DiffMorpher | TVG | SEINE | DynamiCrafter

a comic-style road with a yellowish hue; a highway in real world

a cartoon-style robot factory; a futuristic, high-tech robot that is painting

Motion Pattern Summary

Successful Cases

First Frame | Last Frame | Prompts | Annotated Frame | Annotation Types | VTG | Motion Pattern

Prompt: a camel is walking slowly to the right
Annotation types: camel's moving trajectory (red) & camera movement (green)
Motion pattern: straight object trajectory with small camera movement

Prompt: a dog jumps out of the swimming pool with a tennis ball in its mouth
Annotation types: dog's moving trajectory (red) & camera movement (green)
Motion pattern: curved object trajectory with small camera movement

Prompt: a dandelion is shaking
Annotation types: dandelion's shaking trajectory (red) & camera movement (green)
Motion pattern: shaking object trajectory with small camera movement

Prompt: a jet plane is flying over a forest
Annotation types: jet's moving trajectory (red) & changing background (blue) & camera movement (green)
Motion pattern: relative movement with changing background

Failed Cases

First Frame | Last Frame | Prompts | Annotated Frame | Annotation Types | VTG | Motion Pattern

Prompt: a car is turning
Annotation types: car's moving trajectory (red) & camera movement (green)
Motion pattern: complex object trajectory with large camera movement

Prompt: a man is skateboarding uphill
Annotation types: man's two-stage moving trajectory (red) & camera movement (green)
Motion pattern: composite object trajectory with large camera movement

Prompt: a car is moving in high speed
Annotation types: car's moving trajectory (red) & almost unchanged background (blue)
Motion pattern: relative movement with almost unchanged background

BibTeX


@inproceedings{yang2025versatile,
  title={Versatile Transition Generation with Image-to-Video Diffusion},
  author={Yang, Zuhao and Zhang, Jiahui and Yu, Yingchen and Lu, Shijian and Bai, Song},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={16981--16990},
  year={2025}
}