TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, Song Bai

¹Nanyang Technological University, Singapore
²ByteDance, Singapore
*This work was done while Zuhao Yang was interning at ByteDance.

Shijian Lu is the corresponding author.

ICCV 2025

Abstract

Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical, static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation are fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, while improving computational efficiency. This design enables precise handling of each subtask and improves event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
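To make the routing idea concrete, here is a minimal PyTorch sketch of a task-aware MoE layer whose gate conditions on a token-type embedding (time, score, or text), so that different kinds of task tokens can develop different expert preferences. The class names, hyperparameters, and gating details are illustrative assumptions, not TimeExpert's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """One feed-forward expert, shaped like a standard Transformer MLP block."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)

class TaskAwareMoE(nn.Module):
    """Route each token to its top-k experts; the gate also sees a token-type
    embedding (0=time, 1=score, 2=text), a hypothetical stand-in for
    task-aware expert allocation."""
    def __init__(self, d_model=768, d_hidden=3072, n_experts=8, k=2, n_types=3):
        super().__init__()
        self.experts = nn.ModuleList(ExpertFFN(d_model, d_hidden) for _ in range(n_experts))
        self.type_emb = nn.Embedding(n_types, d_model)
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x, token_type):
        # x: (B, S, d_model); token_type: (B, S) int64
        logits = self.gate(x + self.type_emb(token_type))
        weights, idx = logits.topk(self.k, dim=-1)   # (B, S, k)
        weights = F.softmax(weights, dim=-1)         # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # simple loop for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route a batch of 16-token sequences with random token types.
moe = TaskAwareMoE()
x = torch.randn(2, 16, 768)
token_type = torch.randint(0, 3, (2, 16))
print(moe(x, token_type).shape)  # torch.Size([2, 16, 768])
```

In a sketch like this, the type embedding biases the gate so that time, score, and text tokens gravitate toward different experts during training; the actual TimeExpert design additionally makes the expert set itself dynamic, as discussed below.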

Video Temporal Grounding (VTG)

Task Definition of TimeExpert.

Left: Video Temporal Grounding (VTG) is a fine-grained video understanding task that aims to accurately localize event content, along with its timestamps, based on natural-language queries. In this work, we mainly consider three major types of VTG tasks: (1) Moment Retrieval (MR), (2) Video Highlight Detection (VHD), and (3) Dense Video Captioning (DVC). The outputs of VTG typically contain textual captions, timestamps, and saliency scores. Right: Unlike existing methods (e.g., TimeChat) that employ a single static model, and motivated by expert specialization on different task tokens, we propose TimeExpert, an expert-guided Video LLM with dynamic token routing. Through task-aware expert allocation, TimeExpert demonstrates substantial improvements over state-of-the-art Video-LLMs on several VTG benchmarks. For example, here we visualize the zero-shot F1 score for DVC on the YouCook2 dataset, R@1 (IoU=0.7) for MR on the Charades-STA dataset, and HIT@1 for VHD on the QVHighlights dataset. More results and analysis are provided in the experimental section.
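Because a VTG response is a structured sequence of (timestamp, saliency, caption) events, downstream use typically requires parsing the generated text back into event records. The sketch below assumes a hypothetical serialization of the form "<start>s - <end>s, score: <s>, <caption>"; TimeExpert's exact output template may differ.

```python
import re
from dataclasses import dataclass

@dataclass
class Event:
    start: float     # event start, in seconds
    end: float       # event end, in seconds
    saliency: float  # predicted saliency score
    caption: str     # textual description

# Assumed output template: "<start>s - <end>s, score: <s>, <caption>"
PATTERN = re.compile(
    r"(?P<start>\d+(?:\.\d+)?)s\s*-\s*(?P<end>\d+(?:\.\d+)?)s,\s*"
    r"score:\s*(?P<score>\d+(?:\.\d+)?),\s*(?P<caption>[^\n]+)"
)

def parse_vtg_response(text: str) -> list[Event]:
    """Extract every event tuple from a model response, one event per line."""
    return [
        Event(float(m["start"]), float(m["end"]), float(m["score"]),
              m["caption"].strip())
        for m in PATTERN.finditer(text)
    ]

response = ("0.0s - 12.5s, score: 3.8, crack the eggs into a bowl.\n"
            "12.5s - 30.0s, score: 4.2, whisk the eggs with milk.")
for event in parse_vtg_response(response):
    print(event)
```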

Comparison across VTG Approaches

Comparison of VTG Approaches.

(a): A VTG-specific Video-LLM relies on a single static model with shared parameters for all tasks, limiting its ability to specialize across diverse VTG subtasks. (b): Vanilla MoE improves upon this by activating a fixed number (e.g., k=2) of experts per token, enabling a certain degree of task specialization. (c): Our TimeExpert goes further, implementing adaptive routing that dynamically allocates new experts when tokens lack suitable matches and prunes unmatched experts when necessary. This dynamic design significantly improves computational efficiency while achieving superior specialization, especially for VTG subtasks that require distinct feature representations.
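The allocate-and-prune behavior in (c) can be sketched as follows: each expert keeps a routing key, a new expert is spawned when tokens find no sufficiently similar key, and experts whose routing mass decays toward zero are pruned. The similarity measure, the thresholds, and the class name `AdaptiveExpertPool` are illustrative assumptions, not the paper's exact rule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveExpertPool(nn.Module):
    """Maintain a growable and shrinkable pool of expert routing keys."""
    def __init__(self, d_model=768, init_experts=4,
                 spawn_thresh=0.3, prune_thresh=0.01):
        super().__init__()
        self.keys = nn.ParameterList(
            nn.Parameter(torch.randn(d_model)) for _ in range(init_experts))
        self.usage = [0.0] * init_experts  # EMA of each expert's routing mass
        self.spawn_thresh = spawn_thresh
        self.prune_thresh = prune_thresh

    @torch.no_grad()
    def route(self, tokens):  # tokens: (N, d_model)
        keys = torch.stack(list(self.keys))                    # (E, d_model)
        sims = F.cosine_similarity(tokens.unsqueeze(1),
                                   keys.unsqueeze(0), dim=-1)  # (N, E)
        best_sim, best_idx = sims.max(dim=-1)
        # Spawn: tokens with no good match seed one new expert key.
        unmatched = best_sim < self.spawn_thresh
        if unmatched.any():
            self.keys.append(nn.Parameter(tokens[unmatched].mean(0).clone()))
            self.usage.append(0.0)
            best_idx[unmatched] = len(self.keys) - 1
        # Track how much routing mass each expert receives.
        for e in range(len(self.usage)):
            frac = (best_idx == e).float().mean().item()
            self.usage[e] = 0.9 * self.usage[e] + 0.1 * frac
        return best_idx

    def prune(self):
        """Drop experts whose routing mass has decayed below the threshold."""
        keep = [i for i, u in enumerate(self.usage) if u >= self.prune_thresh]
        self.keys = nn.ParameterList(self.keys[i] for i in keep)
        self.usage = [self.usage[i] for i in keep]
```

A real system would also migrate expert weights (not just routing keys) and balance load across experts; this sketch only illustrates the control flow of growing and shrinking the expert set.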

Architecture Overview of TimeExpert

Framework Overview of TimeExpert.

Our model leverages independent encoders and decoding heads to process time, score, and text inputs and outputs. The timestamps and saliency scores of sampled frames are encoded into special tokens and integrated into the corresponding visual tokens. During inference, the generated response follows a structured format, sequentially incorporating time tokens, score tokens, and text tokens.
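A minimal sketch of this interface follows, assuming small MLP encoders that fold each sampled frame's timestamp and saliency score into its visual token, plus independent heads that decode time, score, and text positions from the LLM's hidden states. Module names and dimensions are hypothetical, not TimeExpert's actual implementation.

```python
import torch
import torch.nn as nn

class TimeScoreTextInterface(nn.Module):
    """Separate encoders on the input side, separate decoding heads on the
    output side; the backbone LLM in between is omitted for brevity."""
    def __init__(self, d_model=768, vocab_size=32000):
        super().__init__()
        self.time_enc = nn.Sequential(nn.Linear(1, d_model), nn.GELU(),
                                      nn.Linear(d_model, d_model))
        self.score_enc = nn.Sequential(nn.Linear(1, d_model), nn.GELU(),
                                       nn.Linear(d_model, d_model))
        self.time_head = nn.Linear(d_model, 1)           # regress a timestamp
        self.score_head = nn.Linear(d_model, 1)          # regress a saliency score
        self.text_head = nn.Linear(d_model, vocab_size)  # next-token logits

    def fuse(self, visual_tokens, timestamps, scores):
        # visual_tokens: (B, F, d); timestamps, scores: (B, F) per sampled frame
        return (visual_tokens
                + self.time_enc(timestamps.unsqueeze(-1))
                + self.score_enc(scores.unsqueeze(-1)))

    def decode(self, hidden, token_type):
        # Dispatch each output position to its head (0=time, 1=score, 2=text),
        # mirroring the structured time -> score -> text generation order.
        if token_type == 0:
            return self.time_head(hidden)
        if token_type == 1:
            return self.score_head(hidden)
        return self.text_head(hidden)
```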

BibTeX


@inproceedings{yang2025timeexpert,
  title={TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding},
  author={Yang, Zuhao and Yu, Yingchen and Zhao, Yunqing and Lu, Shijian and Bai, Song},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}