
TAME: Temporal-Aware Mixture-of-Experts for Text-Video Retrieval

This is the official implementation of the paper:

"TAME: Temporal-Aware Mixture-of-Experts for Text-Video Retrieval", published in IEEE Access (2026, Volume 14). [Paper Link]

Authors: Uicheol Jung, Juyoung Hong, Hojung Kwon, and Yukyung Choi


Requirements

We recommend creating a dedicated conda environment:

Recommended Environment

  • OS: Ubuntu 18.04.6 LTS
  • CUDA: 11.7
  • Python: 3.7.16
  • PyTorch: 1.13.1+cu117
  • Torchvision: 0.14.1+cu117
  • GPU: 4 × NVIDIA RTX A6000

Python Packages

pip install ftfy regex tqdm
pip install opencv-python boto3 requests pandas

For additional dependencies, please refer to requirements.txt.
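The steps above can be combined into a single setup sequence. This is a sketch under assumptions: the environment name `tame` is hypothetical, and the PyTorch install command simply pins the versions from the recommended environment via the official cu117 wheel index.

```shell
# Hypothetical setup -- environment name "tame" is an assumption, not from the repo.
conda create -n tame python=3.7.16 -y
conda activate tame

# PyTorch 1.13.1 + Torchvision 0.14.1 built against CUDA 11.7,
# matching the recommended environment above.
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 \
    --extra-index-url https://download.pytorch.org/whl/cu117

# Packages listed above, plus the remaining dependencies.
pip install ftfy regex tqdm opencv-python boto3 requests pandas
pip install -r requirements.txt
```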

Data Preparation

This project relies on five standard text–video datasets: MSR-VTT, MSVD, DiDeMo, LSMDC, and ActivityNet.
Please follow the instructions below to download the raw videos and obtain the official splits.

  • MSR-VTT
  • MSVD
  • DiDeMo
  • LSMDC
  • ActivityNet
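Once the downloads finish, a quick sanity check can save a failed training run later. The sketch below is an assumption-laden helper, not part of the repository: the root `data/` and the per-dataset directory names are placeholders you should adjust to your own layout.

```python
import os

# Hypothetical layout -- both the root and the directory names below are
# assumptions; point them at wherever you actually placed each dataset.
DATA_ROOT = "data"
DATASETS = ["MSRVTT", "MSVD", "DiDeMo", "LSMDC", "ActivityNet"]

def missing_datasets(root, names):
    """Return the subset of expected dataset directories not found under root."""
    return [n for n in names if not os.path.isdir(os.path.join(root, n))]

if __name__ == "__main__":
    absent = missing_datasets(DATA_ROOT, DATASETS)
    if absent:
        print("Missing dataset directories:", ", ".join(absent))
    else:
        print("All dataset directories found.")
```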


Optional: Video Compression for Faster I/O

To speed up training and evaluation, you can pre-compress the raw videos:

python preprocess/compress_video.py \
  --input_root [RAW_VIDEO_DIR] \
  --output_root [COMPRESSED_VIDEO_DIR]

How to Run

1. Prepare the data

Before running training and evaluation, make sure that all datasets (MSR-VTT, MSVD, DiDeMo, LSMDC, ActivityNet) have been properly downloaded and prepared.

2. Download the pretrained CLIP checkpoint

TAME is built on top of CLIP (ViT-B/32).
Download the official CLIP weights and place them under the modules/ directory:

wget -P ./modules \
  https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
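OpenAI names each released CLIP checkpoint file after its SHA-256 digest, so the long hex string in the URL above can double as a checksum. The following sketch verifies the download against it; the checkpoint path simply mirrors the wget destination above and the script itself is not part of this repository.

```python
import hashlib
import os

# Expected digest, taken directly from the download URL above.
EXPECTED_SHA256 = "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af"

def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, streaming it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    path = "./modules/ViT-B-32.pt"  # matches the wget destination above
    if os.path.exists(path):
        digest = file_sha256(path)
        print("OK" if digest == EXPECTED_SHA256 else f"Checksum mismatch: {digest}")
    else:
        print(f"Checkpoint not found at {path}")
```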
3. Training and Evaluation Scripts

MSR-VTT (Text-to-Video Retrieval)

  • Dataset: MSR-VTT (1K-A split)
  • Backbone: CLIP ViT-B/32
  • Expected checkpoint layout:

ckpts/
  tame_msrvtt_vitb32.pth

The main training and evaluation pipelines can be launched via the shell scripts provided in the scripts/ directory.

MSR-VTT

# Training
sh scripts/MSRVTT_Train.sh
# Evaluation
sh scripts/MSRVTT_Eval.sh

MSVD

# Training
sh scripts/MSVD_Train.sh
# Evaluation
sh scripts/MSVD_eval.sh

DiDeMo

# Training
sh scripts/DiDeMo_Train.sh
# Evaluation
sh scripts/DiDeMo_Eval.sh

LSMDC

# Training
sh scripts/LSMDC_Train.sh
# Evaluation
sh scripts/LSMDC_Eval.sh

ActivityNet

# Training
sh scripts/ActivityNet_Train.sh
# Evaluation
sh scripts/ActivityNet_Eval.sh

Acknowledgments

The implementation of TAME relies on resources from CLIP, CLIP4Clip, and CLIP-MoE.
