This is the official implementation of the paper:
"TAME: Temporal-Aware Mixture-of-Experts for Text-Video Retrieval", published in IEEE Access (2026, Volume 14). [Paper Link]
Authors: Uicheol Jung, Juyoung Hong, Hojung Kwon, and Yukyung Choi
Recommended Environment
We recommend creating a dedicated conda environment with the following configuration:
- OS: Ubuntu 18.04.6 LTS
- CUDA: 11.7
- Python: 3.7.16
- PyTorch: 1.13.1+cu117
- Torchvision: 0.14.1+cu117
- GPU: 4 × NVIDIA RTX A6000
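For example, a matching environment can be created as follows (the environment name "tame" is only a placeholder):

# Create and activate the environment (name "tame" is a placeholder)
conda create -n tame python=3.7.16 -y
conda activate tame
# Install PyTorch 1.13.1 with CUDA 11.7 wheels
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 \
    --extra-index-url https://download.pytorch.org/whl/cu117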
Python Packages
pip install ftfy regex tqdm
pip install opencv-python boto3 requests pandas
For additional dependencies, please refer to requirements.txt.
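Assuming requirements.txt sits at the repository root, the remaining dependencies can be installed in one step:

pip install -r requirements.txt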
This project relies on five standard text–video datasets: MSR-VTT, MSVD, DiDeMo, LSMDC, and ActivityNet.
Please follow the instructions below to download the raw videos and obtain the official splits.
MSR-VTT
- Raw videos: download from the CVF dataset page
- Train/val/test split files:
  - Provided in the collaborative-experts repository under misc/datasets/msrvtt
MSVD
- Raw videos: available at the MSVD video description project page
- Train/val/test split files:
  - Reuse the splits from collaborative-experts under misc/datasets/msvd
DiDeMo
- Raw videos: follow the download instructions from the Localizing Moments in Video repository
- Splits and additional details:
  - See the DiDeMo README in collaborative-experts: misc/datasets/didemo/README.md
LSMDC
- Raw videos and annotations: you must obtain permission from MPII to download and use the data.
- Download page: https://sites.google.com/site/describingmovies/download
- Test set: 1,000 clips
- Splits and additional details:
  - Please refer to our paper and the dataloader implementation: dataloaders/dataloader_lsmdc_retrieval.py
ActivityNet
- Raw videos: the official ActivityNet website provides the full dataset via Google Drive and Baidu Drive mirrors.
- Download page: http://activity-net.org/download.html
- Train/val/test split files:
  - Reuse the splits from collaborative-experts under misc/datasets/activity-net (one way to fetch the split files is sketched after this list)
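A minimal sketch of one way to collect the split files, assuming the collaborative-experts repository is hosted at github.com/albanie/collaborative-experts and that [DATA_DIR] is your dataset root:

# Clone collaborative-experts and copy the split files listed above
git clone https://github.com/albanie/collaborative-experts.git
cp -r collaborative-experts/misc/datasets/msrvtt       [DATA_DIR]/
cp -r collaborative-experts/misc/datasets/msvd         [DATA_DIR]/
cp -r collaborative-experts/misc/datasets/didemo       [DATA_DIR]/
cp -r collaborative-experts/misc/datasets/activity-net [DATA_DIR]/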
To speed up training and evaluation, you can pre-compress the raw videos:
python preprocess/compress_video.py \
--input_root [RAW_VIDEO_DIR] \
--output_root [COMPRESSED_VIDEO_DIR]

Before running training and evaluation, make sure that all datasets (MSR-VTT, MSVD, DiDeMo, LSMDC, ActivityNet) have been properly downloaded and prepared.
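For instance, compressing the MSR-VTT videos might look like this (the directory names below are illustrative placeholders only):

python preprocess/compress_video.py \
    --input_root data/msrvtt/videos_raw \
    --output_root data/msrvtt/videos_compressed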
TAME is built on top of CLIP (ViT-B/32).
Download the official CLIP weights and place them under the modules/ directory:
wget -P ./modules \
https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
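The long hexadecimal string in the URL is the SHA-256 checksum of the file, so the download can be sanity-checked:

sha256sum modules/ViT-B-32.pt
# expected: 40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af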
Training and Evaluation Scripts
A pretrained TAME checkpoint is provided with the following configuration:
- Dataset: MSR-VTT (1K-A split)
- Backbone: CLIP ViT-B/32
Place it under:
ckpts/
  tame_msrvtt_vitb32.pth
The main training and evaluation pipelines can be launched via the shell scripts provided in the scripts/ directory.
MSR-VTT
# Training
sh scripts/MSRVTT_Train.sh
# Evaluation
sh scripts/MSRVTT_Eval.sh

MSVD
# Training
sh scripts/MSVD_Train.sh
# Evaluation
sh scripts/MSVD_eval.sh

DiDeMo
# Training
sh scripts/DiDeMo_Train.sh
# Evaluation
sh scripts/DiDeMo_Eval.sh

LSMDC
# Training
sh scripts/LSMDC_Train.sh
# Evaluation
sh scripts/LSMDC_Eval.sh

ActivityNet
# Training
sh scripts/ActivityNet_Train.sh
# Evaluation
sh scripts/ActivityNet_Eval.sh

Acknowledgements
The implementation of TAME relies on resources from CLIP, CLIP4Clip, and CLIP-MoE.
