This is the official implementation of the paper:
"TAME: Temporal-Aware Mixture-of-Experts for Text-Video Retrieval", published in IEEE Access (2026, Volume 14). [Paper Link]
Authors: Uicheol Jung, Juyoung Hong, Hojung Kwon, and Yukyung Choi
Recommended Environment
We recommend creating a dedicated conda environment with the following configuration:
- OS: Ubuntu 18.04.6 LTS
- CUDA: 11.7
- Python: 3.7.16
- PyTorch: 1.13.1+cu117
- Torchvision: 0.14.1+cu117
- GPU: 4 × NVIDIA RTX A6000
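For example, a matching environment can be created as follows (the environment name "tame" is only a placeholder):

# Create and activate the environment (name "tame" is a placeholder)
conda create -n tame python=3.7.16 -y
conda activate tame
# Install PyTorch 1.13.1 with CUDA 11.7 wheels
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 \
    --extra-index-url https://download.pytorch.org/whl/cu117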
Python Packages
pip install ftfy regex tqdm
pip install opencv-python boto3 requests pandas
For additional dependencies, please refer to requirements.txt.
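Assuming requirements.txt sits at the repository root, the remaining dependencies can be installed in one step:

pip install -r requirements.txt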
This project relies on five standard text–video datasets: MSR-VTT, MSVD, DiDeMo, LSMDC, and ActivityNet.
Please follow the instructions below to download the raw videos and obtain the official splits.
MSR-VTT
- Raw videos: download from the CVF dataset page
- Train/val/test split files:
  - Provided in the collaborative-experts repository under misc/datasets/msrvtt
MSVD
- Raw videos: available at the MSVD video description project page
- Train/val/test split files:
  - Reuse the splits from collaborative-experts under misc/datasets/msvd
DiDeMo
- Raw videos: follow the download instructions from the Localizing Moments in Video repository
- Splits and additional details:
  - See the DiDeMo README in collaborative-experts: misc/datasets/didemo/README.md
LSMDC
- Raw videos and annotations: you must obtain permission from MPII to download and use the data.
- Download page: https://sites.google.com/site/describingmovies/download
- Test set: 1,000 clips
- Splits and additional details:
  - Please refer to our paper and the dataloader implementation: dataloaders/dataloader_lsmdc_retrieval.py
ActivityNet
- Raw videos: the official ActivityNet website provides the full dataset via Google Drive and Baidu Drive mirrors.
- Download page: http://activity-net.org/download.html
- Train/val/test split files:
  - Reuse the splits from collaborative-experts under misc/datasets/activity-net (one way to fetch the split files is sketched after this list)
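A minimal sketch of one way to collect the split files, assuming the collaborative-experts repository is hosted at github.com/albanie/collaborative-experts and that [DATA_DIR] is your dataset root:

# Clone collaborative-experts and copy the split files listed above
git clone https://github.com/albanie/collaborative-experts.git
cp -r collaborative-experts/misc/datasets/msrvtt       [DATA_DIR]/
cp -r collaborative-experts/misc/datasets/msvd         [DATA_DIR]/
cp -r collaborative-experts/misc/datasets/didemo       [DATA_DIR]/
cp -r collaborative-experts/misc/datasets/activity-net [DATA_DIR]/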
To speed up training and evaluation, you can pre-compress the raw videos:
python preprocess/compress_video.py \
--input_root [RAW_VIDEO_DIR] \
--output_root [COMPRESSED_VIDEO_DIR]

Before running training and evaluation, make sure that all datasets (MSR-VTT, MSVD, DiDeMo, LSMDC, ActivityNet) have been properly downloaded and prepared.
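For instance, compressing the MSR-VTT videos might look like this (the directory names below are illustrative placeholders only):

python preprocess/compress_video.py \
    --input_root data/msrvtt/videos_raw \
    --output_root data/msrvtt/videos_compressed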
TAME is built on top of CLIP (ViT-B/32).
Download the official CLIP weights and place them under the modules/ directory:
wget -P ./modules \
https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
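The long hexadecimal string in the URL is the SHA-256 checksum of the file, so the download can be sanity-checked:

sha256sum modules/ViT-B-32.pt
# expected: 40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af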
Training and Evaluation Scripts
A pretrained TAME checkpoint is provided with the following configuration:
- Dataset: MSR-VTT (1K-A split)
- Backbone: CLIP ViT-B/32
Place it under:
ckpts/
  tame_msrvtt_vitb32.pth
The main training and evaluation pipelines can be launched via the shell scripts provided in the scripts/ directory.
MSR-VTT
# Training
sh scripts/MSRVTT_Train.sh
# Evaluation
sh scripts/MSRVTT_Eval.sh

MSVD
# Training
sh scripts/MSVD_Train.sh
# Evaluation
sh scripts/MSVD_eval.sh

DiDeMo
# Training
sh scripts/DiDeMo_Train.sh
# Evaluation
sh scripts/DiDeMo_Eval.sh

LSMDC
# Training
sh scripts/LSMDC_Train.sh
# Evaluation
sh scripts/LSMDC_Eval.sh

ActivityNet
# Training
sh scripts/ActivityNet_Train.sh
# Evaluation
sh scripts/ActivityNet_Eval.sh

Acknowledgements
The implementation of TAME relies on resources from CLIP, CLIP4Clip, and CLIP-MoE.
