Description
First off, huge thanks for putting so much effort into such great work.
From the paper, I understood that the Unlabeled Hybrid dataset consists of:
- K710 (~48.8% of the full dataset),
- SSV2 (12.5%),
- AVA (1.5%),
- WebVid2M (18.5%),
- Self-collected (18.5%).
As for the Labeled Hybrid, it only includes the K710 dataset (100%).
What I find ambiguous is whether you trained on the exact same K710 dataset twice: first at the pre-training stage, where K710 is used along with the other datasets, and then again at the post-pre-training stage.
Could you please explain the intuitive reasoning behind using the same dataset twice?
The only argument I could find is this:
...collecting multiple labeled video datasets and building a supervised hybrid dataset can act as a bridge between the large-scale unsupervised dataset and the small-scale downstream target dataset. Progressive fine-tuning of the pre-trained models through this labeled hybrid dataset could contribute to higher performance in the downstream tasks.
Have you also tried task-specific fine-tuning directly after the pre-training stage, without the intermediate post-pre-training stage?
P.S. Cited references 3 and 53 refer to the same publication, titled: UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH VIDEO UNIFORMER.