Unlabeled Hybrid and Labeled Hybrid dataset usage ambiguity in training stages

**First off**, huge thanks for the effort you put into such great work.

From the paper, I understood that the **Unlabeled Hybrid dataset** consists of:
1. **K710** (~48.8% of the full dataset),
2. **SSV2** (12.5%),
3. **AVA** (1.5%),
4. **WebVid2M** (18.5%),
5. **Self-collected** (18.5%).
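
As a quick sanity check on the shares listed above, here is a minimal sketch that tallies them. The dictionary keys are just my own labels, not identifiers from the codebase, and the figures are the ones I read from the paper. They add up to about 99.8%, so I assume they are rounded.

```python
# Shares of the Unlabeled Hybrid dataset as listed above (percent).
# Keys are informal labels (an assumption), not names from the repository.
unlabeled_hybrid_share = {
    "K710": 48.8,
    "SSV2": 12.5,
    "AVA": 1.5,
    "WebVid2M": 18.5,
    "Self-collected": 18.5,
}

# Sum the shares to see how close the listed figures come to 100%.
total = sum(unlabeled_hybrid_share.values())
print(f"total share: {total:.1f}%")
```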

The **Labeled Hybrid** dataset, by contrast, consists only of **K710** (100%).

My question is: did you train on the exact same K710 data twice?
First in the **pre-training stage**, where K710 is used along with the other datasets, and then again in the **post-pre-training stage**?
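
To make the overlap I am asking about concrete, here is a small sketch of my reading of the paper. The stage names and dataset labels are my own shorthand (assumptions, not identifiers from the repository), and the mapping may well be wrong, which is exactly what I would like to confirm.

```python
# My understanding of which datasets feed each training stage.
# Stage names and dataset labels are informal shorthand (assumptions).
stage_datasets = {
    "pre-training": {"K710", "SSV2", "AVA", "WebVid2M", "Self-collected"},
    "post-pre-training": {"K710"},
}

# Datasets that would be seen in both stages under this reading.
shared = stage_datasets["pre-training"] & stage_datasets["post-pre-training"]
print(sorted(shared))
```

If this mapping is right, K710 is the only dataset seen in both stages.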

Could you please explain the intuition behind using the same dataset twice?
I did find this argument:

> ...collecting multiple labeled video datasets and building a supervised hybrid dataset can act as a bridge between the large-scale unsupervised dataset and the small-scale downstream target dataset. Progressive fine-tuning of the pre-trained models through this labeled hybrid dataset could contribute to higher performance in the downstream tasks.

Have you also tried **specific fine-tuning** directly after the **pre-training stage**, without the intermediate **post-pre-training stage**?

P.S. Cited references 3 and 53 point to the same [publication](https://arxiv.org/pdf/2211.09552), titled _UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH VIDEO UNIFORMER_.
