This project implements and evaluates the Data-efficient Image Transformer (DeiT), a Vision Transformer (ViT) variant that improves performance through knowledge distillation. The goal is to compare DeiT with ViT and analyze its effectiveness for image classification on smaller datasets.
- DeiT vs ViT: Comparative analysis of performance.
- Distillation Token: Uses a ResNet-50 teacher model to guide training.
- Data Augmentation: Implements techniques like CutMix, MixUp, Horizontal Flip, and Random Erasing.
- Performance Metrics: Evaluates models using Accuracy, AUC, F1 Score, Precision, and Recall.
- CIFAR-10: 50,000 training images and 10,000 test images (32x32 resolution).
- Chosen for its well-labeled structure and availability of pre-trained models.
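The distillation-token objective mentioned above can be sketched as hard-label distillation in the style of the DeiT paper: the distillation token's output is trained against the teacher's predicted label, and the class token's output against the ground truth. The function and tensor shapes below are illustrative, not this project's exact code:

```python
# Minimal sketch of DeiT-style hard-label distillation, assuming the student
# produces separate logits for its class token and distillation token.
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Average of CE(class token, true label) and CE(dist token, teacher argmax)."""
    teacher_labels = teacher_logits.argmax(dim=1)             # hard teacher targets
    loss_cls = F.cross_entropy(cls_logits, labels)            # supervised term
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation term
    return 0.5 * loss_cls + 0.5 * loss_dist

# Toy usage with random logits for a batch of 4 CIFAR-10 images
torch.manual_seed(0)
cls_logits = torch.randn(4, 10)
dist_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)   # e.g. from a frozen ResNet-50 teacher
labels = torch.tensor([3, 1, 7, 0])
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```

The equal 0.5/0.5 weighting of the two terms follows the DeiT paper's hard-distillation formulation.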
Ensure you have Python 3.8+ installed. Install dependencies using:
```bash
pip install torch torchvision timm transformers numpy matplotlib
```

Train DeiT on CIFAR-10 using:

```bash
python train.py --dataset cifar10 --epochs 20 --batch_size 64 --lr 0.001
```

To train with a ResNet-50 teacher model:

```bash
python train.py --dataset cifar10 --distillation --teacher_model resnet50
```

Evaluate the trained model:

```bash
python evaluate.py --model deit --dataset cifar10
```

- ViT vs DeiT: DeiT outperforms vanilla ViT on CIFAR-10.
- Distillation Boost: Adding a teacher model improves F1 Score and AUC.
- Data Augmentation: Enhances accuracy and reduces overfitting.
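The metrics reported above (accuracy, precision, recall, F1) can be computed from raw label arrays with numpy alone; this is a self-contained sketch using macro averaging, not necessarily the averaging mode used in `evaluate.py`:

```python
# Macro-averaged classification metrics from integer label arrays (assumption:
# evaluate.py may use a different averaging mode; this sketch uses macro).
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Return accuracy and macro-averaged precision, recall and F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = float(np.mean(y_true == y_pred))
    per_class = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    prec, rec, f1 = np.mean(per_class, axis=0)
    return acc, float(prec), float(rec), float(f1)

# Toy check on six predictions over three classes
acc, prec, rec, f1 = classification_metrics([0, 0, 1, 1, 2, 2],
                                            [0, 1, 1, 1, 2, 0], num_classes=3)
```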
- High validation loss: Addressed using data augmentation.
- Slow CutMix and MixUp: Reduced the probability of applying them per batch to cut computation time.
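One way to cap the cost of these augmentations is to apply them to a batch only with some probability `p` rather than always. The sketch below shows this for MixUp; the probability, Beta parameter, and function name are illustrative, not this project's exact implementation:

```python
# Probabilistic MixUp: skip the augmentation entirely on most batches to save
# compute. `alpha` and `p` are illustrative defaults, not the project's values.
import torch

def maybe_mixup(images, labels_onehot, alpha=0.8, p=0.25):
    """With probability p, blend the batch with a shuffled copy of itself."""
    if torch.rand(1).item() >= p:
        return images, labels_onehot                 # skip: no extra compute
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1 - lam) * labels_onehot[perm]
    return mixed_images, mixed_labels

# Usage on a dummy batch of 4 CIFAR-sized images
torch.manual_seed(0)
imgs = torch.randn(4, 3, 32, 32)
labels = torch.eye(10)[torch.tensor([1, 2, 3, 4])]   # one-hot targets
mixed_imgs, mixed_labels = maybe_mixup(imgs, labels, p=1.0)  # force application
```

Each mixed label row still sums to 1, since the blend weights `lam` and `1 - lam` are complementary.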
- Extend to object detection tasks.
- Implement DeiT using TensorFlow/Keras.
- Improve interpretability with attention visualizations.
- Vignesh Ram Ramesh Kutti
- Aravind Balaji Srinivasan