This project implements and evaluates the Data-efficient Image Transformer (DeiT), a Vision Transformer (ViT) variant that improves performance through knowledge distillation. The goal is to compare DeiT with ViT and analyze its effectiveness for image classification on smaller datasets.
- DeiT vs ViT: Comparative analysis of performance.
- Distillation Token: Uses a ResNet-50 teacher model to guide training.
- Data Augmentation: Implements techniques like CutMix, MixUp, Horizontal Flip, and Random Erasing.
- Performance Metrics: Evaluates models using Accuracy, AUC, F1 Score, Precision, and Recall.
- CIFAR-10: 50,000 training images and 10,000 test images (32x32 resolution).
- Chosen for its well-labeled structure and availability of pre-trained models.
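The distillation-token objective mentioned above can be sketched as hard-label distillation in the style of the DeiT paper: the distillation token's output is trained against the teacher's predicted label, and the class token's output against the ground truth. The function and tensor shapes below are illustrative, not this project's exact code:

```python
# Minimal sketch of DeiT-style hard-label distillation, assuming the student
# produces separate logits for its class token and distillation token.
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Average of CE(class token, true label) and CE(dist token, teacher argmax)."""
    teacher_labels = teacher_logits.argmax(dim=1)             # hard teacher targets
    loss_cls = F.cross_entropy(cls_logits, labels)            # supervised term
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation term
    return 0.5 * loss_cls + 0.5 * loss_dist

# Toy usage with random logits for a batch of 4 CIFAR-10 images
torch.manual_seed(0)
cls_logits = torch.randn(4, 10)
dist_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)   # e.g. from a frozen ResNet-50 teacher
labels = torch.tensor([3, 1, 7, 0])
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```

The equal 0.5/0.5 weighting of the two terms follows the DeiT paper's hard-distillation formulation.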
Ensure you have Python 3.8+ installed. Install dependencies using:
```bash
pip install torch torchvision timm transformers numpy matplotlib
```

Train DeiT on CIFAR-10 using:

```bash
python train.py --dataset cifar10 --epochs 20 --batch_size 64 --lr 0.001
```

To train with a ResNet-50 teacher model:

```bash
python train.py --dataset cifar10 --distillation --teacher_model resnet50
```

Evaluate the trained model:

```bash
python evaluate.py --model deit --dataset cifar10
```

- ViT vs DeiT: DeiT outperforms vanilla ViT on CIFAR-10.
- Distillation Boost: Adding a teacher model improves F1 Score and AUC.
- Data Augmentation: Enhances accuracy and reduces overfitting.
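The metrics reported above (accuracy, precision, recall, F1) can be computed from raw label arrays with numpy alone; this is a self-contained sketch using macro averaging, not necessarily the averaging mode used in `evaluate.py`:

```python
# Macro-averaged classification metrics from integer label arrays (assumption:
# evaluate.py may use a different averaging mode; this sketch uses macro).
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Return accuracy and macro-averaged precision, recall and F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = float(np.mean(y_true == y_pred))
    per_class = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    prec, rec, f1 = np.mean(per_class, axis=0)
    return acc, float(prec), float(rec), float(f1)

# Toy check on six predictions over three classes
acc, prec, rec, f1 = classification_metrics([0, 0, 1, 1, 2, 2],
                                            [0, 1, 1, 1, 2, 0], num_classes=3)
```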
- High validation loss: Addressed using data augmentation.
- Slow CutMix and MixUp: Reduced the probability of applying them per batch to cut computation time.
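One way to cap the cost of these augmentations is to apply them to a batch only with some probability `p` rather than always. The sketch below shows this for MixUp; the probability, Beta parameter, and function name are illustrative, not this project's exact implementation:

```python
# Probabilistic MixUp: skip the augmentation entirely on most batches to save
# compute. `alpha` and `p` are illustrative defaults, not the project's values.
import torch

def maybe_mixup(images, labels_onehot, alpha=0.8, p=0.25):
    """With probability p, blend the batch with a shuffled copy of itself."""
    if torch.rand(1).item() >= p:
        return images, labels_onehot                 # skip: no extra compute
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1 - lam) * labels_onehot[perm]
    return mixed_images, mixed_labels

# Usage on a dummy batch of 4 CIFAR-sized images
torch.manual_seed(0)
imgs = torch.randn(4, 3, 32, 32)
labels = torch.eye(10)[torch.tensor([1, 2, 3, 4])]   # one-hot targets
mixed_imgs, mixed_labels = maybe_mixup(imgs, labels, p=1.0)  # force application
```

Each mixed label row still sums to 1, since the blend weights `lam` and `1 - lam` are complementary.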
- Extend to object detection tasks.
- Implement DeiT using TensorFlow/Keras.
- Improve interpretability with attention visualizations.
- Vignesh Ram Ramesh Kutti
- Aravind Balaji Srinivasan