This project addresses the continual learning and unsupervised domain adaptation challenges posed in CS771 Mini-Project 2. We tackle the problem of learning from sequential datasets while preventing catastrophic forgetting, using Learning with Prototypes (LwP) as the base classifier with novel prototype update mechanisms.
- 20 sequential datasets derived from CIFAR-10
- Task 1: Datasets Dβ-Dββ from the same distribution p(x)
- Task 2: Datasets Dββ-Dββ from different but related distributions
- Constraint: Only Dβ is labeled; all others are unlabeled
- Goal: Maintain performance on previous datasets while adapting to new ones
- Akshat Sharma (230101)
- Dweep Joshipura (230395)
- Kanak Khandelwal (230520)
- Praneel B Satare (230774)
For sequential datasets from the same distribution, we developed a mathematically principled update mechanism:
Key Characteristics:
- Ξ± = 0.2 (optimally tuned): Balances old knowledge retention vs. new data adaptation
- Prevents catastrophic forgetting: Maintains 98%+ accuracy across all previous datasets
- Interpretable: Ξ± β β preserves old prototypes, Ξ± β 0 uses only new data
For datasets with distribution shifts, we introduced an unsupervised adaptation method:
Novel Approach:
- Class-aware K-means: Initialize cluster centers with previous prototypes
- Automatic adaptation: Clusters adjust to new data distributions
- Balanced update: Ξ² = 1 equally weighs old prototypes and new centroids
Given the constraint of not using CIFAR-trained models, we explored ImageNet pre-trained extractors:
| Model | Feature Dim | Accuracy on Dβ | Selected |
|---|---|---|---|
| ResNet | 2048 | 84.12% | β |
| MobileNetv3 | 960 | 83.72% | β |
| CaiT-M36 | 768 | 94.20% | β |
| ViT-Base | 768 | 96.52% | β |
| Eva02-Base | 768 | 96.88% | β |
| BEiT-Large | 1024 | 98.72% | β |
Initial experiments without feature extraction showed the necessity of our approach:
| Method | Training Accuracy |
|---|---|
| LwP (Euclidean, Raw) | 29.04% |
| LwP (Mahalanobis, Raw) | 9.52% |
| LwP (Euclidean, PCA-50) | 28.56% |
| LwP (Mahalanobis, PCA-50) | 41.20% |
Performance Matrix: Models fβ to fββ on held-out datasets DΜβ to DΜββ
| Model | DΜβ | DΜβ | DΜβ | DΜβ | DΜβ | DΜβ | DΜβ | DΜβ | DΜβ | DΜββ |
|---|---|---|---|---|---|---|---|---|---|---|
| fβ | 98.32% | β | β | β | β | β | β | β | β | β |
| fβ | 98.36% | 97.84% | β | β | β | β | β | β | β | β |
| fβ | 98.16% | 97.76% | 98.16% | β | β | β | β | β | β | β |
| fβ | 98.16% | 97.76% | 98.04% | 97.92% | β | β | β | β | β | β |
| fβ | 98.20% | 97.68% | 97.96% | 98.00% | 97.92% | β | β | β | β | β |
| fβ | 98.12% | 97.84% | 98.00% | 97.92% | 97.96% | 98.40% | β | β | β | β |
| fβ | 98.16% | 97.72% | 97.92% | 97.92% | 98.00% | 98.36% | 97.40% | β | β | β |
| fβ | 98.00% | 97.76% | 97.80% | 97.84% | 97.88% | 98.36% | 97.36% | 97.56% | β | β |
| fβ | 98.08% | 97.72% | 97.84% | 97.96% | 97.84% | 98.28% | 97.36% | 97.60% | 97.68% | β |
| fββ | 98.04% | 97.76% | 97.88% | 97.96% | 97.84% | 98.28% | 97.36% | 97.60% | 97.64% | 97.88% |
Key Achievement: β No catastrophic forgetting - consistent ~98% accuracy across all datasets
Performance Matrix: Models fββ to fββ on held-out datasets DΜββ to DΜββ
| Model | DΜββ | DΜββ | DΜββ | DΜββ | DΜββ | DΜββ | DΜββ | DΜββ | DΜββ | DΜββ |
|---|---|---|---|---|---|---|---|---|---|---|
| fββ | 90.36% | β | β | β | β | β | β | β | β | β |
| fββ | 90.36% | 75.92% | β | β | β | β | β | β | β | β |
| fββ | 90.36% | 75.92% | 93.56% | β | β | β | β | β | β | β |
| fββ | 90.36% | 75.92% | 93.56% | 97.28% | β | β | β | β | β | β |
| fββ | 90.36% | 75.92% | 93.56% | 97.28% | 97.92% | β | β | β | β | β |
| fββ | 90.36% | 75.92% | 93.56% | 97.28% | 97.92% | 94.56% | β | β | β | β |
| fββ | 90.36% | 75.92% | 93.56% | 97.28% | 97.92% | 94.56% | 94.56% | β | β | β |
| fββ | 90.36% | 75.92% | 93.56% | 97.28% | 97.92% | 94.56% | 94.56% | 91.32% | β | β |
| fββ | 90.36% | 75.92% | 93.56% | 97.28% | 97.92% | 94.56% | 94.56% | 91.32% | 76.48% | β |
| fββ | 90.36% | 75.92% | 93.56% | 97.28% | 97.92% | 94.56% | 94.56% | 91.32% | 76.48% | 97.24% |
Key Achievement: β Successful domain adaptation - ~5% average improvement through clustering-based updates
- Hardware: Kaggle P100 GPU
- Feature Extraction Time: 2h 38m for all datasets
- Memory: Efficient prototype storage (~10KB per model)
- Ξ± tuning: Grid search over [0.1, 0.2, 0.3, ..., 2.0]
- Optimal Ξ±: 0.2 (maximizes fββ accuracy on DΜβ)
- Ξ² selection: Set to 1.0 based on equal weighting heuristic
β
No CIFAR-trained models: Used ImageNet pre-trained BEiT-Large
β
Same model size: Consistent prototype dimensions across updates
β
No labeled data: Only Dβ labels used; rest are pseudo-labels
β
LwP requirement: Base classifier remains Learning with Prototypes
CS771-Project-2/
βββ notebooks/
β βββ task1_sequential_learning.ipynb # Task 1 implementation
β βββ task2_domain_adaptation.ipynb # Task 2 implementation
β βββ feature_extraction.ipynb # BEiT feature extraction
βββ docs/
β βββ report.pdf # LaTeX project report
βββ README.md
As required by the project, we presented a detailed review of the below paper in a YouTube video.
- Deja Vu: Continual Model Generalization for Unseen Domains (ICLR 2023)
YouTube Presentation: Deja Vu: Continual Model Generalization for Unseen Domains
- Weighted Updates: The Ξ± parameter creates a principled balance between stability and plasticity
- Feature Quality: BEiT-Large provides rich, transferable representations
- Class-Aware Clustering: Initializing with prototypes maintains class structure
- Unsupervised Adaptation: No need for labeled data in new domains
- Dataset Dββ Challenge: Poor cluster-class correlation affects performance
- Ξ² Selection: Currently heuristic; could benefit from adaptive methods
- Scalability: Limited to prototype-based methods per project constraints
| Metric | Task 1 | Task 2 |
|---|---|---|
| Average Accuracy | 97.8% | 89.4% |
| Catastrophic Forgetting | β Prevented | β Minimal |
| Domain Adaptation | N/A | +5% improvement |
| Computational Efficiency | β Prototype-based | β K-means clustering |
β
Successfully prevented catastrophic forgetting in sequential learning
β
Developed novel weighted prototype updates with theoretical foundation
β
Achieved effective domain adaptation without labeled target data
β
Maintained consistent model size across all updates
β
Comprehensive evaluation with detailed accuracy matrices
β
Efficient implementation suitable for resource-constrained environments
- Course: CS771 - Introduction to Machine Learning, IIT Kanpur, Autumn 2024
- Problem Statement: Mini-Project 2 - Continual Learning with LwP
- BEiT: Bao, H., Dong, L., Piao, S., & Wei, F. (2022). BEiT: BERT Pre-training of Image Transformers
- Domain Adaptation: Fernando, B., et al. (2014). Subspace alignment for domain adaptation
- Clustering Methods: Dridi, J., et al. (2024). Unsupervised clustering-based domain adaptation
For questions about this implementation or the CS771 course project:
- Course Instructor: Course Website
- Team Contact: Create an issue
This project demonstrates practical solutions to continual learning challenges while adhering to the constraints and requirements of CS771 Mini-Project 2. The proposed methods show promise for real-world applications where models must adapt to new data distributions without forgetting previous knowledge.