- Team: Basel Alzahrani
- Supervisor Name: Dr. Muzammil Behzad
- Affiliations: KFUPM
Large-scale vision-language models such as CLIP have demonstrated strong zero-shot recognition ability by aligning image and text representations. However, manually designed prompts are often suboptimal for downstream tasks, especially under few-shot learning settings where only limited labeled samples are available.
Prompt learning methods such as CoOp and CoCoOp improve adaptation by learning context tokens while keeping the pretrained CLIP backbone frozen. Although effective, these methods often rely only on class names and may produce unstable predictions under low-data supervision.
This project proposes Attribute-Aware Consistency Prompt Learning (AACPL), a lightweight extension of CoCoOp that improves prompt learning through richer semantic descriptions and dual-branch consistency regularization.
We:
- Enrich class prompts using semantic attributes.
- Learn two complementary prompt branches.
- Apply KL consistency loss between prompt branches.
- Use probability ensembling and temperature calibration at inference.
The goal is to improve few-shot classification accuracy while preserving the frozen visual encoder.
Few-shot prompt learning methods face several limitations:
- Using only class names may not fully exploit CLIP’s language prior.
- Different prompt variants may produce inconsistent predictions under few-shot training.
Can we improve CoCoOp without modifying the visual backbone or adding heavy multimodal modules?
We evaluate whether semantic prompt enrichment and consistency learning can improve OxfordPets accuracy under the 16-shot setting.
This project belongs to:
- Computer Vision
- Vision-Language Models
- Prompt Learning
- Few-Shot Learning
- Transfer Learning
Applications include:
- Fine-grained image recognition
- Low-data domain adaptation
- Efficient deployment of pretrained models
- Lightweight adaptation for edge systems
The reference paper, CoCoOp, proposes conditional prompt learning: it generates instance-conditioned prompt tokens from image features.
This project extends that idea through:
Proposed AACPL Improvements
- Attribute-Aware Prompt Construction
Instead of: Persian
We use: Persian, fluffy long-haired flat-faced cat
This gives richer semantic supervision.
- Dual Prompt Branches
We use two prompt templates:
Template 1: a photo of a
Template 2: a blurry photo of a
Each branch predicts independently.
- KL Consistency Regularization
Branch 2 is encouraged to match Branch 1 predictions using one-way KL divergence.
- Improved Inference
At test time:
  - Logits from both branches are averaged.
  - Temperature calibration is applied.
  - Final probabilities are returned.
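The inference steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code from trainers/cocoop.py: the function names, the toy logits, and the temperature values are all assumptions, and the real trainer would operate on PyTorch tensors. It follows the order stated above (average logits, rescale by a temperature, apply softmax); averaging per-branch probabilities instead would be the other reading of "probability ensembling".

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(logits_a, logits_b, temperature=1.0):
    """Average two branches' logits, divide by a temperature, return probabilities.

    `temperature` is a hypothetical calibration parameter; values above 1
    soften the distribution, values below 1 sharpen it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    averaged = [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]
    return softmax([x / temperature for x in averaged])

# Toy logits for three classes (made-up numbers, not model outputs).
branch1 = [4.0, 1.0, 0.5]
branch2 = [3.0, 2.0, 0.5]
probs_sharp = ensemble_predict(branch1, branch2, temperature=1.0)
probs_soft = ensemble_predict(branch1, branch2, temperature=2.0)
```

In this sketch the temperature changes only the confidence of the prediction, not the predicted class, since dividing all logits by the same positive constant preserves their ordering.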
- Presentation PDF: Project Presentation
- Presentation PPTX: Project Presentation
- Term Paper PDF: Term Paper
- Term Paper Latex Files: Term Paper Latex files
- CLIP: Contrastive Language-Image Pretraining model for aligned image-text embeddings.
- Prompt learning: learning context tokens instead of fine-tuning the full model.
- CoOp: Context Optimization with static learned prompts.
- CoCoOp: Conditional CoOp using image-conditioned prompts.
- Few-shot learning: training with only a few labeled examples per class.
- Consistency regularization: encouraging multiple predictors to output similar predictions.
- Temperature calibration: rescaling logits before softmax.
Class-name-only prompts underuse semantic language knowledge.
Few-shot prompt learning may become unstable.
Stronger adaptation is needed without expensive retraining.
- Better prompt robustness under low-data settings.
- Better semantic exploitation of CLIP text encoder.
- Efficient multi-prompt learning.
- Generalization to unseen classes.
Use semantic descriptors per class.
Train multiple prompt branches to agree.
Average branches and calibrate confidence.
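As an illustration of the semantic descriptors and the two templates, the sketch below builds one prompt string per class and per branch. The `ATTRIBUTES` dictionary, the helper names, and the fallback to the bare class name when no attribute is listed are assumptions made for this example; only the Persian description and the two template strings come from this README. How these strings interact with the learned context tokens in the actual trainer is not shown here.

```python
# Two prompt templates, one per branch (strings taken from this README).
TEMPLATES = ("a photo of a {}", "a blurry photo of a {}")

# Hypothetical attribute table; only the Persian entry appears in the README.
ATTRIBUTES = {"Persian": "fluffy long-haired flat-faced cat"}

def describe(class_name, attributes):
    """Return 'name, attributes' when attributes exist, else the bare class name."""
    attr = attributes.get(class_name)
    return f"{class_name}, {attr}" if attr else class_name

def build_prompts(class_names, attributes, templates=TEMPLATES):
    """Return one list of prompt strings per template (i.e. per branch)."""
    return [
        [template.format(describe(name, attributes)) for name in class_names]
        for template in templates
    ]

prompts = build_prompts(["Persian", "Beagle"], ATTRIBUTES)
# prompts[0][0] == "a photo of a Persian, fluffy long-haired flat-faced cat"
# prompts[1][1] == "a blurry photo of a Beagle"
```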
This repository modifies the original CoCoOp implementation.
trainers/cocoop.py
- Added dual prompt branches
- Added KL consistency loss
- Added attribute-aware class prompts
- Added probability ensembling
- Added temperature calibration
- OxfordPets image
- Class text prompts
- Few-shot labeled samples
- Extract image features using frozen CLIP encoder.
- Generate prompts from two branches.
- Compute logits for both branches.
- Apply CE loss to both.
- Apply KL consistency loss.
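The training objective implied by the steps above can be sketched as cross-entropy on both branches plus a weighted one-way KL term. This is a plain-Python illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, the KL direction is taken as KL(branch 1 || branch 2) so that Branch 2 is pulled toward Branch 1, and the weight `lam` is assumed to multiply the KL term. In a PyTorch trainer, Branch 1's probabilities would presumably be detached so they act as a fixed target.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits, label):
    """Negative log-probability of the true class."""
    return -log_softmax(logits)[label]

def one_way_kl(target_logits, student_logits):
    """KL(target || student), with the target distribution treated as fixed."""
    log_p = log_softmax(target_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def dual_branch_loss(logits_1, logits_2, label, lam=10.0):
    """CE on both branches plus lam * KL(branch 1 || branch 2).

    `lam` is the consistency weight; its default here is illustrative only.
    """
    ce = cross_entropy(logits_1, label) + cross_entropy(logits_2, label)
    return ce + lam * one_way_kl(logits_1, logits_2)

# Toy example: the branches disagree, so the KL term is positive.
loss = dual_branch_loss([2.0, 0.5, 0.1], [1.0, 1.2, 0.1], label=0)
```

When both branches produce identical logits the KL term vanishes and the loss reduces to the sum of the two cross-entropy terms.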
- Final trained prompt learner
- Improved few-shot classifier
- Dataset: OxfordPets
- Shots: 16-shot
- Classes: Base classes
- Backbone: CLIP ViT-B/16
- Seed: 1
- Epochs: 10
| Method | Accuracy |
|---|---|
| CoCoOp Baseline | 94.8% |
| Consistency Prompt Learning | 95.5% |
| AACPL (Final) | 96.0% |
Improvement over baseline: +1.2 percentage points (94.8% to 96.0%)
| Lambda | Validation Accuracy |
|---|---|
| 10 | 98.7% |
| 20 | 98.7% |
| 30 | 98.7% |
| 40 | 98.7% |
Validation accuracy is unchanged at 98.7% for consistency weights from 10 to 40, so the method is stable across this range.
- Clone the repository:

      git clone https://github.com/KaiyangZhou/CoOp
      cd CoOp

- Replace the file: overwrite the existing trainers/cocoop.py with the one in this repo.

- Set up the environment: create a virtual environment and install the required dependencies.

      python3 -m venv venv
      source venv/bin/activate  # On Windows use: venv\Scripts\activate
      pip install -r requirements.txt

- Dataset setup: download the Oxford-IIIT Pet dataset and the split file from CoOp, and place them inside your dataset root directory. Example:

      C:/Users/basel/Desktop/datasets/oxford_pets

- Training: configure the training parameters in the provided configuration file and run:

      python train.py --root C:/Users/basel/Desktop/datasets --seed 1 --trainer CoCoOp --dataset-config-file configs/datasets/oxford_pets.yaml --config-file configs/trainers/CoCoOp/vit_b16_c4_ep10_batch1_ctxv1.yaml DATASET.NUM_SHOTS 16 DATASET.SUBSAMPLE_CLASSES base
Thanks to:
- Dr. Muzammil Behzad for supervision and guidance.
- Open-source contributors of CLIP, CoOp, and CoCoOp.
- KFUPM for academic support.