KRCapVLM: Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model
Pattern Recognition 2026.
Enhanced K-Replay is an image captioning model built to generate accurate, knowledge-rich captions. It extends the K-Replay framework with:
- Beam search decoding for generating diverse and accurate pseudo-captions,
- Attention layers to better focus on relevant image regions,
- Learning rate schedulers to stabilize training.
These modifications help the model retain and express real-world knowledge. In the results table below, the beam search variant (+Beam) reaches a CIDEr of 92.6 and a recognition accuracy of 63.30%, compared with 90.3 and 50.40% for the original K-Replay.
- Author: Reem AlJunaid
- Supervisor: Dr. Muzammil Behzad
- Affiliations: KFUPM and IAU
- Cheng et al., “Beyond generic: Enhancing image captioning with real-world knowledge using vision-language pre-training model”, ACM MM 2023 (the KnowCap paper)
- Description: COCO, a standard image captioning dataset.
- Split (Karpathy):
  - Train: 113,287 images
  - Validation: 5,000 images
  - Test: 5,000 images
- Each image is paired with five human-written captions.
🧠 Replay CC12M Subset
- Description: A curated subset of over 20,000 samples extracted from CC12M.
- Filtering Criteria: Image-text pairs that mention any of 122 predefined keywords.
- Use: These samples are employed as replay exemplars during training.
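The keyword filter described above can be sketched in a few lines. This is only an illustration, not the repository's `utils/cc12m.py`: the three keywords, the `caption` field name, and the sample records are hypothetical stand-ins for the 122 predefined keywords and the real CC12M metadata.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole phrase."""
    # Longest first, so multi-word keywords are tried before shorter ones.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def filter_replay_samples(samples, keywords, text_key="caption"):
    """Keep the image-text pairs whose text mentions at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    return [s for s in samples if pattern.search(s[text_key])]

# Hypothetical keywords and records, for illustration only.
keywords = ["Eiffel Tower", "iPhone", "Golden Gate Bridge"]
samples = [
    {"image": "a.jpg", "caption": "The Eiffel Tower at night"},
    {"image": "b.jpg", "caption": "A dog running on the beach"},
    {"image": "c.jpg", "caption": "Unboxing the new iphone"},
]
replay = filter_replay_samples(samples, keywords)
print([s["image"] for s in replay])
```

Matching on word boundaries avoids counting a keyword that only appears inside a longer word; whether the original filter does the same is an assumption.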
🔍 KnowCap Dataset
- Description: Enhances captioning with real-world knowledge.
- Total Pairs: 1,424 image-caption pairs across 240 knowledge categories.
- Validation Set: 424 samples
- Test Set: 1,000 samples
- Unseen Set: 520 samples with 120 categories not in the predefined keyword list
- Use: Evaluates the model’s generalization to new, unseen knowledge concepts.
- Image Captioning: The task of generating natural language descriptions for images.
- Vision-Language Pretraining (VLP): Training models on large-scale image-text pairs to learn joint visual and textual representations.
- Knowledge-rich Captions: Captions that include contextual, domain-specific, or named-entity knowledge beyond generic descriptions.
- K-Replay: A continual learning framework that replays previously seen knowledge-rich samples to mitigate forgetting during fine-tuning.
- Pseudo-Captions: Captions generated by the model to simulate training targets, often used for weakly-supervised or replay training.
- Beam Search: A decoding strategy that keeps the k highest-scoring partial captions at every step instead of committing to the single most likely token, which usually improves quality over greedy decoding.
- Attention Mechanism: A neural network component that helps the model focus on important regions of an image while generating captions.
- Catastrophic Forgetting: The tendency of neural networks to forget previously learned information when trained on new data.
- Learning Rate Scheduler: A strategy to adjust the learning rate during training to stabilize convergence and improve performance.
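To make the difference between greedy decoding and beam search concrete, here is a toy sketch. `TOY_MODEL` is a hypothetical hand-written table of next-token probabilities, not the OFA decoder, and `beam_search` is an illustrative function rather than the code in `utils/beamsearch.py`.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the prefix.
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.5, "cat": 0.5},
    ("the",): {"tower": 0.9, "dog": 0.1},
}

def beam_search(model, beam_size, max_len):
    """Return hypotheses as (tokens, log_prob) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            next_probs = model.get(tokens)
            if next_probs is None:
                # No continuation defined: carry the hypothesis over unchanged.
                candidates.append((tokens, score))
                continue
            for tok, p in next_probs.items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        # Keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

greedy = beam_search(TOY_MODEL, beam_size=1, max_len=2)[0]  # beam size 1 is greedy decoding
best = beam_search(TOY_MODEL, beam_size=2, max_len=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(best[0], round(math.exp(best[1]), 2))
```

With a beam size of 1 the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.30. With a beam size of 2 it keeps "the" alive and finds "the tower" at 0.36.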
- Problem 1: Existing image captioning models tend to produce generic captions that miss contextual and real-world knowledge.
- Problem 2: Vision-language pretraining (VLP) models struggle with zero-shot inference and often hallucinate knowledge.
- Problem 3: Fine-tuning VLP models introduces a generic bias that limits knowledge expression.
- Problem 4: The original K-Replay framework lacks stability and still leaves room for performance improvement.
- Replace greedy decoding with beam search.
- Add attention layers to image encoders for better visual focus.
- Use cosine learning rate schedulers for smoother convergence.
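The cosine schedule in the last item follows a simple closed form. The sketch below is a standalone illustration under stated assumptions: `cosine_lr` and `total_steps = 1000` are hypothetical, and only the base learning rate of 7e-6 is taken from the training command in this README.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` under cosine decay from base_lr down to min_lr."""
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / max(1, total_steps), 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

base_lr = 7e-6      # value passed as --learning_rate in the training command
total_steps = 1000  # hypothetical number of optimizer steps
schedule = [cosine_lr(s, total_steps, base_lr) for s in range(total_steps + 1)]
print(schedule[0], schedule[total_steps // 2], schedule[-1])
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the minimum at the end, so late updates are small. PyTorch ships the same curve as `torch.optim.lr_scheduler.CosineAnnealingLR`; whether this repository uses that class or its own code in `optimizer_tools.py` is not stated here.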
- `config.py`: Contains configuration settings for training and evaluation.
- `data/`: JSON files for COCO, CC12M, and KnowCap datasets used during training and testing.
- `data_load.py`: Loads and preprocesses datasets for training and evaluation.
- `test.py`: Evaluation script for COCO dataset.
- `test_knowcap.py`: Evaluation script for KnowCap dataset.
- `models/`: Contains backbone OFA model.
- `train_multitask.py`: Training script for the Enhanced K-Replay model.
- `utils/`: Includes utilities such as:
  - `beamsearch.py`: Beam search decoding logic.
  - `cc12m.py`: Replay sample filtering from CC12M.
  - `convert_ofa.py`: Checkpoint conversion for OFA.
  - `eval.py`: Caption generation and metric calculation.
  - `import_models.py`, `log.py`, `loss.py`, `optimizer_tools.py`: Misc. training and logging support.
  - `prepro_data.py`: Dataset construction and formatting.
- Download the Images
- Prepare Data for Training, Validation, and Testing
Build the data files with `prepro_data.py`. Alternatively, you can use the preprocessed data and place it inside the `./data` directory.
Make sure to update the `file_path` entries in each JSON file to match the location of your downloaded images (Step 1).
Also, adjust the necessary parameters and their paths in `config.py` and the other files to match your environment.
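Updating the `file_path` entries by hand is error-prone, so a small helper script can do it. The sketch below assumes each data file is a top-level JSON list of records that carry a `file_path` key; that layout, the function names, and the example roots are assumptions, so check one of the files in `./data` before running anything like this.

```python
import json
from pathlib import Path

def rebase_file_paths(records, old_root, new_root):
    """Return copies of the records with `file_path` moved from old_root to new_root."""
    updated = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the input is left untouched
        path = rec.get("file_path", "")
        if path.startswith(old_root):
            rec["file_path"] = new_root + path[len(old_root):]
        updated.append(rec)
    return updated

def rebase_json_file(json_path, old_root, new_root):
    """Rewrite one data file in place (assumes a top-level JSON list)."""
    json_path = Path(json_path)
    records = json.loads(json_path.read_text(encoding="utf-8"))
    rebased = rebase_file_paths(records, old_root, new_root)
    json_path.write_text(json.dumps(rebased), encoding="utf-8")

# Hypothetical record and roots, for illustration only.
demo = [{"file_path": "/home/user/coco/train2014/img1.jpg", "caption": "a cat"}]
print(rebase_file_paths(demo, "/home/user/coco", "/content/images/coco"))
```

Records whose path does not start with `old_root` are passed through unchanged, which makes the script safe to run twice.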
- Download Pretrained OFA-Large
Follow the original instructions to prepare the checkpoints for VLP models (e.g., OFA):
- Download the Transformers version of OFA-large and OFA-large-caption.
!git clone --single-branch --branch feature/add_transformers https://github.com/OFA-Sys/OFA.git
!git clone https://huggingface.co/OFA-Sys/OFA-large-caption
!git clone https://huggingface.co/OFA-Sys/OFA-large
- Due to known issues with these checkpoints, convert them using `convert_ofa.py` to align with the official Fairseq parameters.
Alternatively, you can directly use the converted OFA-large checkpoints and finetuned OFA-large-caption checkpoints we provide.
- Replace the original `pycocoevalcap/eval.py` with the custom `eval.py`
import shutil
custom_file_path = '/content/drive/My Drive/DL Project/KnowCap-master/eval.py' # Path to your custom eval.py file
destination_path = '/usr/local/lib/python3.11/dist-packages/pycocoevalcap/eval.py' # Path where you want to copy the custom file
shutil.copy(custom_file_path, destination_path)
- Train the Enhanced K-Replay Model
!nohup python train_multitask_beam.py --mode train \
--model OFA --id ofa_kreplay_with_scheduler_beam_attention --batch_size 8 --epochs 10 \
--learning_rate 7e-6 --label_smoothing 0.1 \
--multitask_weight 1.0 --KD_temperature 16.0 \
--knowdistill_weight 1.0 --save_model_freq 100 \
--ofa_ckpts "/content/drive/MyDrive/DL Project/KnowCap-master/checkpoints/ofa/OFA-large-caption-trainedenc" \
--ofa_ckpts_distill "/content/drive/MyDrive/DL Project/KnowCap-master/checkpoints/ofa/OFA-large-caption-XEfinetuned" \
--train_mix "/content/drive/My Drive/DL Project/KnowCap-master/data/train_mix_32000.json" \
--use_patch_self_attn \
--method XEdistill > train_ofa_kreplay_scheduler_beam_attention.log 2>&1 &
- Evaluate Trained Model on COCO
!nohup bash -c 'CUDA_VISIBLE_DEVICES=0 python test.py \
--model OFA \
--id ofa_kreplay_with_scheduler_beam_attention \
--trained_ckpts "/content/drive/MyDrive/DL Project/KnowCap-master/checkpoints/ofa/log/ofa_kreplay_with_scheduler_beam_attention/model/model_4800.pt" \
--use_patch_self_attn \
--step 4800 --length_penalty 1.0' > test_beam_attention_4800_COCO.log 2>&1 &
- Evaluate Trained Model on KnowCap
!nohup bash -c 'CUDA_VISIBLE_DEVICES=0 python test_knowcap.py \
--model OFA \
--id ofa_kreplay_with_scheduler_beam_attention \
--trained_ckpts "/content/drive/MyDrive/DL Project/KnowCap-master/checkpoints/ofa/log/ofa_kreplay_with_scheduler_beam_attention/model/model_4800.pt" \
--use_patch_self_attn \
--step 4800 --length_penalty 1.0' > test_beam_attention_4800_knowcap.log 2>&1 &
Evaluation metrics:
- CIDEr
- BLEU
- ROUGE
- METEOR
- Recognition Accuracy (for Knowledge concepts)
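Unlike the n-gram metrics above, recognition accuracy checks only whether the knowledge concept itself is named. A minimal sketch follows, assuming the metric is the fraction of generated captions that contain the expected keyword for their image. The function name, the substring matching rule, and the sample data are hypothetical, and the evaluation in `test_knowcap.py` may differ in detail.

```python
def recognition_accuracy(captions, keywords):
    """Fraction of captions that mention their image's knowledge keyword.

    captions: generated captions, one per image.
    keywords: expected knowledge concept for each image, in the same order.
    """
    if len(captions) != len(keywords):
        raise ValueError("captions and keywords must have the same length")
    if not captions:
        return 0.0
    # Case-insensitive substring match (an assumed matching rule).
    hits = sum(kw.lower() in cap.lower() for cap, kw in zip(captions, keywords))
    return hits / len(captions)

# Hypothetical model outputs and labels, for illustration only.
captions = [
    "the eiffel tower lit up at night",
    "a tall tower in a city",
    "a white tesla parked on the street",
    "a bridge over the water",
]
keywords = ["Eiffel Tower", "Tokyo Tower", "Tesla", "Golden Gate Bridge"]
print(f"{recognition_accuracy(captions, keywords):.2%}")
```

The second and fourth captions are fluent but generic, so they count as misses. This is the failure mode that the "generic bias" problem statement describes.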
| Method | CIDEr (C) | Recognition Accuracy (RA) |
|---|---|---|
| OFA zero-shot | 39.2 | 39.80% |
| OFA-Finetuned | 41.7 | 38.50% |
| +K-Replay | 90.3 | 50.40% |
| +Scheduler | 92.6 | 54.20% |
| +Beam | 92.6 | 63.30% |
| +Attention | 92.0 | 58.90% |
- Thanks to the KnowCap authors for open-sourcing their framework.
- Special appreciation to Dr. Muzammil Behzad for supervision and guidance.
- Gratitude to KFUPM and associated research staff for support.

