KRCapVLM: Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model

Pattern Recognition 2026.

Introduction

KRCapVLM (Enhanced K-Replay) is a vision-language captioning model built to generate accurate, knowledge-rich image captions. It extends the K-Replay framework with three additions:

Beam search decoding, which generates more diverse and accurate pseudo-captions.
Attention layers, which help the model focus on relevant image regions.
A learning rate scheduler, which stabilizes training.

These modifications help the model retain and express real-world knowledge. In the KnowCap results reported in this README, recognition accuracy rises from 50.40% with K-Replay alone to as much as 63.30% with the enhancements, and CIDEr rises from 90.3 to as much as 92.6.

Project Metadata

Author: Reem AlJunaid
Supervisor: Dr. Muzammil Behzad
Affiliations: KFUPM and IAU

Reference Paper

Cheng et al., “Beyond generic: Enhancing image captioning with real-world knowledge using vision-language pre-training model”, ACM MM 2023 (the KnowCap paper).

Datasets Used

🖼️ MS-COCO

Description: A standard image captioning dataset.
Split (Karpathy):
- Train: 113,287 images
- Validation: 5,000 images
- Test: 5,000 images
Each image is paired with five human-written captions.

🧠 Replay CC12M Subset

Description: A curated subset of over 20,000 samples extracted from CC12M.
Filtering Criteria: Image-text pairs that mention any of 122 predefined keywords.
Use: These samples are employed as replay exemplars during training.
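
The keyword filter described above can be sketched as follows. This is a minimal illustration, not the logic in utils/cc12m.py: the record layout (a `caption` field per sample), the function names, and the example keywords are all assumptions made for the sketch.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole phrase."""
    # Longer keywords first so multi-word names win over their prefixes.
    escaped = [re.escape(k) for k in sorted(keywords, key=len, reverse=True)]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def filter_replay_samples(samples, keywords):
    """Keep image-text pairs whose caption mentions at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    return [s for s in samples if pattern.search(s["caption"])]

# Hypothetical keywords and records, for illustration only.
example_keywords = ["Eiffel Tower", "iPhone"]
example_samples = [
    {"image": "a.jpg", "caption": "The Eiffel Tower at night"},
    {"image": "b.jpg", "caption": "A dog running on the beach"},
    {"image": "c.jpg", "caption": "Unboxing a new iphone"},
]
replay = filter_replay_samples(example_samples, example_keywords)
print([s["image"] for s in replay])
```

Matching on word boundaries avoids counting a keyword that only appears inside a longer, unrelated word.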

🔍 KnowCap Dataset

Description: Enhances captioning with real-world knowledge.
Total Pairs: 1,424 image-caption pairs across 240 knowledge categories.
Validation Set: 424 samples
Test Set: 1,000 samples
- Unseen Set: 520 samples with 120 categories not in the predefined keyword list
Use: Evaluates the model’s generalization to new, unseen knowledge concepts.

Terminologies

Image Captioning: The task of generating natural language descriptions for images.
Vision-Language Pretraining (VLP): Training models on large-scale image-text pairs to learn joint visual and textual representations.
Knowledge-rich Captions: Captions that include contextual, domain-specific, or named-entity knowledge beyond generic descriptions.
K-Replay: A continual learning framework that replays previously seen knowledge-rich samples to mitigate forgetting during fine-tuning.
Pseudo-Captions: Captions generated by the model to simulate training targets, often used for weakly-supervised or replay training.
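Beam Search: A decoding strategy that keeps the top-scoring partial captions (the beam) at every step instead of committing to the single most likely token, which yields multiple caption hypotheses and usually better captions than greedy decoding.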
Attention Mechanism: A neural network component that helps the model focus on important regions of an image while generating captions.
Catastrophic Forgetting: The tendency of neural networks to forget previously learned information when trained on new data.
Learning Rate Scheduler: A strategy to adjust the learning rate during training to stabilize convergence and improve performance.
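
To make the difference between greedy decoding and beam search concrete, here is a small self-contained sketch. It is not the code in utils/beamsearch.py: the toy vocabulary, the `toy_model` function, and the absence of a length penalty are simplifying assumptions.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="<eos>"):
    """Return up to `beam_size` (tokens, score) hypotheses, best first.

    `next_log_probs(prefix)` must return a dict {token: log-probability}.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        # Keep only the `beam_size` best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len still count
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]

# Hypothetical next-token distribution in which the greedy first choice is a trap.
TOY = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.5, "cat": 0.5},
    ("the",): {"tower": 0.9, "cat": 0.1},
}

def toy_model(prefix):
    probs = TOY.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy = beam_search(toy_model, beam_size=1, max_len=4)  # beam of 1 is greedy decoding
wide = beam_search(toy_model, beam_size=2, max_len=4)
print(greedy[0][0], wide[0][0])
```

With a beam of one, the search commits to "a" and ends with probability 0.30; with a beam of two, it keeps "the" alive and finds "the tower" with probability 0.36.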

Problem Statements

Problem 1: Existing image captioning models tend to produce generic captions that miss contextual and real-world knowledge.
Problem 2: Vision-language pretraining (VLP) models struggle with zero-shot inference and often hallucinate knowledge.
Problem 3: Fine-tuning VLP models introduces a generic bias that limits knowledge expression.
Problem 4: The original K-Replay framework lacks stability and still leaves room for performance improvement.

Proposed Enhancements

Replace greedy decoding with beam search.
Add attention layers to image encoders for better visual focus.
Use cosine learning rate schedulers for smoother convergence.
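
As an illustration of the cosine schedule, the sketch below computes the learning rate at a given step in plain Python. The function name, the `total_steps` value, and the zero minimum rate are assumptions; the README does not state which scheduler implementation the training script uses (PyTorch offers torch.optim.lr_scheduler.CosineAnnealingLR for the same curve).

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at `step`, clamped to [0, total_steps]."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

base_lr = 7e-6      # same value as --learning_rate in the training command
total_steps = 1000  # hypothetical; depends on dataset size, batch size, and epochs
schedule = [cosine_lr(s, total_steps, base_lr) for s in (0, 500, 1000)]
print(schedule)
```

The rate starts at the base value, passes through half of it at the midpoint, and decays smoothly to the minimum, which avoids the abrupt drops of a step schedule.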

Key Components

config.py: Contains configuration settings for training and evaluation.
data/: JSON files for COCO, CC12M, and KnowCap datasets used during training and testing.
data_load.py: Loads and preprocesses datasets for training and evaluation.
test.py: Evaluation script for COCO dataset.
test_knowcap.py: Evaluation script for KnowCap dataset.
models/: Contains backbone OFA model.
train_multitask_beam.py: Training script for the Enhanced K-Replay model.
utils/: Includes utilities such as:
- beamsearch.py: Beam search decoding logic.
- cc12m.py: Replay sample filtering from CC12M.
- convert_ofa.py: Checkpoint conversion for OFA.
- eval.py: Caption generation and metric calculation.
- import_models.py, log.py, loss.py, optimizer_tools.py: Misc. training and logging support.
- prepro_data.py: Dataset construction and formatting.

How to Run the Code

Download the Images

Prepare Data for Training, Validation, and Testing

Run utils/prepro_data.py to construct and format the dataset files.

Alternatively, you can use the preprocessed data and place it inside the ./data directory.
Make sure to update the file_path entries in each JSON file to match the location of the images you downloaded (see "Download the Images").
Also adjust the necessary parameters and their paths in config.py and the other scripts to match your environment.
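
Updating the file_path entries by hand is error-prone, so a small helper may be useful. The sketch below assumes each JSON file holds a list of dictionaries with a `file_path` key; that layout, the function names, and the example paths are assumptions, so check the actual files in ./data before using it.

```python
import json
from pathlib import Path

def rebase_file_paths(records, old_root, new_root):
    """Return copies of `records` with a leading `old_root` in file_path replaced."""
    updated = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the input list is left untouched
        path = rec.get("file_path", "")
        if path.startswith(old_root):
            rec["file_path"] = new_root + path[len(old_root):]
        updated.append(rec)
    return updated

def rebase_json_file(json_path, old_root, new_root):
    """Rewrite one JSON file in place; assumes it holds a list of dicts."""
    json_path = Path(json_path)
    records = json.loads(json_path.read_text(encoding="utf-8"))
    json_path.write_text(
        json.dumps(rebase_file_paths(records, old_root, new_root), ensure_ascii=False),
        encoding="utf-8",
    )

# Hypothetical record and roots, for illustration only.
demo = [{"file_path": "/old/images/coco/1.jpg", "caption": "a cat"}]
rebased = rebase_file_paths(demo, "/old/images", "/content/images")
print(rebased)
```

Keep a backup of the original JSON files before calling `rebase_json_file`, since it overwrites them.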

Download Pretrained OFA-Large

Follow the original instructions to prepare the checkpoints for VLP models (e.g., OFA):

Download the Transformers version of OFA-large and OFA-large-caption.

!git clone --single-branch --branch feature/add_transformers https://github.com/OFA-Sys/OFA.git
!git clone https://huggingface.co/OFA-Sys/OFA-large-caption
!git clone https://huggingface.co/OFA-Sys/OFA-large

Due to known issues with these checkpoints, convert them using convert_ofa.py to align with the official Fairseq parameters.

Alternatively, you can directly use the converted OFA-large checkpoints and finetuned OFA-large-caption checkpoints we provide.

Replace the original pycocoevalcap/eval.py with the repository's eval.py

import shutil
custom_file_path = '/content/drive/My Drive/DL Project/KnowCap-master/eval.py'  # Path to your custom eval.py file
destination_path = '/usr/local/lib/python3.11/dist-packages/pycocoevalcap/eval.py'  # Path where you want to copy the custom file
shutil.copy(custom_file_path, destination_path)
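
The destination path above is tied to one Python version and install location. A more portable variant, sketched below, asks Python where the package is installed. The helper names are hypothetical, and the backup step is an addition that the original snippet does not perform.

```python
import importlib.util
import shutil
from pathlib import Path

def package_file(package, filename):
    """Return the path of `filename` inside an installed package's directory."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"{package} is not installed as a package")
    return Path(list(spec.submodule_search_locations)[0]) / filename

def install_custom_eval(custom_eval, package="pycocoevalcap"):
    """Back up the packaged eval.py, then overwrite it with the custom one."""
    target = package_file(package, "eval.py")
    if target.exists():
        shutil.copy(target, target.with_name(target.name + ".bak"))
    shutil.copy(custom_eval, target)
    return target
```

Calling `install_custom_eval("eval.py")` from the repository root would then replace the installed file regardless of the Python version, provided pycocoevalcap is installed in the active environment.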

Train the Enhanced K-Replay Model

!nohup python train_multitask_beam.py --mode train \
    --model OFA --id ofa_kreplay_with_scheduler_beam_attention --batch_size 8 --epochs 10 \
    --learning_rate 7e-6 --label_smoothing 0.1 \
    --multitask_weight 1.0 --KD_temperature 16.0 \
    --knowdistill_weight 1.0 --save_model_freq 100 \
    --ofa_ckpts "/content/drive/MyDrive/DL Project/KnowCap-master/checkpoints/ofa/OFA-large-caption-trainedenc" \
    --ofa_ckpts_distill "/content/drive/MyDrive/DL Project/KnowCap-master/checkpoints/ofa/OFA-large-caption-XEfinetuned" \
    --train_mix "/content/drive/My Drive/DL Project/KnowCap-master/data/train_mix_32000.json" \
    --use_patch_self_attn \
    --method XEdistill > train_ofa_kreplay_scheduler_beam_attention.log 2>&1 &

Evaluate Trained Model on COCO

!nohup bash -c 'CUDA_VISIBLE_DEVICES=0 python test.py \
    --model OFA \
    --id ofa_kreplay_with_scheduler_beam_attention \
    --trained_ckpts "/content/drive/MyDrive/DL Project/KnowCap-master/checkpoints/ofa/log/ofa_kreplay_with_scheduler_beam_attention/model/model_4800.pt" \
    --use_patch_self_attn \
    --step 4800 --length_penalty 1.0' > test_beam_attention_4800_COCO.log 2>&1 &

Evaluate Trained Model on KnowCap

!nohup bash -c 'CUDA_VISIBLE_DEVICES=0 python test_knowcap.py \
    --model OFA \
    --id ofa_kreplay_with_scheduler_beam_attention \
    --trained_ckpts "/content/drive/MyDrive/DL Project/KnowCap-master/checkpoints/ofa/log/ofa_kreplay_with_scheduler_beam_attention/model/model_4800.pt" \
    --use_patch_self_attn \
    --step 4800 --length_penalty 1.0' > test_beam_attention_4800_knowcap.log 2>&1 &

Evaluation Metrics

CIDEr
BLEU
ROUGE
METEOR
Recognition Accuracy (for Knowledge concepts)
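
Recognition Accuracy can be pictured as the share of generated captions that name the expected knowledge concept. The sketch below assumes a caption counts as correct when it contains the ground-truth keyword as a case-insensitive substring; the matching and normalization actually used by utils/eval.py may differ, and the example captions and keywords are hypothetical.

```python
def recognition_accuracy(captions, keywords):
    """Fraction of captions that mention their ground-truth knowledge keyword."""
    if len(captions) != len(keywords):
        raise ValueError("captions and keywords must align one-to-one")
    if not captions:
        return 0.0
    hits = sum(kw.lower() in cap.lower() for cap, kw in zip(captions, keywords))
    return hits / len(captions)

# Hypothetical model outputs and labels, for illustration only.
caps = [
    "the eiffel tower lit up at night",
    "a white building with a dome",
    "a person holding an iphone",
    "a red sports car on a road",
]
kws = ["Eiffel Tower", "Taj Mahal", "iPhone", "Ferrari"]
ra = recognition_accuracy(caps, kws)
print(f"RA = {ra:.2%}")
```

The second and fourth captions are fluent but generic, which is exactly the failure that this metric is meant to expose and that CIDEr alone can miss.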

Results on KnowCap (Highlights)

Method	CIDEr (C)	Recognition Accuracy (RA)
OFA zero-shot	39.2	39.80%
OFA-Finetuned	41.7	38.50%
+K-Replay	90.3	50.40%
+Scheduler	92.6	54.20%
+Beam	92.6	63.30%
+Attention	92.0	58.90%

Acknowledgments

Thanks to the KnowCap authors for open-sourcing their framework.
Special appreciation to Dr. Muzammil Behzad for supervision and guidance.
Gratitude to KFUPM and associated research staff for support.
