This repository provides the official implementation of PRPO (Paragraph-level Relative Policy Optimization), a reinforcement learning framework designed to improve reasoning and detection accuracy in vision-language deepfake detection tasks. The paper has been accepted at ICML 2026.
This folder contains the DX-LLaVA implementation, which fine-tunes LLaVA (Large Language and Vision Assistant) with a ConvNeXt vision encoder for deepfake detection. DX-LLaVA enhances the standard LLaVA architecture by integrating ConvNeXt as the vision encoder, specifically optimized for visual deception detection. Training follows two stages: feature alignment pretraining, then visual instruction tuning.
- CUDA-capable GPU (recommended: 2x H100 94GB or similar)
- Python 3.10+
- Conda/Miniconda
Create and activate the conda environment using the provided configuration:
conda env create -f llava.yaml
conda activate llava

To access the DF-R5 dataset, please fill out the request form:
Your dataset should be organized as follows:
DF_R5_processed_images/
├── [domain name]/
│ ├── train/
│ │ ├── fake/
│ │ └── real/
│ ├── val/
│ │ ├── fake/
│ │ └── real/
│ └── test/
│ ├── fake/
│ └── real/
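Before training, it can help to confirm that a domain folder matches the layout above. The following is a minimal sketch, not part of the repository: the helper names `missing_dirs` and `count_images` are hypothetical, and it assumes every regular file inside a `fake/` or `real/` folder is an image.

```python
from pathlib import Path

# Split and label folder names taken from the directory tree above.
SPLITS = ("train", "val", "test")
LABELS = ("fake", "real")


def missing_dirs(root, domain):
    """Return the expected 'split/label' folders that are absent for one domain."""
    base = Path(root) / domain
    return [
        f"{split}/{label}"
        for split in SPLITS
        for label in LABELS
        if not (base / split / label).is_dir()
    ]


def count_images(root, domain):
    """Count files per (split, label). Assumption: every regular file is an image."""
    base = Path(root) / domain
    counts = {}
    for split in SPLITS:
        for label in LABELS:
            folder = base / split / label
            if folder.is_dir():
                counts[(split, label)] = sum(1 for p in folder.iterdir() if p.is_file())
            else:
                counts[(split, label)] = 0
    return counts
```

Calling `missing_dirs("../DF_R5_processed_images", "<domain name>")` with an actual domain folder name should return an empty list when the layout is complete, and `count_images` gives a quick view of class balance per split.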
Ensure you have the base model checkpoints available:
- Pre-trained LLaVA checkpoint
- ConvNeXt vision encoder weights
./run_train_dx_llava.sh

Key training parameters can be modified in run_train_dx_llava.sh:
# Model and data paths
MODEL_PATH="../checkpoints/dx-llava-binary-ddim"
IMAGE_FOLDER="../DF_R5_processed_images"
TRAIN_DATA="./data/train.json"
# Training hyperparameters
BATCH_SIZE=8
LEARNING_RATE=2e-5
NUM_EPOCHS=3
MAX_LENGTH=2048

Evaluate your trained model:
./run_eval_dx_llava.sh

The evaluation script includes:
- Question Generation: Creates evaluation questions from the dataset
- Model Inference: Runs multi-GPU inference for efficiency
- Performance Metrics: Calculates accuracy, precision, recall, and F1-score
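The four metrics listed above can be computed from paired ground-truth and predicted labels. The sketch below illustrates the standard definitions only; it is not the repository's evaluation code. The function name `binary_metrics` is hypothetical, and it assumes string labels with `"fake"` treated as the positive class.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumption: `positive` names the class counted as a detection ("fake" here).
    Ratios with a zero denominator are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, with three fake images of which two are detected, plus one real image correctly classified, this yields accuracy 0.75, precision 1.0, recall of about 0.667, and F1 of 0.8.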
Modify evaluation parameters in run_eval_dx_llava.sh:
# Evaluation settings
MODEL_PATH="../checkpoints/dx-llava-binary-ddim"
TEST_DATA="./question/df_question_ddim_1000.jsonl"
OUTPUT_FILE="./answer/dx_llava_eval_results.jsonl"
BATCH_SIZE=32

DX-LLaVA incorporates:
- Vision Encoder: ConvNeXt-Large for robust visual feature extraction
- Vision-Language Connector: Two-layer MLP projection
- Language Model: Vicuna/LLaMA-based backbone
- Specialized Head: Binary classification for deepfake detection
- CUDA Out of Memory: Reduce batch size or use gradient accumulation
- DataLoader Errors: Verify dataset paths and format
- Checkpoint Loading: Ensure model paths are correct and accessible
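When lowering the batch size to avoid out-of-memory errors, gradient accumulation can keep the effective batch size roughly constant. The sketch below shows only the arithmetic. The function `accumulation_plan` is hypothetical, and it assumes the batch size is counted per device and that the effective batch size is per-device batch × number of GPUs × accumulation steps; how the training script interprets `BATCH_SIZE` should be checked in the script itself.

```python
import math


def accumulation_plan(target_effective_batch, per_device_batch, num_gpus=1):
    """Return (accumulation_steps, resulting_effective_batch).

    Assumption: effective batch = per_device_batch * num_gpus * accumulation_steps.
    Steps are rounded up, so the result may slightly exceed the target.
    """
    if target_effective_batch < 1 or per_device_batch < 1 or num_gpus < 1:
        raise ValueError("all arguments must be positive integers")
    per_step = per_device_batch * num_gpus
    steps = math.ceil(target_effective_batch / per_step)
    return steps, steps * per_step
```

For instance, `accumulation_plan(16, 2, num_gpus=2)` returns `(4, 16)`: four accumulation steps at a per-device batch of 2 on two GPUs reproduce an effective batch of 16.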
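For limited GPU memory:
# Use gradient checkpointing and mixed precision where the training configuration allows it
# Limit how large a cached block the allocator may split, which can reduce memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Make CUDA kernel launches synchronous so errors surface at the failing call
# (a debugging aid that slows training; it does not reduce memory use)
export CUDA_LAUNCH_BLOCKING=1

If you use this work in your research, please cite: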
@inproceedings{tuan2026prpo,
  title={{PRPO}: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection},
  author={Tuan Nguyen and Naseem Khan and Khang Tran and NhatHai Phan and Issa Khalil},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=BGcw0KWStP}
}

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Original LLaVA implementation
- ConvNeXt architecture
- DeepSpeed optimization framework
