A GAN-based Diffusion architecture for enhancing low-bitrate/low-resolution video frames into high-fidelity, temporally stable video sequences.
alen-vfe combines Latent Diffusion Models, adversarial training, and temporal frame interpolation to deliver state-of-the-art video enhancement.
- 🚀 Fast Inference: few-step DDIM sampling
- 💾 Memory Efficient: LoRA fine-tuning
- 🎨 High Quality: Combined loss (MSE + LPIPS + Adversarial) ensures sharp, realistic results
- 🎬 Temporal Stability: RIFE integration eliminates flickering
```
┌─────────────────────────────────────────────────────────────┐
│                    Input: Low-Res Video                     │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              Generator (Stable Diffusion v1.5)              │
│                     + LoRA Fine-tuning                      │
│                   (1-step DDIM inference)                   │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                  Discriminator (PatchGAN)                   │
│              Evaluates realism of enhancements              │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                   Smoothing Layer (RIFE)                    │
│                Temporal Frame Interpolation                 │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                   Output: High-Res Video                    │
└─────────────────────────────────────────────────────────────┘
```
- Generator: Lightweight Latent Diffusion Model (Stable Diffusion v1.5)
  - Fine-tuned with LoRA (Low-Rank Adaptation)
  - Optimized inference using DDIM
- Discriminator: Pre-trained PatchGAN
  - Evaluates high-frequency detail realism
  - Provides adversarial feedback during training
- Smoothing Layer: RIFE (Real-Time Intermediate Flow Estimation)
  - Optical flow-based frame interpolation
  - Ensures temporal consistency
  - Eliminates flickering between frames
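As a rough illustration of how the stages compose (names and signatures here are hypothetical, not the actual alen-vfe API): the generator upscales each frame, the discriminator is used only during training, and RIFE-style interpolation inserts mid-frames between consecutive outputs for temporal stability.

```python
def enhance_sequence(frames, generate, interpolate):
    """Sketch of the inference pipeline: upscale every frame, then
    insert one interpolated mid-frame between each consecutive pair
    (roughly doubling the frame count, as with target_fps_multiplier: 2)."""
    upscaled = [generate(f) for f in frames]
    smoothed = [upscaled[0]]
    for prev, nxt in zip(upscaled, upscaled[1:]):
        smoothed.append(interpolate(prev, nxt))  # RIFE-style mid-frame
        smoothed.append(nxt)
    return smoothed
```

Swapping in the real generator and RIFE models would preserve this control flow; only the per-frame operations change.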
- Python 3.8+
- CUDA 11.7+ (for NVIDIA GPUs) or an Apple Silicon Mac with MPS support
- FFmpeg (for video processing)
```bash
# Clone the repository
git clone https://github.com/yourusername/alen-vfe.git
cd alen-vfe

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download RIFE pretrained model
python scripts/download_rife.py
```

```python
from inference.enhancer import VideoEnhancer
from omegaconf import OmegaConf

# Load configuration
config = OmegaConf.load("config/inference_config.yaml")

# Initialize enhancer
enhancer = VideoEnhancer(config)

# Enhance video
enhancer.enhance_video(
    input_path="input_video.mp4",
    output_path="enhanced_video.mp4",
    scale_factor=4,
)
```

```bash
python inference/enhance.py \
    --input input_video.mp4 \
    --output enhanced_video.mp4 \
    --checkpoint checkpoints/best_model.pth \
    --scale 4 \
    --enable-rife
```

We use the Vimeo-90K dataset for training:
```bash
# Download dataset
python data/download.py --dataset vimeo90k --output ./data

# Prepare training data
python data/prepare_dataset.py \
    --dataset vimeo90k \
    --downscale-factor 4 \
    --output ./data/processed
```

- Upload the project to Kaggle
- Open notebooks/train_kaggle.ipynb
- Ensure GPU accelerator is enabled (T4 recommended)
- Run all cells
```bash
python training/train.py \
    --config config/training_config.yaml \
    --output-dir ./checkpoints
```

- Size: ~7GB (perfect for quick start!)
- Images: 800 training + 100 validation
- Resolution: Up to 2K high-quality images
- Download: Official Link
- Why: Much smaller than Vimeo-90K, faster downloads, great for testing
- Size: ~82GB
- Sequences: 89,800 triplets (3 frames each)
- Resolution: 448×256
- Download: Official Link
- Why: Video-specific data, more data for production models
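The `--downscale-factor 4` preparation step builds paired training data by shrinking high-resolution frames. A minimal sketch of that degradation, illustrative only (`prepare_dataset.py` may well use bicubic filtering rather than the block averaging shown here):

```python
def downscale(frame, factor=4):
    """Shrink a 2D grid of pixel intensities by averaging factor x factor
    blocks, producing the low-res half of an LR/HR training pair."""
    h, w = len(frame), len(frame[0])
    return [
        [
            sum(frame[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            // (factor * factor)
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]
```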
Edit `config/training_config.yaml`:

```yaml
model:
  generator:
    lora_rank: 8        # Higher = more capacity, more VRAM
    inference_steps: 4  # 1-4 steps for fast inference

training:
  batch_size: 8
  num_epochs: 100
  learning_rate:
    generator: 1.0e-5
    discriminator: 4.0e-4

loss:
  weights:
    mse: 1.0
    lpips: 0.5
    adversarial: 0.1
```

Edit `config/inference_config.yaml`:
```yaml
enhancement:
  scale_factor: 4
  enable_rife: true
  target_fps_multiplier: 2

video:
  batch_size: 10
  output_codec: "libx264"
  output_crf: 18
```

```
alen-vfe/
├── config/                  # Configuration files
│   ├── training_config.yaml
│   └── inference_config.yaml
├── data/                    # Dataset utilities
│   ├── __init__.py
│   ├── dataset.py
│   ├── download.py
│   └── prepare_dataset.py
├── models/                  # Model architectures
│   ├── __init__.py
│   ├── generator.py         # Stable Diffusion + LoRA
│   ├── discriminator.py     # PatchGAN
│   └── rife.py              # RIFE integration
├── training/                # Training infrastructure
│   ├── __init__.py
│   ├── losses.py
│   ├── trainer.py
│   └── utils.py
├── inference/               # Inference pipeline
│   ├── __init__.py
│   ├── enhancer.py
│   ├── enhance.py           # CLI script
│   └── video_utils.py
├── notebooks/               # Jupyter notebooks for experiments and training
├── runs/                    # TensorBoard event logs
├── outputs/                 # Enhanced video outputs and sample results
├── checkpoints/             # Model checkpoints (ignored by git)
├── dataset/                 # Training datasets (ignored by git)
├── requirements.txt
└── README.md
```
- `notebooks/`: Jupyter notebooks for exploratory data analysis, experimental training runs, and Kaggle-specific setup.
- `runs/`: TensorBoard event files. Visualize training progress with `tensorboard --logdir runs/`.
- `outputs/`: All enhanced videos, preview images, and test results.
- `checkpoints/`: Model weights saved during training.
- `dataset/`: Local storage for training data such as DIV2K or Vimeo-90K.
```bash
# Run unit tests
pytest tests/

# Test inference pipeline
python tests/test_pipeline.py --checkpoint checkpoints/best_model.pth

# Benchmark performance
python tests/benchmark.py --device cuda
```

The model uses a combined loss function:
L_total = λ₁·L_MSE + λ₂·L_LPIPS + λ₃·L_ADV
- L_MSE: Pixel-wise Mean Squared Error (structural accuracy)
- L_LPIPS: Learned Perceptual Image Patch Similarity (perceptual quality)
- L_ADV: Adversarial Loss (realism)
Default weights: λ₁=1.0, λ₂=0.5, λ₃=0.1
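With the default weights the combination is a plain scalar weighted sum; a minimal sketch (the per-term losses are assumed to be precomputed scalars, and the function name is illustrative rather than the project's actual API):

```python
def total_loss(mse, lpips, adv, w_mse=1.0, w_lpips=0.5, w_adv=0.1):
    """L_total = λ1·L_MSE + λ2·L_LPIPS + λ3·L_ADV with the default λ values."""
    return w_mse * mse + w_lpips * lpips + w_adv * adv
```

In training, `mse`, `lpips`, and `adv` would be tensors from the respective loss modules, and the same weighted sum is backpropagated through the generator.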
Important: This project is currently in an experimental phase.
- Fine-tuning: We are experimenting with adapting a text-to-image model for video fine-tuning, a suboptimal approach that may produce unexpected results.
- Resources: Due to a lack of significant training resources (GPU time/memory), the current model outputs may not yet reach production-grade quality.
- Outputs: I have shared my latest experimental outputs in the `outputs/` folder for review.
Latest enhancement preview. See the `outputs/` folder for full video results.
- Stable Diffusion by Stability AI
- RIFE by Megvii Research
- pix2pix for PatchGAN architecture
- LPIPS for perceptual loss
MIT License - see LICENSE file for details
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For questions or issues, please open a GitHub issue.