This project provides Python implementations of minimal RNN models (MinGRU and MinLSTM) with highly optimized CPU and GPU (CUDA) versions. The implementation aims to provide an accessible interface while maintaining high performance.
Minimal RNN models simplify the standard RNN architectures by removing the dependence on the previous hidden state in the gate computations, making them more amenable to parallel processing while maintaining good performance.
The minimal GRU (MinGRU) simplifies the standard GRU by using the following equations:
- Update Gate: z_t = σ(Linear_z(x_t))
- Candidate State: h_tilde = Linear_h(x_t)
- Hidden State: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h_tilde
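The three equations above can be sketched as a single NumPy step. This is a minimal illustration rather than the package's `MinGRUCell`: the function name `min_gru_step` and the explicit weight and bias arguments are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def min_gru_step(x_t, h_prev, W_z, b_z, W_h, b_h):
    """One MinGRU step (hypothetical helper); the gate depends on x_t only."""
    z_t = sigmoid(W_z @ x_t + b_z)                # update gate
    h_tilde = W_h @ x_t + b_h                     # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde   # blend old state and candidate

# Small random parameters, chosen only for the demonstration
rng = np.random.default_rng(0)
input_size, hidden_size = 10, 20
W_z = 0.1 * rng.standard_normal((hidden_size, input_size))
W_h = 0.1 * rng.standard_normal((hidden_size, input_size))
b_z = np.zeros(hidden_size)
b_h = np.zeros(hidden_size)

x_t = rng.random(input_size)
h_prev = np.zeros(hidden_size)
h_t = min_gru_step(x_t, h_prev, W_z, b_z, W_h, b_h)
print(h_t.shape)  # (20,)
```

Because `z_t` never reads `h_prev`, all gates of a sequence can be computed up front from the inputs, which is what makes the parallel processing mode possible.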
The minimal LSTM (MinLSTM) simplifies the standard LSTM with the following equations:
- Forget Gate: f_t = σ(Linear_f(x_t))
- Input Gate: i_t = σ(Linear_i(x_t))
- Candidate State: h_tilde = Linear_h(x_t)
- Gate Normalization: f'_t = f_t / (f_t + i_t), i'_t = i_t / (f_t + i_t)
- Hidden State: h_t = f'_t ⊙ h_{t-1} + i'_t ⊙ h_tilde
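The MinLSTM equations, including the gate normalization, can be sketched the same way. The helper name `min_lstm_step` and its parameter layout are assumptions for illustration and are not taken from the package's `MinLSTMCell`.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def min_lstm_step(x_t, h_prev, W_f, b_f, W_i, b_i, W_h, b_h):
    """One MinLSTM step (hypothetical helper); all gates depend on x_t only."""
    f_t = sigmoid(W_f @ x_t + b_f)      # forget gate
    i_t = sigmoid(W_i @ x_t + b_i)      # input gate
    h_tilde = W_h @ x_t + b_h           # candidate state
    denom = f_t + i_t                   # positive for moderate pre-activations
    f_norm = f_t / denom                # normalized gates sum to 1
    i_norm = i_t / denom
    return f_norm * h_prev + i_norm * h_tilde

# Small random parameters, chosen only for the demonstration
rng = np.random.default_rng(0)
input_size, hidden_size = 10, 20
W_f, W_i, W_h = (0.1 * rng.standard_normal((hidden_size, input_size))
                 for _ in range(3))
b_f = b_i = b_h = np.zeros(hidden_size)

x_t = rng.random(input_size)
h_prev = rng.standard_normal(hidden_size)
h_t = min_lstm_step(x_t, h_prev, W_f, b_f, W_i, b_i, W_h, b_h)
```

Since the normalized gates sum to one, each component of `h_t` lies between the corresponding components of `h_prev` and `h_tilde`.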
- Pure Python implementation with NumPy for CPU processing
- Highly optimized CUDA acceleration using Numba for GPU processing
- Adaptive algorithm selection based on sequence length
- Support for both sequential and parallel processing modes
- Benchmarking tools to compare performance across different implementations
- Comprehensive documentation and examples
- Python 3.6+
- NumPy
- Numba (for CUDA acceleration)
- CUDA toolkit (for GPU acceleration)
- Matplotlib (for benchmark visualization)
- Pandas (for data manipulation)
- Tabulate (for pretty-printing benchmark results)
- Clone the repository:
git clone https://github.com/DaoudiAmir/CUDA_RNN.git
cd CUDA_RNN
- Install the required packages:
pip install -r requirements.txt

import numpy as np
from min_rnn.min_gru import MinGRUCell
from min_rnn.min_lstm import MinLSTMCell
# Create a MinGRU cell
gru_cell = MinGRUCell(input_size=10, hidden_size=20)
# Process a single input
x_t = np.random.random(10)
h_prev = np.zeros(20)
h_t = gru_cell.forward(x_t, h_prev)
# Process a sequence
x_seq = np.random.random((5, 10)) # 5 time steps, 10 features
h0 = np.zeros(20)
h_out = gru_cell.process_sequence(x_seq, h0)
# Process a sequence in parallel
h_out_parallel = gru_cell.process_sequence_parallel(x_seq, h0)

import numpy as np
from min_rnn.min_gru_cuda import MinGRUCellCUDA
from min_rnn.min_lstm_cuda import MinLSTMCellCUDA
from min_rnn.cuda_utils import CUDA_AVAILABLE

if CUDA_AVAILABLE:
    # Create a CUDA-accelerated MinGRU cell
    gru_cell = MinGRUCellCUDA(input_size=10, hidden_size=20)
    # Process a sequence with GPU acceleration
    x_seq = np.random.random((5, 10))
    h0 = np.zeros(20)
    h_out = gru_cell.process_sequence(x_seq, h0)
    # Process a sequence in parallel with GPU acceleration
    h_out_parallel = gru_cell.process_sequence_parallel(x_seq, h0)
else:
    print("CUDA is not available, falling back to CPU implementation")

# Run the demo for both models with CPU
python main.py
# Run the demo for both models with CUDA (if available)
python main.py --cuda
# Run the demo for a specific model
python main.py --model gru
python main.py --model lstm
# Run benchmarks
python benchmark_cuda_optimization.py --model both

- min_rnn/: Main package directory
  - utils.py: Utility functions (sigmoid, linear layer)
  - min_gru.py: CPU implementation of MinGRU
  - min_lstm.py: CPU implementation of MinLSTM
  - min_gru_cuda.py: CUDA-accelerated implementation of MinGRU
  - min_lstm_cuda.py: CUDA-accelerated implementation of MinLSTM
  - cuda_utils.py: CUDA kernels and utility functions
- benchmark_cuda_optimization.py: Benchmarking script for CUDA optimizations
- cuda_optimization_results.md: Detailed documentation of optimization results
- main.py: Demo script
- requirements.txt: Required packages
- README.md: This file
The CUDA implementation incorporates several optimization techniques to maximize GPU performance:
- Pinned Memory: Used for faster host-device transfers
- CUDA Streams: Implemented asynchronous operations
- Shared Memory: Leveraged GPU shared memory to reduce global memory access latency
- Adaptive Algorithm Selection:
  - Direct computation for short sequences (≤16)
  - Fused kernel approach for medium sequences (17-128)
  - Parallel scan for long sequences (>128)
- Batch Processing: Implemented kernels that process multiple elements per thread
- Kernel Fusion: Combined operations to reduce kernel launch overhead
- Optimized Grid/Block Sizes: Tuned launch configurations for better GPU occupancy
- Grid Stride Loops: Implemented to handle multiple elements per thread
- Coalesced Memory Access: Improved memory access patterns for better throughput
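The adaptive algorithm selection listed above amounts to a dispatch on sequence length. A minimal sketch follows, assuming the thresholds 16 and 128 from the list. The function name `select_strategy`, the strategy labels, and the choice to assign the boundary values 16 and 128 to the shorter range are assumptions for illustration, not the project's API.

```python
def select_strategy(seq_len, short_max=16, medium_max=128):
    """Pick a processing strategy from the sequence length (hypothetical helper)."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if seq_len <= short_max:
        return "direct"          # few steps: launch overhead dominates
    if seq_len <= medium_max:
        return "fused"           # one fused kernel over the whole sequence
    return "parallel_scan"       # long sequences: logarithmic-depth scan

for n in (8, 16, 17, 128, 129, 1024):
    print(n, select_strategy(n))
```

Keeping the thresholds as parameters makes it easy to retune them after benchmarking on a different GPU.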
Our optimized CUDA implementations show significant performance improvements, especially for longer sequences:
- Short sequences (8-64): CPU is faster due to transfer overhead
- Medium sequences (128-256): GPU achieves 2.3-4.6x speedup
- Long sequences (512-1024): GPU achieves 4.2-6.3x speedup
- Short sequences (8-16): CPU is faster due to transfer overhead
- Medium sequences (32-128): GPU performance approaches CPU (0.66-1.11x)
- Long sequences (256-1024): GPU achieves 6.2-9.4x speedup
For detailed benchmark results and analysis, see cuda_optimization_results.md.
- For optimal performance, use the appropriate implementation based on sequence length:
  - CPU implementation for very short sequences (<32)
  - GPU implementation for longer sequences (>128)
- The parallel scan algorithm provides significant speedup for long sequences
- Memory transfers between CPU and GPU can be a bottleneck for short sequences
- For optimal performance, use float32 data types and avoid unnecessary memory transfers
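To illustrate why a parallel scan applies here: the MinGRU update has the form h_t = a_t ⊙ h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t ⊙ h_tilde, and composing two such steps yields another step of the same form. The NumPy sketch below uses a Hillis-Steele style inclusive scan and checks it against the plain loop. It is a CPU illustration under these assumptions, not the project's CUDA kernel, and the function names are hypothetical.

```python
import numpy as np

def sequential_recurrence(a, b, h0):
    """Reference loop: h_t = a_t * h_{t-1} + b_t for t = 0..T-1."""
    h = h0
    out = np.empty_like(b)
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def scan_recurrence(a, b, h0):
    """Same result in about log2(T) rounds; each round is fully vectorized."""
    a, b = a.copy(), b.copy()
    b[0] = a[0] * h0 + b[0]            # fold the initial state into step 0
    offset = 1
    while offset < a.shape[0]:
        # Compose step t with the partial result ending at t - offset:
        # (a1, b1) followed by (a2, b2) gives (a2 * a1, a2 * b1 + b2)
        a_new = a[offset:] * a[:-offset]
        b_new = a[offset:] * b[:-offset] + b[offset:]
        a[offset:], b[offset:] = a_new, b_new
        offset *= 2
    return b                           # b[t] now equals h_t

rng = np.random.default_rng(0)
T, H = 200, 4
z = rng.uniform(0.05, 0.95, size=(T, H))     # stand-in for sigmoid outputs
h_tilde = rng.standard_normal((T, H))
h0 = rng.standard_normal(H)

a, b = 1.0 - z, z * h_tilde
h_seq = sequential_recurrence(a, b, h0)
h_scan = scan_recurrence(a, b, h0)
print(np.allclose(h_seq, h_scan))
```

Each round performs more total arithmetic than the loop, so the scan only pays off when the rounds run in parallel, which is consistent with the advice above to reserve it for long sequences.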
This project is licensed under the MIT License - see the LICENSE file for details.
This implementation draws inspiration from the following research paper:
Feng, L., Tung, F., Ahmed, M. O., Bengio, Y., & Hajimirsadeghi, H. (2024). Were RNNs All We Needed? arXiv preprint arXiv:2410.01201. https://arxiv.org/abs/2410.01201
Our work departs substantially from the original implementation: the paper's authors provided a PyTorch implementation, whereas this project offers a native implementation in Python with direct CUDA programming through Numba.
Key contributions of our implementation include:
- Native CUDA kernels that directly interact with the GPU, bypassing the overhead of deep learning frameworks
- Custom memory management optimizations including pinned memory and shared memory usage
- Adaptive algorithm selection based on sequence length that outperforms the original implementation for long sequences
- Comprehensive benchmarking tools and performance analysis
- Highly optimized parallel scan algorithms for processing long sequences efficiently
This project demonstrates that low-level CUDA programming can achieve superior performance compared to framework-based implementations, especially for specialized models like MinGRU and MinLSTM where fine-grained control over memory and computation is critical.





