A comprehensive collection of efficient Transformer implementations based on nanoGPT, designed for educational purposes. This repository demonstrates various optimization techniques for language models, including Flash Attention, MLP-Mixer, Mixture of Experts (MoE), Knowledge Distillation, Pruning, and Quantization.
Before getting started, make sure you have:
- Python 3.8 or higher installed on your system
- Basic understanding of Python programming
- Familiarity with neural networks (helpful but not required)
A virtual environment is an isolated Python environment that allows you to manage project-specific dependencies without affecting your system Python installation. Here's how to create one:
On Windows:

- Open Command Prompt or PowerShell: press Win + R, type `cmd` or `powershell`, and press Enter.
- Navigate to the project directory:
  ```
  cd D:\Home\Research\FnT_Tutorial\Efficient_nanoGPT
  ```
- Create a virtual environment:
  ```
  python -m venv venv
  ```
  This creates a folder named `venv` containing the virtual environment.
- Activate the virtual environment:
  ```
  venv\Scripts\activate
  ```
  You should see `(venv)` appear at the beginning of your command prompt, indicating the virtual environment is active.
- To deactivate later (when you're done working):
  ```
  deactivate
  ```
On macOS/Linux:

- Open Terminal.
- Navigate to the project directory:
  ```
  cd /path/to/Efficient_nanoGPT
  ```
- Create a virtual environment:
  ```
  python3 -m venv venv
  ```
- Activate the virtual environment:
  ```
  source venv/bin/activate
  ```
  You should see `(venv)` at the beginning of your terminal prompt.
- To deactivate later:
  ```
  deactivate
  ```
With your virtual environment activated, install the required packages:
```
pip install torch numpy
```
Note: If you have a CUDA-compatible GPU and want to use it, install PyTorch with CUDA support:
- Visit https://pytorch.org/get-started/locally/
- Select your configuration and follow the installation instructions
- Example for CUDA 11.8:
  ```
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```
Before running any of the scripts, you need to download the training data (Tiny Shakespeare dataset):
On Windows (PowerShell):
```
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" -OutFile "input.txt"
```
On macOS/Linux with wget:
```
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```
Alternative method (if wget is not available):
```
curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```
The `input.txt` file should be placed in the `Efficient_nanoGPT` directory (same folder as the Python scripts).
Each Python file in this repository demonstrates a different optimization technique or architectural variation of the Transformer model. All implementations are based on the Tiny Shakespeare dataset for character-level language modeling.
What it does: This is the baseline implementation of a GPT-style Transformer model using standard self-attention mechanisms.
Key Features:
- Implements a complete Transformer architecture with multi-head self-attention
- Character-level language model trained on Tiny Shakespeare
- Standard training loop with train/validation split
- Text generation capability after training
When to use: Start here to understand the basic Transformer architecture before exploring optimizations.
Key Concepts:
- Multi-head self-attention
- Positional embeddings
- Feed-forward networks
- Layer normalization and residual connections
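To make the core mechanism concrete, here is a minimal sketch of a single causal self-attention head in the nanoGPT style; the class and argument names (`CausalSelfAttentionHead`, `head_size`) are illustrative and may not match the names used in Base.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention, nanoGPT-style."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask so position t only attends to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Scaled dot-product attention scores: (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # (B, T, head_size)

# Quick shape check
head = CausalSelfAttentionHead(n_embd=64, head_size=16, block_size=32)
print(head(torch.randn(4, 32, 64)).shape)  # torch.Size([4, 32, 16])
```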
What it does: Replaces the standard attention implementation with PyTorch's optimized scaled_dot_product_attention function, which uses Flash Attention under the hood.
Key Features:
- Memory-efficient attention computation (O(T) instead of O(T²) memory)
- Faster training and inference, especially for longer sequences
- Automatically selects the best available attention backend (FlashAttention, Memory-Efficient Attention, or standard math)
- Identical architecture to Base.py, only the attention mechanism differs
When to use: Use this when you want better memory efficiency and speed, especially for longer sequences or when training on GPU.
Key Concepts:
- Kernel fusion
- Memory-efficient attention algorithms
- Hardware-aware computation
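As a rough sketch of the change, the manual softmax-attention computation can be replaced with a single call to `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0+); the tensor shapes below are illustrative:

```python
import torch
import torch.nn.functional as F

B, n_head, T, head_size = 4, 4, 32, 16
q = torch.randn(B, n_head, T, head_size)
k = torch.randn(B, n_head, T, head_size)
v = torch.randn(B, n_head, T, head_size)

# is_causal=True applies the lower-triangular mask internally; PyTorch picks
# the fastest available backend (FlashAttention, memory-efficient, or math).
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([4, 4, 32, 16])
```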
What it does: Replaces the self-attention mechanism with simple MLP (Multi-Layer Perceptron) layers that mix information across tokens.
Key Features:
- No self-attention - uses Token Mixing and Channel Mixing MLPs instead
- All operations are dense matrix multiplications (very hardware-efficient)
- Fixed sequence length requirement (the sequence length must equal `block_size`)
- Simpler architecture that may be faster on specialized hardware
When to use: Explore this to understand attention-free architectures and see how simple MLPs can model sequences.
Key Concepts:
- Token mixing (mixing information across sequence positions)
- Channel mixing (mixing information within each token's features)
- Hardware efficiency through dense matrix operations
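A minimal sketch of one Mixer block is shown below; for brevity it omits the causal masking a language model needs, and names such as `MixerBlock` and `hidden_mult` are illustrative rather than taken from the script:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Token mixing (across positions) followed by channel mixing (across features)."""
    def __init__(self, n_embd, block_size, hidden_mult=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        # The token-mixing MLP operates on the sequence dimension, which is
        # why the sequence length must be fixed to block_size.
        self.token_mlp = nn.Sequential(
            nn.Linear(block_size, block_size * hidden_mult),
            nn.GELU(),
            nn.Linear(block_size * hidden_mult, block_size),
        )
        self.ln2 = nn.LayerNorm(n_embd)
        self.channel_mlp = nn.Sequential(
            nn.Linear(n_embd, n_embd * hidden_mult),
            nn.GELU(),
            nn.Linear(n_embd * hidden_mult, n_embd),
        )

    def forward(self, x):                        # x: (B, T, C), T == block_size
        y = self.ln1(x).transpose(1, 2)          # (B, C, T): mix across tokens
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.ln2(x))    # mix within each token
        return x

block = MixerBlock(n_embd=64, block_size=32)
print(block(torch.randn(4, 32, 64)).shape)  # torch.Size([4, 32, 64])
```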
What it does: Replaces the standard Feed-Forward Network (FFN) with a Mixture of Experts layer, where multiple expert networks work together.
Key Features:
- Multiple expert FFN networks (default: 4 experts)
- Top-k routing (each token is sent to the top-k most relevant experts)
- Sparse activation (only k experts are used per token, not all)
- Auxiliary loss to encourage balanced usage of experts
When to use: Learn about scaling models by increasing capacity without proportional increases in computation.
Key Concepts:
- Sparse activation
- Expert routing
- Load balancing
- Scaling model capacity efficiently
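The sketch below shows the routing idea with a dense loop over experts for readability; real implementations dispatch tokens sparsely, the auxiliary load-balancing loss is omitted here, and the names (`MoEFeedForward`, `num_experts`, `top_k`) are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """FFN replaced by a mixture of experts with top-k token routing."""
    def __init__(self, n_embd, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(n_embd, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                          nn.Linear(4 * n_embd, n_embd))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (B, T, C)
        logits = self.router(x)                 # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = (idx[..., k] == e)       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = MoEFeedForward(n_embd=64)
print(moe(torch.randn(4, 32, 64)).shape)  # torch.Size([4, 32, 64])
```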
What it does: Trains a smaller "student" model to mimic a larger "teacher" model, transferring knowledge from the teacher to the student.
Key Features:
- Two-stage training: first trains a large teacher model, then a smaller student
- Uses soft labels (teacher predictions) in addition to hard labels (ground truth)
- Temperature scaling to soften probability distributions
- Blends cross-entropy loss with KL divergence loss
When to use: Use this when you want to compress a large model into a smaller one while maintaining performance.
Key Concepts:
- Teacher-student learning
- Soft labels and hard labels
- Temperature scaling
- Knowledge transfer
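Here is a minimal sketch of the blended loss, assuming the standard formulation from Hinton et al. (2015); the function name and the `T`/`alpha` defaults are illustrative, not taken from the script:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL divergence.

    T softens both distributions; the T*T factor keeps gradient magnitudes
    comparable across temperatures. alpha balances the two terms.
    """
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Toy example: batch of 8 tokens over a 65-symbol vocabulary
s = torch.randn(8, 65, requires_grad=True)
t = torch.randn(8, 65)
y = torch.randint(0, 65, (8,))
print(distillation_loss(s, t, y))
```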
What it does: Removes less important weights from a trained model by setting them to zero, creating a sparse model.
Key Features:
- One-shot magnitude pruning (removes weights with smallest absolute values)
- Global or per-layer pruning strategies
- Fine-tuning after pruning to recover accuracy
- Reports sparsity statistics before and after pruning
When to use: Use this to reduce model size and potentially speed up inference on hardware that supports sparse operations.
Key Concepts:
- Model sparsity
- Magnitude-based pruning
- Structured vs unstructured pruning
- Fine-tuning after pruning
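Below is a rough sketch of one-shot global magnitude pruning on a toy model; it zeroes weights in place rather than using `torch.nn.utils.prune`, and the fine-tuning step is omitted. The function name and sparsity target are illustrative:

```python
import torch
import torch.nn as nn

def global_magnitude_prune(model, sparsity=0.5):
    """Zero out the smallest-magnitude weights across all Linear layers.

    Global pruning computes a single threshold over all weights, so layers
    with generally small weights end up more heavily pruned.
    """
    weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
    all_vals = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(all_vals, sparsity)
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())  # apply binary mask in place

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
global_magnitude_prune(model, sparsity=0.5)
total = sum(w.numel() for w in model.parameters() if w.dim() == 2)
zeros = sum((w == 0).sum().item() for w in model.parameters() if w.dim() == 2)
print(f"sparsity: {zeros / total:.2%}")
```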
What it does: Converts model weights and activations from 32-bit floating-point (FP32) to 8-bit integers (INT8) to reduce memory usage and potentially speed up inference.
Key Features:
- SmoothQuant-style calibration to balance activation and weight ranges
- INT8 weight quantization with per-channel scaling
- INT8 activation quantization with per-tensor scaling
- Comparison metrics: model size, memory usage, inference speed, and accuracy
- Export functionality for deployment
When to use: Use this to reduce model size significantly (up to 4x compression) and speed up inference on hardware that supports INT8 operations.
Key Concepts:
- Post-training quantization (PTQ)
- SmoothQuant balancing technique
- Per-channel vs per-tensor quantization
- Calibration and scale factors
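As a simplified illustration of one piece of such a pipeline, the sketch below performs symmetric per-channel INT8 weight quantization; the SmoothQuant calibration step and activation quantization are omitted, and the function names are illustrative:

```python
import torch

def quantize_weight_per_channel(w):
    """Symmetric per-channel INT8 quantization of a Linear weight (out, in).

    Each output channel gets its own scale, which preserves accuracy better
    than a single per-tensor scale when channel magnitudes differ.
    """
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # (out, 1)
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(256, 64)
q, scale = quantize_weight_per_channel(w)
err = (dequantize(q, scale) - w).abs().max()
print(q.dtype, err)  # torch.int8, small reconstruction error
```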
Once your environment is set up and input.txt is downloaded, you can run any of the scripts like:
```
python3 Base.py
```
Note: Training will take some time depending on your hardware. The scripts will print training progress periodically and generate sample text at the end.
- Flash Attention: Reduces memory usage from O(T²) to O(T) for sequence length T, enabling longer sequences and faster training.
- MLP-Mixer: Shows that attention isn't always necessary; simple MLPs can be very efficient for sequence modeling.
- MoE: Allows creating larger models without proportionally increasing computation by activating only a subset of experts.
- Knowledge Distillation: Enables deploying smaller, faster models by learning from larger, more accurate models.
- Pruning: Reduces model size by removing redundant parameters, potentially speeding up inference.
- Quantization: Compresses models significantly (FP32 → INT8 = 4x reduction) while maintaining reasonable accuracy.
- Read the code comments: Each file is heavily commented to explain what's happening
- Start small: You can reduce `max_iters` to train faster for experimentation
- Compare results: Run Base.py first, then try the optimizations and compare the outputs
- Experiment with hyperparameters: Try changing learning rates, model sizes, etc.
- Check GPU usage: If you have a GPU, monitor its usage with `nvidia-smi` (Windows/Linux) or Activity Monitor (macOS)
- "FileNotFoundError: input.txt"
  - Make sure you've downloaded `input.txt` to the correct directory
  - Verify you're running the script from the `Efficient_nanoGPT` folder
- "ModuleNotFoundError: No module named 'torch'"
  - Make sure your virtual environment is activated
  - Install PyTorch: `pip install torch`
- Out of Memory errors
  - Reduce `batch_size` in the script
  - Reduce `block_size` (the context length)
- Slow training on CPU
  - Training on CPU is much slower than on GPU; use a GPU if one is available
  - Consider using Google Colab (free GPU) for faster experimentation
  - Reduce `max_iters` for quick tests
- PyTorch Tutorials
- The Illustrated Transformer
- Attention Is All You Need (Original Paper)
- nanoGPT (Karpathy's Original Implementation)
See LICENSE file for details.