A comprehensive collection of efficient Transformer implementations based on nanoGPT, designed for educational purposes. This repository demonstrates various optimization techniques for language models, including Flash Attention, MLP-Mixer, Mixture of Experts (MoE), Knowledge Distillation, Pruning, and Quantization.
Before getting started, make sure you have:
- Python 3.8 or higher installed on your system
- Basic understanding of Python programming
- Familiarity with neural networks (helpful but not required)
A virtual environment is an isolated Python environment that allows you to manage project-specific dependencies without affecting your system Python installation. Here's how to create one:
On Windows:

- Open Command Prompt or PowerShell: press Win + R, type `cmd` or `powershell`, and press Enter.
- Navigate to the project directory:
  ```
  cd D:\Home\Research\FnT_Tutorial\Efficient_nanoGPT
  ```
- Create a virtual environment:
  ```
  python -m venv venv
  ```
  This creates a folder named `venv` containing the virtual environment.
- Activate the virtual environment:
  ```
  venv\Scripts\activate
  ```
  You should see `(venv)` appear at the beginning of your command prompt, indicating the virtual environment is active.
- To deactivate later (when you're done working):
  ```
  deactivate
  ```
On macOS/Linux:

- Open Terminal.
- Navigate to the project directory:
  ```
  cd /path/to/Efficient_nanoGPT
  ```
- Create a virtual environment:
  ```
  python3 -m venv venv
  ```
- Activate the virtual environment:
  ```
  source venv/bin/activate
  ```
  You should see `(venv)` at the beginning of your terminal prompt.
- To deactivate later:
  ```
  deactivate
  ```
With your virtual environment activated, install the required packages:
```
pip install torch numpy
```
Note: If you have a CUDA-compatible GPU and want to use it, install PyTorch with CUDA support:
- Visit https://pytorch.org/get-started/locally/
- Select your configuration and follow the installation instructions
- Example for CUDA 11.8:
  ```
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```
Before running any of the scripts, you need to download the training data (Tiny Shakespeare dataset):
On Windows (PowerShell):
```
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" -OutFile "input.txt"
```
On macOS/Linux with wget:
```
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```
Alternative method (if wget is not available):
```
curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```
The `input.txt` file should be placed in the `Efficient_nanoGPT` directory (same folder as the Python scripts).
Each Python file in this repository demonstrates a different optimization technique or architectural variation of the Transformer model. All implementations are based on the Tiny Shakespeare dataset for character-level language modeling.
What it does: This is the baseline implementation of a GPT-style Transformer model using standard self-attention mechanisms.
Key Features:
- Implements a complete Transformer architecture with multi-head self-attention
- Character-level language model trained on Tiny Shakespeare
- Standard training loop with train/validation split
- Text generation capability after training
When to use: Start here to understand the basic Transformer architecture before exploring optimizations.
Key Concepts:
- Multi-head self-attention
- Positional embeddings
- Feed-forward networks
- Layer normalization and residual connections
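To make the core mechanism concrete, here is a minimal sketch of a single causal self-attention head in the nanoGPT style; the class and argument names (`CausalSelfAttentionHead`, `head_size`) are illustrative and may not match the names used in Base.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention, nanoGPT-style."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask so position t only attends to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Scaled dot-product attention scores: (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # (B, T, head_size)

# Quick shape check
head = CausalSelfAttentionHead(n_embd=64, head_size=16, block_size=32)
print(head(torch.randn(4, 32, 64)).shape)  # torch.Size([4, 32, 16])
```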
What it does: Replaces the standard attention implementation with PyTorch's optimized scaled_dot_product_attention function, which uses Flash Attention under the hood.
Key Features:
- Memory-efficient attention computation (O(T) instead of O(T²) memory)
- Faster training and inference, especially for longer sequences
- Automatically selects the best available attention backend (FlashAttention, Memory-Efficient Attention, or standard math)
- Identical architecture to Base.py, only the attention mechanism differs
When to use: Use this when you want better memory efficiency and speed, especially for longer sequences or when training on GPU.
Key Concepts:
- Kernel fusion
- Memory-efficient attention algorithms
- Hardware-aware computation
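As a rough sketch of the change, the manual softmax-attention computation can be replaced with a single call to `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0+); the tensor shapes below are illustrative:

```python
import torch
import torch.nn.functional as F

B, n_head, T, head_size = 4, 4, 32, 16
q = torch.randn(B, n_head, T, head_size)
k = torch.randn(B, n_head, T, head_size)
v = torch.randn(B, n_head, T, head_size)

# is_causal=True applies the lower-triangular mask internally; PyTorch picks
# the fastest available backend (FlashAttention, memory-efficient, or math).
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([4, 4, 32, 16])
```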
What it does: Replaces the self-attention mechanism with simple MLP (Multi-Layer Perceptron) layers that mix information across tokens.
Key Features:
- No self-attention - uses Token Mixing and Channel Mixing MLPs instead
- All operations are dense matrix multiplications (very hardware-efficient)
- Fixed sequence length requirement (the sequence length must equal `block_size`)
- Simpler architecture that may be faster on specialized hardware
When to use: Explore this to understand attention-free architectures and see how simple MLPs can model sequences.
Key Concepts:
- Token mixing (mixing information across sequence positions)
- Channel mixing (mixing information within each token's features)
- Hardware efficiency through dense matrix operations
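A minimal sketch of one Mixer block is shown below; for brevity it omits the causal masking a language model needs, and names such as `MixerBlock` and `hidden_mult` are illustrative rather than taken from the script:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Token mixing (across positions) followed by channel mixing (across features)."""
    def __init__(self, n_embd, block_size, hidden_mult=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        # The token-mixing MLP operates on the sequence dimension, which is
        # why the sequence length must be fixed to block_size.
        self.token_mlp = nn.Sequential(
            nn.Linear(block_size, block_size * hidden_mult),
            nn.GELU(),
            nn.Linear(block_size * hidden_mult, block_size),
        )
        self.ln2 = nn.LayerNorm(n_embd)
        self.channel_mlp = nn.Sequential(
            nn.Linear(n_embd, n_embd * hidden_mult),
            nn.GELU(),
            nn.Linear(n_embd * hidden_mult, n_embd),
        )

    def forward(self, x):                        # x: (B, T, C), T == block_size
        y = self.ln1(x).transpose(1, 2)          # (B, C, T): mix across tokens
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.ln2(x))    # mix within each token
        return x

block = MixerBlock(n_embd=64, block_size=32)
print(block(torch.randn(4, 32, 64)).shape)  # torch.Size([4, 32, 64])
```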
What it does: Replaces the standard Feed-Forward Network (FFN) with a Mixture of Experts layer, where multiple expert networks work together.
Key Features:
- Multiple expert FFN networks (default: 4 experts)
- Top-k routing (each token is sent to the top-k most relevant experts)
- Sparse activation (only k experts are used per token, not all)
- Auxiliary loss to encourage balanced usage of experts
When to use: Learn about scaling models by increasing capacity without proportional increases in computation.
Key Concepts:
- Sparse activation
- Expert routing
- Load balancing
- Scaling model capacity efficiently
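The sketch below shows the routing idea with a dense loop over experts for readability; real implementations dispatch tokens sparsely, the auxiliary load-balancing loss is omitted here, and the names (`MoEFeedForward`, `num_experts`, `top_k`) are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """FFN replaced by a mixture of experts with top-k token routing."""
    def __init__(self, n_embd, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(n_embd, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                          nn.Linear(4 * n_embd, n_embd))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (B, T, C)
        logits = self.router(x)                 # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = (idx[..., k] == e)       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = MoEFeedForward(n_embd=64)
print(moe(torch.randn(4, 32, 64)).shape)  # torch.Size([4, 32, 64])
```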
What it does: Trains a smaller "student" model to mimic a larger "teacher" model, transferring knowledge from the teacher to the student.
Key Features:
- Two-stage training: first trains a large teacher model, then a smaller student
- Uses soft labels (teacher predictions) in addition to hard labels (ground truth)
- Temperature scaling to soften probability distributions
- Blends cross-entropy loss with KL divergence loss
When to use: Use this when you want to compress a large model into a smaller one while maintaining performance.
Key Concepts:
- Teacher-student learning
- Soft labels and hard labels
- Temperature scaling
- Knowledge transfer
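Here is a minimal sketch of the blended loss, assuming the standard formulation from Hinton et al. (2015); the function name and the `T`/`alpha` defaults are illustrative, not taken from the script:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL divergence.

    T softens both distributions; the T*T factor keeps gradient magnitudes
    comparable across temperatures. alpha balances the two terms.
    """
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Toy example: batch of 8 tokens over a 65-symbol vocabulary
s = torch.randn(8, 65, requires_grad=True)
t = torch.randn(8, 65)
y = torch.randint(0, 65, (8,))
print(distillation_loss(s, t, y))
```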
What it does: Removes less important weights from a trained model by setting them to zero, creating a sparse model.
Key Features:
- One-shot magnitude pruning (removes weights with smallest absolute values)
- Global or per-layer pruning strategies
- Fine-tuning after pruning to recover accuracy
- Reports sparsity statistics before and after pruning
When to use: Use this to reduce model size and potentially speed up inference on hardware that supports sparse operations.
Key Concepts:
- Model sparsity
- Magnitude-based pruning
- Structured vs unstructured pruning
- Fine-tuning after pruning
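Below is a rough sketch of one-shot global magnitude pruning on a toy model; it zeroes weights in place rather than using `torch.nn.utils.prune`, and the fine-tuning step is omitted. The function name and sparsity target are illustrative:

```python
import torch
import torch.nn as nn

def global_magnitude_prune(model, sparsity=0.5):
    """Zero out the smallest-magnitude weights across all Linear layers.

    Global pruning computes a single threshold over all weights, so layers
    with generally small weights end up more heavily pruned.
    """
    weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
    all_vals = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(all_vals, sparsity)
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())  # apply binary mask in place

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
global_magnitude_prune(model, sparsity=0.5)
total = sum(w.numel() for w in model.parameters() if w.dim() == 2)
zeros = sum((w == 0).sum().item() for w in model.parameters() if w.dim() == 2)
print(f"sparsity: {zeros / total:.2%}")
```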
What it does: Converts model weights and activations from 32-bit floating-point (FP32) to 8-bit integers (INT8) to reduce memory usage and potentially speed up inference.
Key Features:
- SmoothQuant-style calibration to balance activation and weight ranges
- INT8 weight quantization with per-channel scaling
- INT8 activation quantization with per-tensor scaling
- Comparison metrics: model size, memory usage, inference speed, and accuracy
- Export functionality for deployment
When to use: Use this to reduce model size significantly (up to 4x compression) and speed up inference on hardware that supports INT8 operations.
Key Concepts:
- Post-training quantization (PTQ)
- SmoothQuant balancing technique
- Per-channel vs per-tensor quantization
- Calibration and scale factors
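As a simplified illustration of one piece of such a pipeline, the sketch below performs symmetric per-channel INT8 weight quantization; the SmoothQuant calibration step and activation quantization are omitted, and the function names are illustrative:

```python
import torch

def quantize_weight_per_channel(w):
    """Symmetric per-channel INT8 quantization of a Linear weight (out, in).

    Each output channel gets its own scale, which preserves accuracy better
    than a single per-tensor scale when channel magnitudes differ.
    """
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # (out, 1)
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(256, 64)
q, scale = quantize_weight_per_channel(w)
err = (dequantize(q, scale) - w).abs().max()
print(q.dtype, err)  # torch.int8, small reconstruction error
```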
Once your environment is set up and input.txt is downloaded, you can run any of the scripts like:
```
python3 Base.py
```
Note: Training will take some time depending on your hardware. The scripts will print training progress periodically and generate sample text at the end.
- Flash Attention: Reduces memory usage from O(T²) to O(T) for sequence length T, enabling longer sequences and faster training.
- MLP-Mixer: Shows that attention isn't always necessary; simple MLPs can be very efficient for sequence modeling.
- MoE: Allows creating larger models without proportionally increasing computation by activating only a subset of experts.
- Knowledge Distillation: Enables deploying smaller, faster models by learning from larger, more accurate models.
- Pruning: Reduces model size by removing redundant parameters, potentially speeding up inference.
- Quantization: Compresses models significantly (FP32 → INT8 = 4x reduction) while maintaining reasonable accuracy.
- Read the code comments: Each file is heavily commented to explain what's happening
- Start small: You can reduce `max_iters` to train faster for experimentation
- Compare results: Run Base.py first, then try the optimizations and compare the outputs
- Experiment with hyperparameters: Try changing learning rates, model sizes, etc.
- Check GPU usage: If you have a GPU, monitor its usage with `nvidia-smi` (Windows/Linux) or Activity Monitor (macOS)
- "FileNotFoundError: input.txt"
  - Make sure you've downloaded `input.txt` to the correct directory
  - Verify you're running the script from the `Efficient_nanoGPT` folder
- "ModuleNotFoundError: No module named 'torch'"
  - Make sure your virtual environment is activated
  - Install PyTorch: `pip install torch`
- Out of Memory errors
  - Reduce `batch_size` in the script
  - Reduce `block_size` (the context length)
- Slow training on CPU
  - Training on CPU is much slower than on GPU; use a GPU if one is available
  - Consider using Google Colab (free GPU) for faster experimentation
  - Reduce `max_iters` for quick tests
- PyTorch Tutorials
- The Illustrated Transformer
- Attention Is All You Need (Original Paper)
- nanoGPT (Karpathy's Original Implementation)
See LICENSE file for details.