Efficient_nanoGPT

A comprehensive collection of efficient Transformer implementations based on nanoGPT, designed for educational purposes. This repository demonstrates various optimization techniques for language models, including Flash Attention, MLP-Mixer, Mixture of Experts (MoE), Knowledge Distillation, Pruning, and Quantization.

Table of Contents

  1. Prerequisites
  2. Setup Guide
  3. Code Overview
  4. Running the Code
  5. Understanding the Concepts
  6. Troubleshooting

Prerequisites

Before getting started, make sure you have:

  • Python 3.8 or higher installed on your system
  • Basic understanding of Python programming
  • Familiarity with neural networks (helpful but not required)

Setup Guide

Creating a Python Virtual Environment

A virtual environment is an isolated Python environment that allows you to manage project-specific dependencies without affecting your system Python installation. Here's how to create one:

On Windows:

  1. Open Command Prompt or PowerShell

    • Press Win + R, type cmd or powershell, and press Enter
  2. Navigate to the project directory

    cd path\to\Efficient_nanoGPT
  3. Create a virtual environment

    python -m venv venv

    This creates a folder named venv containing the virtual environment.

  4. Activate the virtual environment

    venv\Scripts\activate

    You should see (venv) appear at the beginning of your command prompt, indicating the virtual environment is active.

  5. To deactivate later (when you're done working):

    deactivate

On macOS/Linux:

  1. Open Terminal

  2. Navigate to the project directory

    cd /path/to/Efficient_nanoGPT
  3. Create a virtual environment

    python3 -m venv venv
  4. Activate the virtual environment

    source venv/bin/activate

    You should see (venv) at the beginning of your terminal prompt.

  5. To deactivate later:

    deactivate

Installing Dependencies

With your virtual environment activated, install the required packages:

pip install torch numpy

Note: If you have a CUDA-compatible GPU and want to use it, install PyTorch with CUDA support:

  • Visit https://pytorch.org/get-started/locally/
  • Select your configuration and follow the installation instructions
  • Example for CUDA 11.8:
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Downloading Data

Before running any of the scripts, you need to download the training data (Tiny Shakespeare dataset):

On Windows (PowerShell):

Invoke-WebRequest -Uri "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" -OutFile "input.txt"

On macOS/Linux or with wget:

wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Alternative method (if wget is not available):

curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

The input.txt file should be placed in the Efficient_nanoGPT directory (same folder as the Python scripts).


Code Overview

Each Python file in this repository demonstrates a different optimization technique or architectural variation of the Transformer model. All implementations are based on the Tiny Shakespeare dataset for character-level language modeling.

Base.py - Standard Transformer

What it does: This is the baseline implementation of a GPT-style Transformer model using standard self-attention mechanisms.

Key Features:

  • Implements a complete Transformer architecture with multi-head self-attention
  • Character-level language model trained on Tiny Shakespeare
  • Standard training loop with train/validation split
  • Text generation capability after training

When to use: Start here to understand the basic Transformer architecture before exploring optimizations.

Key Concepts:

  • Multi-head self-attention
  • Positional embeddings
  • Feed-forward networks
  • Layer normalization and residual connections
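
To make the mechanism concrete, here is a minimal single-head sketch of causal self-attention; the names and sizes below are illustrative, not taken from Base.py:

import math
import torch
import torch.nn.functional as F

B, T, C = 2, 8, 32                                     # batch, sequence length, embedding dim
x = torch.randn(B, T, C)

# Project the input into queries, keys, and values (one head for clarity)
Wq, Wk, Wv = (torch.nn.Linear(C, C, bias=False) for _ in range(3))
q, k, v = Wq(x), Wk(x), Wv(x)

# Scaled dot-product scores, masked so each position only attends to the past
att = (q @ k.transpose(-2, -1)) / math.sqrt(C)         # (B, T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float('-inf'))
out = F.softmax(att, dim=-1) @ v                       # (B, T, C)

The multi-head version runs several such heads in parallel and concatenates their outputs before the feed-forward network.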

FlashAttn.py - Flash Attention Optimization

What it does: Replaces the standard attention implementation with PyTorch's optimized scaled_dot_product_attention function, which uses Flash Attention under the hood.

Key Features:

  • Memory-efficient attention computation (O(T) instead of O(T²) memory)
  • Faster training and inference, especially for longer sequences
  • Automatically selects the best available attention backend (FlashAttention, Memory-Efficient Attention, or standard math)
  • Identical architecture to Base.py, only the attention mechanism differs

When to use: Use this when you want better memory efficiency and speed, especially for longer sequences or when training on GPU.

Key Concepts:

  • Kernel fusion
  • Memory-efficient attention algorithms
  • Hardware-aware computation
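
A minimal sketch of the swap this file makes: the explicit attention computation above collapses into a single fused call (shapes are illustrative):

import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 128, 16                 # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# is_causal=True applies the autoregressive mask inside the fused kernel,
# so the full (T, T) attention matrix is never materialized
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # (B, H, T, D)

PyTorch dispatches this call to the best backend available for your hardware.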

MLPMixer.py - MLP-Mixer Architecture

What it does: Replaces the self-attention mechanism with simple MLP (Multi-Layer Perceptron) layers that mix information across tokens.

Key Features:

  • No self-attention - uses Token Mixing and Channel Mixing MLPs instead
  • All operations are dense matrix multiplications (very hardware-efficient)
  • Fixed sequence length requirement (must equal block_size)
  • Simpler architecture that may be faster on specialized hardware

When to use: Explore this to understand attention-free architectures and see how simple MLPs can model sequences.

Key Concepts:

  • Token mixing (mixing information across sequence positions)
  • Channel mixing (mixing information within each token's features)
  • Hardware efficiency through dense matrix operations
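
A minimal sketch of one Mixer block (sizes are illustrative, and this sketch ignores the causal masking a language model needs):

import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, T, C, hidden=128):
        super().__init__()
        self.norm1 = nn.LayerNorm(C)
        # Token mixing: an MLP applied across the (fixed-length) sequence dimension
        self.token_mlp = nn.Sequential(nn.Linear(T, hidden), nn.GELU(), nn.Linear(hidden, T))
        self.norm2 = nn.LayerNorm(C)
        # Channel mixing: an MLP applied within each token's feature dimension
        self.chan_mlp = nn.Sequential(nn.Linear(C, hidden), nn.GELU(), nn.Linear(hidden, C))

    def forward(self, x):                                 # x: (B, T, C)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.chan_mlp(self.norm2(x))
        return x

print(MixerBlock(T=16, C=32)(torch.randn(2, 16, 32)).shape)   # torch.Size([2, 16, 32])

Because the token-mixing MLP operates on a fixed sequence dimension T, the sequence length is baked into the weights, which is why the input must always equal block_size.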

MoE.py - Mixture of Experts

What it does: Replaces the standard Feed-Forward Network (FFN) with a Mixture of Experts layer, where multiple expert networks work together.

Key Features:

  • Multiple expert FFN networks (default: 4 experts)
  • Top-k routing (each token is sent to the top-k most relevant experts)
  • Sparse activation (only k experts are used per token, not all)
  • Auxiliary loss to encourage balanced usage of experts

When to use: Learn about scaling models by increasing capacity without proportional increases in computation.

Key Concepts:

  • Sparse activation
  • Expert routing
  • Load balancing
  • Scaling model capacity efficiently
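
A minimal sketch of top-k routing over a batch of token vectors; the expert count, k, and sizes are illustrative rather than the values in MoE.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

N, C, E, k = 10, 32, 4, 2                       # tokens, embedding dim, experts, top-k
x = torch.randn(N, C)
experts = nn.ModuleList(nn.Sequential(nn.Linear(C, 4 * C), nn.ReLU(), nn.Linear(4 * C, C)) for _ in range(E))
router = nn.Linear(C, E)

scores, idx = router(x).topk(k, dim=-1)         # each token picks its k best experts
weights = F.softmax(scores, dim=-1)             # renormalize over the chosen experts

out = torch.zeros_like(x)
for e in range(E):                              # only the selected experts do any work
    token_ids, slot = (idx == e).nonzero(as_tuple=True)
    if token_ids.numel() > 0:
        out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * experts[e](x[token_ids])

The auxiliary load-balancing loss mentioned above would additionally penalize a router that sends most tokens to the same few experts.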

Distillation.py - Knowledge Distillation

What it does: Trains a smaller "student" model to mimic a larger "teacher" model, transferring knowledge from the teacher to the student.

Key Features:

  • Two-stage training: first trains a large teacher model, then a smaller student
  • Uses soft labels (teacher predictions) in addition to hard labels (ground truth)
  • Temperature scaling to soften probability distributions
  • Blends cross-entropy loss with KL divergence loss

When to use: Use this when you want to compress a large model into a smaller one while maintaining performance.

Key Concepts:

  • Teacher-student learning
  • Soft labels and hard labels
  • Temperature scaling
  • Knowledge transfer
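
A minimal sketch of the blended loss; alpha and the temperature tau are illustrative choices, not necessarily those used in Distillation.py:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, tau=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the ground truth
    ce = F.cross_entropy(student_logits, targets)
    # Soft-label term: KL divergence between temperature-softened distributions;
    # the tau**2 factor keeps gradient magnitudes comparable across temperatures
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        log_target=True, reduction='batchmean',
    ) * tau ** 2
    return alpha * ce + (1 - alpha) * kl

student = torch.randn(8, 65)                    # (batch, vocab) logits
teacher = torch.randn(8, 65)
targets = torch.randint(0, 65, (8,))
print(distillation_loss(student, teacher, targets))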

Pruning.py - Weight Pruning

What it does: Removes less important weights from a trained model by setting them to zero, creating a sparse model.

Key Features:

  • One-shot magnitude pruning (removes weights with smallest absolute values)
  • Global or per-layer pruning strategies
  • Fine-tuning after pruning to recover accuracy
  • Reports sparsity statistics before and after pruning

When to use: Use this to reduce model size and potentially speed up inference on hardware that supports sparse operations.

Key Concepts:

  • Model sparsity
  • Magnitude-based pruning
  • Structured vs unstructured pruning
  • Fine-tuning after pruning
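
A minimal sketch of global magnitude pruning over all Linear layers; the model and sparsity level are illustrative:

import torch
import torch.nn as nn

def global_magnitude_prune(model, sparsity=0.5):
    weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
    # One global threshold: the smallest-magnitude weights anywhere are removed
    all_w = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(all_w, sparsity)
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())       # zero out pruned weights

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
global_magnitude_prune(model)
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
zeros = sum((m.weight == 0).sum().item() for m in model.modules() if isinstance(m, nn.Linear))
print(f"sparsity: {zeros / total:.2%}")

During the fine-tuning stage, the pruning mask is typically reapplied (or gradients are masked) so that pruned weights stay at zero.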

Quantization.py - INT8 Quantization

What it does: Converts model weights and activations from 32-bit floating-point (FP32) to 8-bit integers (INT8) to reduce memory usage and potentially speed up inference.

Key Features:

  • SmoothQuant-style calibration to balance activation and weight ranges
  • INT8 weight quantization with per-channel scaling
  • INT8 activation quantization with per-tensor scaling
  • Comparison metrics: model size, memory usage, inference speed, and accuracy
  • Export functionality for deployment

When to use: Use this to reduce model size significantly (up to 4x compression) and speed up inference on hardware that supports INT8 operations.

Key Concepts:

  • Post-training quantization (PTQ)
  • SmoothQuant balancing technique
  • Per-channel vs per-tensor quantization
  • Calibration and scale factors
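
A minimal sketch of symmetric per-channel INT8 weight quantization and its round-trip error; this shows the basic quantize/dequantize step, not the SmoothQuant calibration itself:

import torch

def quantize_per_channel(w):                            # w: (out_features, in_features)
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per output channel
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(64, 32)
q, scale = quantize_per_channel(w)
print("max abs error:", (w - dequantize(q, scale)).abs().max().item())
print("bytes: fp32", w.numel() * 4, "-> int8", q.numel())   # roughly 4x smaller

Per-channel scales keep output channels with very different magnitudes accurate, while activations typically use a single per-tensor scale; balancing the two ranges is exactly what the SmoothQuant calibration is for.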

Running the Code

Once your environment is set up and input.txt has been downloaded, you can run any of the scripts, for example:

python3 Base.py

(On Windows, use python instead of python3.)

Note: Training will take some time depending on your hardware. The scripts will print training progress periodically and generate sample text at the end.


Understanding the Concepts

Why These Optimizations Matter

  1. Flash Attention: Reduces memory usage from O(T²) to O(T) for sequence length T, enabling longer sequences and faster training.

  2. MLP-Mixer: Shows that attention isn't always necessary - simple MLPs can be very efficient for sequence modeling.

  3. MoE: Allows creating larger models without proportionally increasing computation by activating only a subset of experts.

  4. Knowledge Distillation: Enables deploying smaller, faster models by learning from larger, more accurate models.

  5. Pruning: Reduces model size by removing redundant parameters, potentially speeding up inference.

  6. Quantization: Compresses models significantly (FP32 → INT8 = 4x reduction) while maintaining reasonable accuracy.

Tips for Beginners

  • Read the code comments: Each file is heavily commented to explain what's happening
  • Start small: You can reduce max_iters to train faster for experimentation
  • Compare results: Run Base.py first, then try optimizations and compare the outputs
  • Experiment with hyperparameters: Try changing learning rates, model sizes, etc.
  • Check GPU usage: If you have a GPU, monitor its usage with nvidia-smi (Windows/Linux) or Activity Monitor (macOS)

Troubleshooting

Common Issues

  1. "FileNotFoundError: input.txt"

    • Make sure you've downloaded input.txt to the correct directory
    • Verify you're running the script from the Efficient_nanoGPT folder
  2. "ModuleNotFoundError: No module named 'torch'"

    • Make sure your virtual environment is activated
    • Install PyTorch: pip install torch
  3. Out of Memory errors

    • Reduce batch_size in the script
  4. Slow training on CPU

    • Training on CPU is much slower than on a GPU, so use a GPU if one is available
    • Consider using Google Colab (free GPU) for faster experimentation
    • Reduce max_iters for quick tests

License

See LICENSE file for details.
