A Super-Friendly Tutorial for Curious Learners, Students, and Practitioners
Author: Rimom Costa
This is not your typical AI tutorial. It is a complete, from-first-principles guide to understanding transformers, the architecture behind ChatGPT, Claude, DALL-E, and many of today's most prominent AI systems.
What sets this apart:
- Hand-calculable examples - Uses tiny numbers (6 dimensions instead of the 12,288 used by the largest GPT-3 model) so you can verify EVERY calculation by hand
- Accessible without being shallow - Explains concepts clearly enough for beginners while staying useful for serious technical readers
- Complete coverage - From tokenization to training and ChatGPT-style assistant alignment (including RLHF)
- No magic boxes - Every formula explained, every design choice justified
- Intuitive learning - Analogies, examples, and plain-language explanations throughout
- Career-focused - Teaches what you'll ACTUALLY do in industry (hint: not training from scratch!)
After this course, you'll understand:
- How ChatGPT actually works under the hood
- Why transformers revolutionized AI
- The complete training pipeline (pre-training → fine-tuning → RLHF)
- How to implement transformers from scratch
- What makes GPT different from BERT and T5
Click on any course section to jump directly to that part of the course!
- Math Symbols Quick Reference: Your Decoder Ring - The notation guide to read before the course
- Introduction: What Are Transformers Really? - The big-picture starting point
- Chapter 0: The Grand Vision - What problem are we solving?
- Chapter 1: Building Our Vocabulary - The token dictionary
- Chapter 2: Tokenization - Byte-Pair Encoding (BPE) explained
- Chapter 3: Embeddings - Giving numbers meaning
- Chapter 4: Positional Encoding - Teaching word order with sine waves
- Part 1: Understanding the Core Problem
- Part 2: Failed Attempts (Learning from Mistakes)
- Part 3: The Breakthrough — Understanding Waves
- Part 4: Building the Solution Step-by-Step
- Part 5: Calculating Position Encodings (Hands-On!)
- Part 6: Combining Position with Word Meaning
- Part 7: Critical Question About Training
- Part 8: Why This Solution Is Beautiful
- Part 9: Summary and Key Takeaways
- Chapter 5: Multi-Head Self-Attention - The heart of transformers
- Chapter 6: Dropout - The training safety net
- Chapter 7: Feed-Forward Network - Individual word processing
- Chapter 8: Residual Connections & Layer Normalization - Gradient highways
- Chapter 9: Stacking Transformer Blocks - Building depth
- Chapter 10: The Output Head - Predicting the next token
- Chapter 11: Training the Transformer - Backpropagation, loss functions, optimizers
- Chapter 12: Causal Masking - Preventing the model from cheating
- Chapter 13: Inference - Using the trained model (with KV cache optimization!)
- Chapter 14: All the Hyperparameters - The complete control panel
- Chapter 15: Additional Techniques - Gradient accumulation, mixed precision, checkpointing
- Chapter 16: Common Training Problems & Solutions - Debugging guide
- Chapter 17: Putting It All Together - Complete end-to-end example
- Chapter 18: From Language Model to ChatGPT - The three training stages (pre-training, fine-tuning, RLHF)
- Chapter 19: Three Transformer Architectures - Understanding encoder vs decoder vs encoder-decoder
- Part 1: Understanding the Three Architectures
- Part 2: Key Differences Explained
- Part 3: Which One Should You Use?
- Part 4: What Makes Decoder-Only Special (What You Learned)
- Part 5: The Missing Piece (Cross-Attention in Encoder-Decoder)
- Part 6: What You Learned vs What Exists
- Part 7: Modern Landscape (What's Actually Used)
- Part 8: Summary - What You Actually Know
- Chapter 20: Quick Quizzes - Test your understanding
- Chapter 21: Going Further - Next steps and resources
- Complete beginners who want to understand AI from first principles
- Software engineers transitioning into ML/AI
- Students learning about transformers in courses
- Researchers who want to understand the fundamentals deeply
- Educators looking for teaching materials
- Curious minds who want to know how ChatGPT actually works
- Minimal: Basic arithmetic (addition, multiplication)
- Helpful but not required: High school algebra, basic Python
- Not required: Advanced calculus, linear algebra, or ML experience
The tutorial builds everything from the ground up, explaining even the math notation!
After completing this tutorial, you will:
- ✅ Understand how words become numbers (embeddings)
- ✅ Grasp how transformers know word order (positional encoding)
- ✅ Master self-attention and why it's revolutionary
- ✅ Understand the complete training process (loss, gradients, backpropagation)
- ✅ Know the difference between pre-training and fine-tuning
- ✅ Understand how ChatGPT-style assistants differ from base GPT-style language models
- ✅ Be able to implement a transformer from scratch
- ✅ Read and understand modern AI research papers
- ✅ Debug common training issues
- ✅ Know what you'll actually do in an AI/ML career
- Clone this repository

      git clone git@github.com:rimomcosta/Building_a_Transformer.git
      cd Building_a_Transformer

- Read the course
  - Start with 00-introduction.md, then read the chapter files in order
  - Grab paper and pencil to follow along with calculations
  - Take your time: understanding is more important than speed!
- Try the calculations yourself
  - Don't just read: actually calculate the examples
  - The numbers are small enough to do by hand
  - This is where real understanding happens!
Every concept aims to do both jobs at once:
- Build intuition with concrete analogies and plain language
- Preserve the real mathematics and implementation logic
- Avoid splitting into separate "simple" and "technical" versions
Unlike most tutorials that use production-scale dimensions:
- Typical tutorial: "Imagine a 768-dimensional vector..." (impossible to calculate)
- This tutorial: "Here's a 6-dimensional vector: [0.2, -0.1, 0.5, 0.3, -0.4, 0.1]" (you can verify every step!)
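To make the "verify every step" idea concrete, here is a minimal Python sketch that checks one hand calculation on the 6-dimensional vector quoted above. Choosing the sum of squares as the calculation, and the names `vector`, `hand_result`, and `matches`, are illustrative assumptions rather than something the course prescribes.

```python
import math

# The 6-dimensional example vector quoted above.
vector = [0.2, -0.1, 0.5, 0.3, -0.4, 0.1]

# Hand calculation of the sum of squares (illustrative choice of exercise):
# 0.04 + 0.01 + 0.25 + 0.09 + 0.16 + 0.01 = 0.56
hand_result = 0.56

# The same calculation done by the computer.
computed = sum(x * x for x in vector)

# Compare with a tolerance, because floating-point sums are rarely exact.
matches = math.isclose(computed, hand_result, rel_tol=1e-9)
print(f"computed={computed:.4f}, matches hand result: {matches}")
```

The same pattern (work it out on paper, then compare with a tolerance) should carry over to any of the small worked examples in the chapters.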
Most guides only show inference (using a trained model). This tutorial covers:
- How the model learns (training)
- How it predicts (inference)
- How to optimize for production (KV cache)
Explains what you'll ACTUALLY do in industry:
- You DON'T train frontier-scale models from scratch (budgets start around $10M and go much higher, depending on the system)
- You DO fine-tune existing models ($100-$1000 budget) or use other efficient adaptation techniques when appropriate
- Understanding the difference is critical for career success!
- Covers decoder-only transformers (GPT-style), the most common foundation behind modern LLMs
- Explains the core training stages behind ChatGPT-style assistants
- Includes recent optimizations (KV cache, flash attention concepts)
First off, thank you for considering contributing to this project!
This tutorial aims to be the most accessible and comprehensive transformer guide ever created. Every contribution—whether it's fixing a typo, clarifying an explanation, or creating interactive examples—helps make AI education more accessible.
- Reading flow: Improve navigation, section order, and transitions
- Accessibility: Make explanations easier to follow for different learning backgrounds
- Interactive examples: Build optional demos that reinforce the written course
- Web Design: Help create a beautiful, accessible website
- UI/UX: Improve the study experience
- PyTorch Implementation: Complete, commented implementation matching the tutorial
- Jupyter Notebooks: Interactive notebooks with step-by-step execution
- Visualization Tools: Interactive demos of attention, embeddings, etc.
- Web Demos: Browser-based implementations
- Testing Frameworks: Tools to verify calculations
- Proofreading: Fix typos, grammar, and clarity issues
- Additional Examples: Add more worked examples
- Analogies: Suggest better analogies for difficult concepts
- Exercises: Create practice problems with solutions
- Quizzes: Interactive self-assessment tools
- Translations: Translate to other languages
- API Documentation: If code is added, document it thoroughly
- Setup Guides: Help others get started with implementations
- Troubleshooting: Document common issues and solutions
- FAQ: Add frequently asked questions
- Beta Testing: Work through the tutorial and report issues
- Accuracy Review: Verify mathematical correctness
- Pedagogical Review: Test with diverse audiences
- Accessibility Review: Ensure content is accessible to all
- Check existing issues: Someone might already be working on it
- Open an issue: Discuss major changes before investing time
- Read the license: Understand the licensing terms
- Keep the tone: Maintain the friendly, accessible style
- Fork the repository

      git clone git@github.com:YOUR_USERNAME/Building_a_Transformer.git
      cd Building_a_Transformer

- Create a branch

      git checkout -b feature/your-feature-name
      # or
      git checkout -b fix/your-bug-fix

- Make your changes
  - Write clear, descriptive commit messages
  - Keep changes focused (one feature/fix per PR)
  - Test your changes thoroughly
- Commit your changes

      git add .
      git commit -m "Add: brief description of changes"

- Push to your fork

      git push origin feature/your-feature-name

- Open a Pull Request
  - Provide a clear description of changes
  - Reference any related issues
  - Explain why the change is valuable
Use clear, descriptive commit messages:

    Add: new section on flash attention
    Fix: typo in chapter 5, paragraph 3
    Improve: clarity of attention mechanism explanation
    Update: explanation for multi-head attention
    Docs: add setup instructions for PyTorch

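If you want to check a message before committing, the convention above can be sketched as a small Python helper. This is only an illustration: the function name `is_valid_commit_message` is hypothetical, and treating the five prefixes from the examples as the complete allowed set is an assumption, since the project does not state an exhaustive list.

```python
import re

# Prefixes taken from the example messages above. Treating them as the
# complete allowed set is an assumption made for this sketch.
ALLOWED_PREFIXES = ("Add", "Fix", "Improve", "Update", "Docs")

# Expected shape of the first line: "<Prefix>: <non-empty description>"
_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z]+): (?P<description>\S.*)$")


def is_valid_commit_message(message: str) -> bool:
    """Return True if the first line looks like 'Prefix: description'."""
    lines = message.splitlines()
    if not lines:
        return False
    match = _PATTERN.match(lines[0])
    return bool(match) and match.group("prefix") in ALLOWED_PREFIXES


print(is_valid_commit_message("Fix: typo in chapter 5, paragraph 3"))
print(is_valid_commit_message("fixed stuff"))
```

Only the first line of the message is checked, so a longer body underneath is left unconstrained.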
- Python: Follow PEP 8, use type hints
- Comments: Explain WHY, not just WHAT
- Naming: Clear, descriptive variable names
- Documentation: Docstrings for all functions/classes
Example:

    import numpy as np


    def calculate_attention_scores(query: np.ndarray, key: np.ndarray) -> float:
        """
        Calculate the attention score between one query and one key vector.

        Args:
            query: Query vector of shape (d_k,)
            key: Key vector of shape (d_k,)

        Returns:
            Attention score (scalar value)

        Example:
            >>> query = np.array([1.0, 0.5])
            >>> key = np.array([0.8, 0.6])
            >>> calculate_attention_scores(query, key)
            1.1
        """
        # Convert NumPy's scalar to a plain float so the doctest output reads 1.1
        return float(np.dot(query, key))

- Friendly and encouraging: "Great! Now let's see..."
- Avoid condescension: Never "Obviously..." or "Simply..."
- Inclusive language: Use "we" and "our"
- Short paragraphs: 3-5 sentences max
- Clear headings: Descriptive and hierarchical
- Examples first: Show, then explain
- Unified explanation style: Build plain-language intuition while preserving the real technical detail
- Bold for emphasis
- `code` for technical terms
- Blockquotes for important notes
- Lists for clarity
- Tables for comparisons
- Define symbols before using them
- Use LaTeX for complex equations: $E = mc^2$
- Show numerical examples after formulas
- Explain in words what the math means
Example:
#### Understanding the Dot Product
The **dot product** measures similarity between two vectors.
**Formula:**
$\text{score} = \vec{q} \cdot \vec{k} = q_1 k_1 + q_2 k_2 + ... + q_n k_n$
**Example:**
Query: $\vec{q} = [1.0, 0.5]$
Key: $\vec{k} = [0.8, 0.6]$
$\text{score} = (1.0 \times 0.8) + (0.5 \times 0.6) = 0.8 + 0.3 = 1.1$
**Intuition:** Higher scores mean the vectors point in similar directions!

- Content Review and Text Clarity
  - Verify mathematical accuracy
  - Improve explanations while preserving the unified tone
  - Add examples that help both beginners and technical readers
  - Keep concepts clear without splitting the course into separate tracks
- Interactive Web Version
  - Responsive design
  - Table of contents with smooth scrolling
  - Code syntax highlighting
  - Mobile-friendly layout
  - Dark mode support
- PyTorch Implementation
  - Heavily commented code matching the tutorial
  - Step-by-step execution examples
  - Debugging utilities
  - Visualization hooks
- Video Walkthroughs
  - Key concept explanations
  - Hand-calculation demonstrations
  - Step-through of complete examples
- Jupyter Notebooks
  - Interactive exercises
  - Executable code cells
  - Inline visualizations
- Additional Examples
  - More sentence processing examples
  - Different language examples
  - Edge cases and corner cases
- Exercises & Quizzes
  - Progressive difficulty levels
  - Immediate feedback
  - Explanations for wrong answers
- Translations
  - Spanish
  - Portuguese
  - Mandarin
  - Hindi
  - French
  - German
- Advanced Topics
  - Flash Attention
  - Mixture of Experts
  - Sparse Attention
  - Efficient Transformers
- Related Architectures
  - Vision Transformers (ViT)
  - Diffusion Transformers
  - Multimodal transformers

Found an error? Please help us fix it!
- What: Quote the problematic text
- Where: Chapter and section
- Issue: What's wrong (typo, factual error, clarity)
- Suggestion: How to fix it (if you have one)
- Environment: OS, Python version, dependencies
- Steps to reproduce: Exact steps to trigger the bug
- Expected behavior: What should happen
- Actual behavior: What actually happens
- Error messages: Full error output
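For the Environment field, a short script can gather most of the details for you. The following is a minimal sketch using only the Python standard library; the function name `collect_environment_info` and the choice of fields are assumptions for illustration, not a project requirement.

```python
import platform


def collect_environment_info() -> dict[str, str]:
    """Gather basic details worth pasting into a bug report."""
    return {
        # Operating system name, release, and architecture in one string.
        "os": platform.platform(),
        # Version of the interpreter running this script, e.g. "3.11.4".
        "python_version": platform.python_version(),
        # Interpreter flavor, e.g. "CPython" or "PyPy".
        "python_implementation": platform.python_implementation(),
    }


for name, value in collect_environment_info().items():
    print(f"{name}: {value}")
```

Installed dependencies are not covered by this sketch; `pip freeze` is one common way to list them alongside this output.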
Have an idea? We'd love to hear it!
Good enhancement suggestions include:
- Clear description of the enhancement
- Explanation of why it's valuable
- Examples of how it would work
- Consideration of alternatives
By contributing, you agree to the terms in the License:
- Your contribution will be credited
- You retain copyright to your work
- You grant a royalty-free license to the author
- You affirm you have the right to contribute
Significant contributors will be acknowledged in the README and documentation.
Stuck? Need guidance?
- Discussions: For questions and general discussion
- Issues: For specific bugs or feature requests
- Email: For private inquiries: rimomcosta@gmail.com
We pledge to make participation in this project a harassment-free experience for everyone, regardless of:
- Age
- Body size
- Disability
- Ethnicity
- Gender identity and expression
- Level of experience
- Nationality
- Personal appearance
- Race
- Religion
- Sexual identity and orientation
Positive behavior:
- Being respectful and inclusive
- Accepting constructive criticism gracefully
- Focusing on what's best for the community
- Showing empathy toward others
Unacceptable behavior:
- Trolling, insulting, or derogatory comments
- Public or private harassment
- Publishing others' private information
- Other conduct inappropriate in a professional setting
Project maintainers have the right to remove, edit, or reject comments, commits, code, issues, and other contributions that don't align with this Code of Conduct.
Contributors will be recognized in several ways:
- GitHub Contributors: Automatically listed by GitHub
- README Acknowledgments: Special recognition for significant contributions
- In-document Attribution: For major content additions
Building_a_Transformer/
├── README.md
├── 00-introduction.md
├── chapter-00-grand-vision.md
├── chapter-01-building-our-vocabulary.md
├── chapter-02-tokenization.md
├── chapter-03-embeddings.md
├── chapter-04-positional-encoding.md
├── chapter-05-multi-head-self-attention.md
├── chapter-06-dropout.md
├── chapter-07-feed-forward-network.md
├── chapter-08-residual-connections-layer-normalization.md
├── chapter-09-stacking-transformer-blocks.md
├── chapter-10-output-head.md
├── chapter-11-training-the-transformer.md
├── chapter-12-causal-masking.md
├── chapter-13-inference.md
├── chapter-14-all-the-hyperparameters.md
├── chapter-15-additional-techniques.md
├── chapter-16-common-training-problems-solutions.md
├── chapter-17-putting-it-all-together.md
├── chapter-18-from-language-model-to-chatgpt.md
├── chapter-19-three-transformer-architectures.md
├── chapter-20-quick-quizzes.md
├── chapter-21-going-further.md
└── appendix-math-symbols-quick-reference.md
Want to contribute but need to learn more first?
- Git & GitHub: GitHub Docs
- Markdown: Markdown Guide
- LaTeX Math: LaTeX Math Symbols
- Transformers: Start with 00-introduction.md, then read the chapter files in order
- Initial review: Within 1 week
- Feedback: Clear, constructive comments
- Iterations: Work together to refine
- Merge: Once approved by maintainer
- Recognition: Credit added to project
Every contribution, no matter how small, makes this resource better for learners worldwide. You're helping democratize AI education!
Together, we can make AI accessible to everyone.
Questions about contributing? Open an issue with the "question" label!
The complete written tutorial is split into chapter files, which are listed in the course contents.
- Interactive web version with navigation
- PyTorch implementation with step-by-step comments
- Text examples and exercises for each chapter
- Video walkthroughs of key concepts
- Optional interactive attention demos for the web version
- Jupyter notebooks with executable examples
- Exercise sets with solutions
- Translation to other languages
Want to help with any of these? Open an issue or submit a PR!
- Attention Is All You Need - The original transformer paper (2017)
- Language Models are Few-Shot Learners - The GPT-3 paper
- Training language models to follow instructions with human feedback - The InstructGPT paper, describing the RLHF approach behind ChatGPT-style assistants
- Instruction-Level Weight Shaping (ILWS) - A newer framework for self-improving AI agents and efficient adaptation (2025)
- The Illustrated Transformer - Visual explanation
- nanochat - Complete end-to-end ChatGPT pipeline by Andrej Karpathy (~8K lines of code, trains in 4 hours on 8×H100)
- The Annotated Transformer - Line-by-line implementation
- Transformers Explained by Serrano.Academy - Excellent visual walkthrough of transformer architecture
- Neural Networks Series by 3Blue1Brown - Beautiful mathematical intuition for neural networks
- MIT Introduction to Deep Learning - Comprehensive deep learning course with lectures and labs
- Stanford CS224N - Natural Language Processing with Deep Learning
- Fast.ai - Practical Deep Learning for Coders
This tutorial stands on the shoulders of giants:
- The authors of "Attention Is All You Need" (Google Brain, Google Research, and the University of Toronto) for the transformer architecture
- OpenAI for GPT and the insights into scaling laws
- Anthropic for Claude and research on AI safety
- Andrej Karpathy for making AI education accessible
- Luis Serrano (Serrano.Academy) for exceptional visual explanations of transformers
- Grant Sanderson (3Blue1Brown) for beautiful mathematical intuition in neural networks
- MIT 6.S191 team for their comprehensive Introduction to Deep Learning course
- The entire ML research community for open research
Special thanks to everyone who has provided feedback, found bugs, or contributed improvements!
- Author: Rimom Costa - rimomcosta@gmail.com
- Issues: Found a bug or have a suggestion? Open an issue
- Discussions: Questions or want to chat? Start a discussion
MIT License
Copyright (c) 2025 Rimom Costa
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If this tutorial helped you:
- Star this repository to help others find it
- Share it with your network
- Contribute improvements or corrections
- Support the author (sponsorship options coming soon)
Together, we can make AI education accessible to everyone!
Built with love for the AI learning community
"The best way to understand transformers is to build one yourself"