BANI MAINI

How to Build GPT‑2 from Scratch

Introduction

Have you ever wondered how a machine can write stories, answer questions, or hold a conversation like a human? Following Andrej Karpathy’s renowned Zero to Hero tutorials on creating GPT‑2 from scratch, this guide breaks down the core concepts behind GPT‑2 in an accessible way. Drawing on a few key Python concepts along the way, this post explains how GPT‑2 works and provides practical steps so that anyone, from a seasoned developer to a curious beginner, can experiment with it.

Let the journey into modern AI begin!

 

The GPT‑2 Blueprint: Breaking Down the Architecture

1. Model and Architecture

Variant Choice
Focus is placed on the 124M version of GPT‑2. The term "124M" refers to roughly 124 million parameters, which are the components the model uses to learn language. Fewer parameters mean faster training and easier comprehension, while larger models (such as GPT‑2 Medium, Large, or XL) provide increased capacity at the cost of greater complexity.
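
For reference, the 124M variant’s key hyperparameters can be collected in a small configuration object. The dataclass below is just an illustrative sketch (not code from the tutorial), but the numbers match the published GPT‑2 small settings:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length (context window)
    vocab_size: int = 50257  # number of BPE tokens in the vocabulary
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension

# Together these settings yield roughly 124 million learnable parameters.
```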

Decoder‑Only Transformer
GPT‑2 employs a transformer architecture that is "decoder‑only." In simple terms, transformers are neural networks that process data in parallel using attention mechanisms. A decoder‑only model predicts the next word in a sequence solely based on the words that precede it. Unlike models used for translation, there is no separate encoder component.

Refer to LLM Visualization to build an intuition for how a transformer works.


Token and Position Embeddings

Tokenization:
Text is first converted into smaller units known as tokens. GPT‑2 uses a vocabulary of 50,257 tokens. A byte pair encoding (BPE) algorithm repeatedly merges the most frequent pairs of symbols into single tokens, effectively reducing the number of tokens needed to represent the dataset.
Example: GPT‑4’s tokenizer uses a larger vocabulary of approximately 100,277 tokens.

Tiktokenizer can be used to explore how different models tokenize text. Here, for example, the token ID for "Hi" is 12194 for the gpt-4o model.

Tokenization
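
To inspect token IDs on your own machine, the tiktoken library (not part of the original steps above, but installable with pip install tiktoken) exposes GPT‑2’s BPE vocabulary directly; a minimal sketch:

```python
import tiktoken  # OpenAI's BPE tokenizer library

enc = tiktoken.get_encoding("gpt2")  # GPT-2's 50,257-token vocabulary
ids = enc.encode("Hi, I am writing a blog post on recreating GPT-2.")
print(ids)               # list of integer token IDs
print(enc.decode(ids))   # decoding round-trips back to the original text
print(enc.n_vocab)       # 50257
```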

Embeddings:
Each token is mapped to a 768-dimensional vector, which is a list of 768 numbers that captures its meaning much like a unique fingerprint.

Position Embeddings:
Since word order is important, position embeddings add information about where each token appears in a sentence. Together, token and position embeddings give the model both the identity and the location of each word. For example, in the sentence “Hi, I am writing a blog post on recreating GPT-2.”, the token “Hi” occupies the first position.
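
A minimal sketch of how the two embeddings combine, using GPT‑2 (124M) sizes and made-up token IDs purely for illustration:

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 50257, 1024, 768   # GPT-2 (124M) dimensions

wte = nn.Embedding(vocab_size, n_embd)  # token embeddings: one 768-number vector per token
wpe = nn.Embedding(block_size, n_embd)  # position embeddings: one vector per position

idx = torch.tensor([[12194, 11, 314]])  # illustrative token IDs, shape (batch=1, sequence=3)
pos = torch.arange(idx.size(1))         # positions 0, 1, 2
x = wte(idx) + wpe(pos)                 # (1, 3, 768): identity + location for each token
print(x.shape)
```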

2. The Self‑Attention Mechanism: The Heart of the Model

Self‑Attention:
Imagine every word in a sentence quietly consulting with the words that came before it to understand context. For each token, the model computes three key vectors:

Query: Represents the token’s question and what it needs to know.

Key: Represents potential answers from preceding tokens.

Value: Provides additional information to refine the answer.

The dot product between the query and key vectors yields scores that indicate how much attention each preceding token should receive. A softmax function then converts these scores into probabilities (values between 0 and 1 that add up to 1), guiding the model in selecting the most relevant context when predicting the next word.

Both token and position embeddings serve as inputs to the transformer. For example, in the sentence "I love machine learning," the word "love" (in the second position) is processed alongside its corresponding embeddings. Matrix multiplication between keys and queries, followed by softmax, ultimately leads to the prediction of the subsequent token.

Query, Key and Value Matrix
Attention for the next token
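
As a rough single-head sketch of this computation (GPT‑2 actually splits attention across 12 heads of 64 dimensions each), with random placeholder inputs standing in for the embeddings:

```python
import math
import torch
import torch.nn.functional as F

B, T, C = 1, 4, 768                  # batch, sequence length, embedding size
x = torch.randn(B, T, C)             # stand-in for the token + position embeddings

W_q, W_k, W_v = (torch.randn(C, C) * 0.02 for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v  # queries, keys, values for every token

scores = (q @ k.transpose(-2, -1)) / math.sqrt(C)       # query-key dot products
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # causal mask: attend only to the left
scores = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(scores, dim=-1)                     # probabilities summing to 1 per token
out = weights @ v                                       # context-weighted mix of value vectors
print(out.shape)                                        # (1, 4, 768)
```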

Sequential Processing:
Because GPT‑2 is a decoder‑only model, each token is processed in order from left to right. This means each token only “sees” the tokens that came before it, mirroring how we read and understand sentences.

MLP vs Attention:
Attention layers are where tokens exchange information with one another, while the MLP (multi-layer perceptron) layers process each token on its own, transforming what it has gathered. GPT‑2 stacks these two kinds of layers alternately, twelve times over in the 124M model.

MLP vs Attention
3. Stabilization Techniques: Normalization, Residual Connections, and Weight Sharing

Layer Normalization:
Layer normalization is a technique used to standardize the inputs to each layer of the network. It makes training more stable by ensuring that the values do not become too large or too small.

Residual Connections:
Residual connections add the original input of a layer to its output. Imagine you’re building a tower and adding each new block on top of the previous one, while also keeping a copy of the original blueprint. This helps preserve the original information and makes it easier for the network to learn complex functions.

Weight Sharing:
In GPT‑2, the same embedding matrix is used at both the input and output of the network. This reduces the total number of parameters by around 30 percent and helps the model recognize similarities between tokens (such as different capitalizations of the same word).
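
To see how these pieces fit together, here is a rough sketch of one GPT‑2-style block with pre-layer-norm and residual connections, plus the weight-sharing trick. It uses PyTorch’s built-in attention module for brevity (the causal mask is omitted), so treat it as an illustration rather than the exact implementation:

```python
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: attention and MLP, each normalized and wrapped in a residual add."""
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        h = self.ln_1(x)                               # layer norm before attention
        a, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted for brevity
        x = x + a                                      # residual: add the block's input back in
        x = x + self.mlp(self.ln_2(x))                 # second residual around the MLP
        return x

# Weight sharing: the output head reuses the token-embedding matrix.
wte = nn.Embedding(50257, 768)
lm_head = nn.Linear(768, 50257, bias=False)
lm_head.weight = wte.weight   # the same tensor maps tokens in and logits out
```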

4. Sampling: How GPT‑2 Chooses Its Words

Stochastic Sampling Process:
When generating text, GPT‑2 does not simply output the most likely next word. Instead, it samples from a probability distribution.

Stochastic:
The term “stochastic” means that there is randomness involved. Even if you feed the same input twice, you might get slightly different outputs because the model is sampling from probabilities.

Why It Matters:
This approach can sometimes generate tokens that were not explicitly present in the training data. Although these tokens are new, they statistically resemble the training data, which adds creative variety to the generated text.
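
A minimal sketch of this sampling step; top-k truncation (keeping only the 50 most likely tokens, a commonly used setting) balances coherence and variety, and the logits below are random placeholders:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, top_k=50):
    """Pick the next token stochastically from the model's probability distribution."""
    probs = F.softmax(logits, dim=-1)                      # scores -> probabilities
    topk_probs, topk_ids = torch.topk(probs, top_k)        # keep only the k most likely tokens
    choice = torch.multinomial(topk_probs, num_samples=1)  # random draw weighted by probability
    return topk_ids[choice]

next_token = sample_next_token(torch.randn(50257))         # placeholder logits over the vocabulary
print(next_token.item())
```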

5. Training and Optimization: Learning to Generate Text

Backpropagation and Gradient Descent:

Backpropagation:
Imagine watching a replay of your basketball shot to see exactly where you missed the hoop compared to where you aimed. The model computes the difference (or error) between its prediction and the target value for each token, much like identifying where the shot went wrong, and then sends that error information backward through the network to adjust its parameters.

Gradient Descent:
Think of it as making small tweaks to your shooting technique after each shot in the game. With every adjustment, you get a little closer to consistently hitting the basket. Similarly, gradient descent makes tiny, iterative changes to the model's weights to gradually reduce the error, improving the predictions over time.
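
In code, one training step looks roughly like the sketch below, assuming `model`, `x` (inputs), `y` (targets), and `optimizer` already exist and the model returns both logits and a loss when targets are provided:

```python
# One training step (sketch): model, optimizer, x, y assumed defined elsewhere.
logits, loss = model(x, y)   # forward pass: predictions and their error vs. the targets
optimizer.zero_grad()        # clear gradients left over from the previous step
loss.backward()              # backpropagation: push the error backward through the network
optimizer.step()             # gradient descent: nudge every weight to reduce the error
```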

Advanced Optimization Techniques:

AdamW Optimizer:
The AdamW optimizer efficiently adjusts model weights and incorporates weight decay to prevent overfitting.

Proper Initialization and Scaling:
For stable training, token embeddings were initialized with a standard deviation of 0.02, and the residual projection weights were scaled by 1/√N (where N is the number of residual connections) to keep activations in check.
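
As a sketch of those two choices, assuming `model` is the GPT‑2 module built earlier; the learning rate and weight-decay values shown are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

# AdamW with decoupled weight decay; hyperparameter values are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

def init_weights(module):
    # Start embeddings and linear layers from a normal distribution with std 0.02.
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

model.apply(init_weights)
```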

Speed Enhancements and Parallel Training:

Techniques:
Using modern methods like bfloat16 precision, torch.compile, and flash attention, we were able to reduce the training time per step from roughly 1,000 milliseconds to around 93 milliseconds.

Different formats with different precisions and ranges

Parallel Processing:

Leveraging PyTorch’s DistributedDataParallel with torchrun allowed multiple GPUs to be used at once, further accelerating the training process.
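
A rough sketch of how those pieces look in PyTorch 2.x; `model`, `x`, `y`, and `local_rank` are assumed to come from your training script, and the DistributedDataParallel line only applies when the script is launched with torchrun:

```python
import torch

model = torch.compile(model)  # fuse kernels and optimize the computation graph (PyTorch 2.x)

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits, loss = model(x, y)  # mixed-precision forward pass in bfloat16

# Flash attention via PyTorch's fused kernel, used inside the attention module:
#   y = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)

# Multi-GPU training: wrap the model and launch with
#   torchrun --nproc_per_node=8 train.py
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```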

Training Data:
Initially, the model is trained on the Shakespeare dataset for quick testing and iteration. Once the model is ready for a longer run, a more extensive and diverse dataset is required. For large-scale training, popular sources include the RedPajama dataset and its SlimPajama subset, which offer a broad collection of web-scraped text. Datasets like FineWeb, along with high-quality subsets such as FineWeb-Edu available on Hugging Face, are also commonly used. In fact, Andrej Karpathy uses the FineWeb-Edu sample-10BT subset (roughly 10 billion tokens) for the main training run, and it can be downloaded directly from Hugging Face.

Commonly used datasets

Early Evaluation of the Trained Model:

HellaSwag evaluation for an early run

How You Can Do It Too: A Step‑by‑Step Guide

1. Clone the Code: Visit Hugging Face’s GitHub repository and copy the GPT‑2 code from transformers/src/transformers/models/gpt2/modeling_gpt2.py.

2. Set up Your Environment: Ensure you have Python installed. Then, install the required libraries with pip install torch transformers.

3. Follow the Tutorials: Use the Zero to Hero tutorials as your roadmap. They explain each step, from setting up your environment to training and sampling from the model; the video “Let's reproduce GPT-2 (124M)” walks through this exact build.

4. Experiment with Sampling: Try modifying the sampling function in the code to see how changing parameters affects the generated text (a quick sampling sketch follows this list).

5. Optimize and Scale: If you have access to multiple GPUs, experiment with distributed training to speed up the process. GPUs can be rented from: Lambda | GPU Compute for AI
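
As a quick way to try step 4 before your own model is trained, the pretrained 124M checkpoint can be sampled through the Hugging Face transformers library; a minimal sketch:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")      # the 124M checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tok("Hello, I'm a language model,", return_tensors="pt")
out = model.generate(**inputs, max_length=30, do_sample=True, top_k=50)
print(tok.decode(out[0]))
```

Changing do_sample, top_k, or max_length here is a low-stakes way to see how the sampling parameters shape the output before editing your own code.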

Final Thoughts

Building GPT‑2 is a journey of discovery from understanding token embeddings and self-attention to mastering stochastic sampling and advanced optimization techniques. I extend my sincere thanks to Andrej Karpathy for making the complexities of transformers accessible and fun. This guide aims to deepen your understanding of AI while inspiring you to experiment and create your own AI magic. Remember, every great journey begins with a single token!