Want to train a Language Model? Think Basketball.
Training a large language model is a lot like learning to play basketball.
Pre-training is watching game footage, reading rules, and shooting thousands of free throws. Instruction tuning is working with a coach to learn actual plays and improve strategy. Reinforcement learning is when you step on the court, make decisions, miss shots, get feedback, and refine your performance.
This is the journey every modern LLM goes through. It learns the structure of language. Then it learns how to be useful. And finally, it learns what great answers look like.
Let’s walk through the process from raw web data to a model that can explain, reason, and write as fluently as a human.
Stage 1: Pre-training - Learning the Patterns of Language
Step 1: Feed It the Internet
Pre-training begins with massive datasets of public web content. Common Crawl is one such source, containing billions of text documents. But raw data needs to be cleaned.
Through filtering, we remove (1) spam and duplicate content, (2) harmful or low-quality sites, (3) HTML and formatting code, and (4) non-target languages and personal data. Hugging Face’s FineWeb and FineWeb-Edu are curated outputs of this process.
These datasets simulate the full range of human writing while staying clean and relevant.
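To make the filtering step concrete, here is a minimal sketch in Python, assuming a plain list of raw documents and a few illustrative heuristics (real pipelines like the one behind FineWeb use far more sophisticated quality and language filters):

```python
import hashlib
import re

def clean_corpus(raw_docs):
    """Toy cleaning pass: strip HTML tags, drop very short pages, remove exact duplicates."""
    seen_hashes = set()
    cleaned = []
    for doc in raw_docs:
        text = re.sub(r"<[^>]+>", " ", doc)       # strip leftover HTML/formatting tags
        text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
        if len(text.split()) < 50:                # drop pages too short to be useful
            continue
        digest = hashlib.sha1(text.encode()).hexdigest()
        if digest in seen_hashes:                 # drop exact duplicates
            continue
        seen_hashes.add(digest)
        cleaned.append(text)
    return cleaned
```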

Step 2: Convert Text to Tokens
Language is transformed into numbers through tokenization. LLMs do not see characters; they see tokens. GPT-style models use Byte Pair Encoding (BPE) to break text into sub-words.
For example: “Let’s recreate a language model together”
Might become: [Let, ’s, recreate, a, language, model, together]
Which then maps to numbers like: 58369, 94905, 261, 6439, 2359, 4717 (depending on the model)

Models have a fixed vocabulary, typically between 50,000 and 100,000 tokens. This allows them to efficiently represent both common and rare words.
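You can watch tokenization happen with the tiktoken library, which implements the BPE encodings used by GPT-style models; the exact sub-word splits and IDs depend on which encoding you load:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # BPE vocabulary used by GPT-4-era models

text = "Let's recreate a language model together"
token_ids = enc.encode(text)                     # text -> list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]    # inspect the individual sub-word pieces

print(token_ids)    # the integer IDs the model actually sees
print(pieces)       # e.g. ["Let", "'s", " recreate", ...]
print(enc.n_vocab)  # vocabulary size (roughly 100k for this encoding)
```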
Step 3: Teach It to Predict the Next Token (the Base Model)
The model’s only job at this stage is to guess the next token. It sees a sequence like: [The, dog, plays, with, a]. And tries to predict the next token: ball.
It compares its guess to the true answer, calculates the error, and adjusts itself using backpropagation and gradient descent. This process is repeated across billions of samples. This task is simple, but profound. Because it must predict text in context, the model learns grammar, style, tone, and factual patterns all just by guessing what comes next.
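Here is a minimal sketch of that loop in PyTorch, using a tiny embedding-plus-linear stand-in for a real transformer, just to show the mechanics of predict, measure error, adjust:

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: every position tries to predict the token that follows it.
tokens = torch.tensor([[5, 12, 7, 3, 9, 4]])      # stand-in token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift the sequence by one

logits = model(inputs)                            # (batch, seq, vocab) scores
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # backpropagation computes the gradients
optimizer.step()                                  # gradient descent nudges the weights
```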
At this stage, the model is a “base model”. It is not an assistant. It is a massive statistical engine that simulates internet text. It does not reason or plan. It just knows what text “should” come next. Its “knowledge” is embedded in its parameters as a vague recollection. It does not memorize full documents, but it can reproduce fragments that resemble training data.
Sampling: How the Model Writes
When generating output, the model samples from a probability distribution. It does not always choose the most likely next token.
This makes the system stochastic, i.e., its outputs vary even with the same prompt. It may even generate token sequences it has never seen in training, as long as they statistically “fit.” This is not a flaw. It is how creativity and generalization emerge. The model builds on its training to create new, inspired sequences.
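A minimal sketch of that sampling step, assuming we already have the model’s scores (logits) for the next token; temperature is one common knob for how adventurous the draw is:

```python
import torch

def sample_next_token(logits, temperature=0.8):
    """Draw the next token ID from the distribution instead of always taking the top choice."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Toy logits over a 5-token vocabulary: the most likely token usually wins, but not always.
logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
print([sample_next_token(logits) for _ in range(10)])  # varies run to run, by design
```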
Stage 2: Instruction Tuning - Teaching It to Be Helpful
Base models are fluent but not reliable. They need guidance to be useful in real-world applications.
Few-Shot Prompting
We can show the model a few examples of the desired behavior within the prompt. This technique works, but it is sensitive: the model may follow the pattern, or it may inject unrelated context.
For example:
Q: What is the capital of France?
A: Paris
Q: What is the capital of Germany?
A:
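In code, few-shot prompting is just string assembly. The sketch below builds the prompt above; query_model is a hypothetical stand-in for whatever completion endpoint you have available:

```python
few_shot_prompt = (
    "Q: What is the capital of France?\n"
    "A: Paris\n\n"
    "Q: What is the capital of Germany?\n"
    "A:"
)

print(few_shot_prompt)
# completion = query_model(few_shot_prompt)  # hypothetical call; a good base model continues with "Berlin"
```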

Supervised Fine-Tuning
A more consistent method is to fine-tune the model on human-written examples of prompts and ideal responses. These are formatted with structure tokens like <|im_start|> to simulate chat flows. The model learns to imitate these helpful, safe behaviors and take on an “assistant” persona.
This tuning is much faster and cheaper than pre-training. It allows the model to adopt personalities, avoid certain topics, or follow instructions better.
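As a rough sketch, here is how one training example might be rendered with ChatML-style structure tokens before fine-tuning (the exact template and special tokens vary between model families):

```python
def format_chat_example(user_prompt, assistant_response):
    """Render one supervised fine-tuning example in a ChatML-style chat template."""
    return (
        "<|im_start|>user\n"
        f"{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"{assistant_response}<|im_end|>\n"
    )

print(format_chat_example(
    "Explain tokenization in one sentence.",
    "Tokenization splits text into sub-word units that the model maps to integer IDs.",
))
# During fine-tuning, the loss is typically computed only on the assistant's tokens.
```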
Synthetic Data Generation
In many cases, models like GPT-4 or Claude help generate their own fine-tuning data. Human reviewers then edit or approve this synthetic content. Datasets like UltraChat rely heavily on this method, creating conversational data at scale.
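A hedged sketch of that pattern, where generate_with_strong_model is a hypothetical placeholder for an API call to a capable teacher model, and every generated pair is flagged for human review rather than used blindly:

```python
def generate_with_strong_model(prompt):
    """Hypothetical stand-in for a call to a strong teacher model's API."""
    return f"[teacher model output for: {prompt}]"

seed_topics = ["photosynthesis", "binary search", "compound interest"]

synthetic_pairs = []
for topic in seed_topics:
    question = generate_with_strong_model(f"Write a question a curious student might ask about {topic}.")
    answer = generate_with_strong_model(f"Answer helpfully and accurately: {question}")
    synthetic_pairs.append({"prompt": question, "response": answer, "status": "needs_human_review"})
```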
Stage 3: Reinforcement Learning - Teaching It to Judge and Improve
Even after fine-tuning, the model may offer several valid answers to a prompt. But which one is best? To teach preference, LLMs are trained using Reinforcement Learning from Human Feedback (RLHF).
Here’s how:
1. The model generates multiple responses to a prompt
2. Human reviewers rank the responses
3. A reward model learns what “better” looks like
4. The main model is retrained to prefer high-scoring answers
This teaches the model not just how to complete a sentence, but how to make judgments. Over time, models develop “chain-of-thought reasoning”, breaking problems down into multiple steps. Research suggests they spend more tokens working through harder problems, mirroring how humans slow down to think. It is an emergent strategy.
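The reward model in step 3 above is commonly trained with a pairwise preference loss: given a response the reviewers preferred and one they rejected, it learns to score the preferred one higher. A minimal sketch, assuming we already have scalar scores for each response:

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: lower when the preferred response outscores the rejected one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy scalar scores the reward model assigned to (chosen, rejected) response pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(preference_loss(chosen, rejected))  # training pushes this value down
```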
The famous AlphaGo system used reinforcement learning to discover strategies that surpassed human play. In the same way, LLMs discover how to explain, reason, and problem-solve through iteration.

Retrieval-Augmented Generation (RAG) - Giving It Access to Knowledge
LLMs cannot remember everything. They only “know” what is stored in their parameters or what fits in the context window. RAG solves this. In RAG, the model is paired with a search or database tool. When prompted, it fetches relevant documents first, then uses those documents to generate a response.
For example:
· A user asks: “Summarize the latest company meeting”
· The system retrieves the meeting transcript
· The model generates a summary using that context
This allows LLMs to answer questions about up-to-date, private, or highly specific content, without needing to be retrained.
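A minimal sketch of that retrieve-then-generate flow, using a toy keyword-overlap retriever and a hypothetical generate() call in place of a real vector store and LLM API:

```python
documents = {
    "meeting_2024_06.txt": "Transcript of the latest company meeting - the team agreed to ship the beta in July.",
    "handbook.txt": "Company handbook covering vacation policy, expense rules, and onboarding steps.",
}

def retrieve(query, docs, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    return sorted(docs.values(),
                  key=lambda text: len(query_words & set(text.lower().split())),
                  reverse=True)[:k]

query = "Summarize the latest company meeting"
context = "\n".join(retrieve(query, documents))
prompt = f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
# answer = generate(prompt)  # hypothetical call to the language model
```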
Why Do Models Still Make Mistakes?
Even with all this training, LLMs still hallucinate. They confuse tokens, numbers, or concepts. Why? Because they are not deterministic engines. They are probabilistic predictors.
They do not calculate; rather, they generate text that looks like a calculation. They might claim that 9.11 is greater than 9.9 because, in the text they were trained on, “9.11” usually comes after “9.9” in contexts like version numbers, dates, and section headings.
They struggle with:
· Counting (without external tools)
· Spelling (since they see tokens, not letters)
· Knowing when to say “I don’t know”
This is why newer models are paired with:
· Code interpreters for accurate math and logic
· Web access for up-to-date knowledge
· Memory systems to track conversation history
Final Thoughts: The Power of Next Token Prediction
The full lifecycle is as follows:
· Pre-training builds a general understanding of language
· Instruction tuning teaches it to follow directions
· Reinforcement learning shapes how it responds
· Retrieval lets it handle live knowledge
· Tools give it logic, memory, and truth
This is how LLMs go from internet simulators to intelligent communicators. And it all starts with guessing what comes next. With enough data, careful tuning, and feedback, that simple task leads to remarkable capability.