How GPT Actually Generates Text

Part 4 of the Attention & Transformers Deep Dive Series


Introduction

At this point in the series, we understand:

  • Embeddings
  • Self-attention
  • Query, Key, and Value vectors
  • Multi-head attention
  • Positional encoding

But we still have not answered one of the biggest practical questions:

How does GPT actually generate text?

When you type:

Explain quantum computing

how does the model continue producing:

  • Coherent paragraphs
  • Code
  • Reasoning
  • Structured responses

one token at a time?

And perhaps even more importantly:

How does the model avoid cheating during training?

That is where:

  • Masked attention
  • Autoregressive generation
  • Inference sampling

enter the picture.

This post explains the mechanics behind modern GPT-style text generation.


The Core Challenge

Regular self-attention allows every token to attend to every other token.

That works well for understanding text.

But generation introduces a problem.

Suppose the model is training on:

"The cat sat on the mat"

When predicting:

mat

the model must NOT already see:

mat

Otherwise the task becomes trivial.

The model would simply copy the answer.


GPT’s Solution: Causal Masking

GPT uses masked self-attention, also called causal attention.

The rule is simple:

each token can attend only to itself and to earlier tokens.

Future tokens are hidden.


Visual Intuition

Suppose the sequence is:

The cat sat

Attention permissions become:

Token    Can Attend To
The      The
cat      The, cat
sat      The, cat, sat

Future positions remain inaccessible.


Attention Matrix View

Normal bidirectional attention:

✓ ✓ ✓
✓ ✓ ✓
✓ ✓ ✓

Causal masked attention:

✓ ✗ ✗
✓ ✓ ✗
✓ ✓ ✓

The allowed (✓) positions form a lower triangle; the masked (✗) positions form the upper triangle above the diagonal.
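
The pattern above can be sketched in a few lines of plain Python. This is only an illustration: the function name `causal_mask` is made up for this post, and real implementations build the mask as a tensor rather than nested lists.

```python
def causal_mask(n):
    """Return an n x n grid of booleans: True where attention is allowed."""
    # Row i is the token doing the attending; column j is the token attended to.
    # Position j is visible only if it sits at or before position i.
    return [[j <= i for j in range(n)] for i in range(n)]


tokens = ["The", "cat", "sat"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token:>4} can attend to {visible}")
```

Each row lists the same permissions as the table in the previous section.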


Why This Matters

Causal masking transforms the Transformer into an autoregressive model.

Meaning:

predict the next token from previous tokens.

This became the foundation of:

  • GPT
  • Modern chat models
  • Coding copilots
  • Text generation systems

How Masking Works Internally

Before softmax, forbidden positions receive huge negative values.

Conceptually:

Suppose the raw attention scores are:

[2, 5, 1]

If the third position is forbidden:

[2, 5, -∞]

After softmax (rounded):

[0.05, 0.95, 0]

The forbidden token receives:

  • Zero probability
  • Zero attention weight

This is how the model avoids peeking into the future.
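
A minimal sketch of that calculation, assuming a hypothetical helper `masked_softmax` that takes one row of scores and one row of permissions. It is not how any particular library implements attention, but the arithmetic is the same.

```python
import math


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; forbidden positions get weight 0."""
    # Forbidden scores become -inf, and exp(-inf) is exactly 0.
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    # Subtracting the max keeps exp() numerically stable without changing
    # the result. In causal attention a token can always see itself, so at
    # least one position is allowed and the max is finite.
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_softmax([2.0, 5.0, 1.0], [True, True, False])
print([round(w, 2) for w in weights])  # [0.05, 0.95, 0.0]
```

The remaining weights still sum to 1, so the attention that would have gone to the forbidden token is redistributed over the visible ones.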


GPT Training Objective

The training task becomes:

predict the next token

repeated billions or trillions of times.


Example

Input:  "The cat sat on the"
Target: mat

Then:

Input:  "The cat sat on the mat"
Target: and

and so on.
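
As a rough sketch, here is how one sentence turns into several training examples. Splitting on whitespace is a simplification for readability (real models use subword tokenizers), and `next_token_pairs` is a name invented for this example.

```python
def next_token_pairs(tokens):
    """Yield (context, target) pairs for next-token prediction."""
    # Every prefix of the sequence is a context; the token right after it
    # is the target the model should predict.
    for i in range(1, len(tokens)):
        yield tokens[:i], tokens[i]


sentence = "The cat sat on the mat".split()

for context, target in next_token_pairs(sentence):
    print(" ".join(context), "->", target)
```

A six-token sentence yields five examples. With causal masking, all of them can be scored from a single pass over the sequence, because the input at each position only ever sees its own prefix.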


A Huge Transformer Advantage

Even though generation is sequential, training remains highly parallelizable.

This was one of the biggest Transformer breakthroughs.


Why Parallel Training Matters

RNNs process sequences step-by-step.

Transformers can process entire sequences simultaneously during training, because the causal mask ensures that the prediction at each position depends only on that position and the ones before it.

This dramatically improved:

  • GPU utilization
  • Scaling efficiency
  • Training throughput

And that enabled internet-scale language modeling.


What GPT Is Really Learning

GPT is fundamentally learning:

probability distributions over next tokens.

At every step, the model predicts a probability distribution across the vocabulary.

Example:

Token    Probability
mat      0.62
floor    0.18
chair    0.05
moon     0.001

The model does not directly generate text.

It generates probability distributions.

Text emerges through sampling.
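
To make "sampling" concrete, here is a small sketch that draws tokens from the example distribution above using Python's standard library. The helper name `sample_token` is hypothetical, and the four listed tokens stand in for a full vocabulary.

```python
import random


def sample_token(probs, rng=random):
    """Draw one token from a {token: probability} mapping."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    # random.choices samples in proportion to the weights,
    # so they do not need to sum to exactly 1.
    return rng.choices(tokens, weights=weights, k=1)[0]


next_token_probs = {"mat": 0.62, "floor": 0.18, "chair": 0.05, "moon": 0.001}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print([sample_token(next_token_probs, rng) for _ in range(10)])
```

Most draws come back as "mat", but not all of them, which is why the same prompt can produce different outputs.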


Generation Happens Token-by-Token

Suppose the prompt is: "The cat"

The model predicts:

Token    Probability
sat      0.55
ran      0.20
slept    0.10

Suppose "sat" gets selected.

Now the prompt becomes:

"The cat sat"

Then the process repeats.

This loop continues:

  • One token at a time
  • Until a stopping condition is met, such as an end-of-sequence token or a length limit
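
The loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table keyed on the last token only, whereas a real GPT conditions on the entire context, and `<end>` is a made-up end-of-sequence marker.

```python
import random

# Hypothetical toy "model": last token -> next-token probabilities.
TOY_MODEL = {
    "The": {"cat": 0.9, "mat": 0.1},
    "cat": {"sat": 0.55, "ran": 0.20, "slept": 0.10},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
    "slept": {"<end>": 1.0},
    "mat": {"<end>": 1.0},
}


def generate(prompt, max_new_tokens=10, rng=random):
    tokens = prompt.split()
    for _ in range(max_new_tokens):  # stopping condition 1: length limit
        probs = TOY_MODEL[tokens[-1]]
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if choice == "<end>":  # stopping condition 2: end-of-sequence token
            break
        tokens.append(choice)  # the new token joins the context for the next step
    return " ".join(tokens)


print(generate("The cat", rng=random.Random(0)))
```

The key line is `tokens.append(choice)`: each selected token is fed back in as input for the next prediction.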

Why GPT Feels Intelligent

This is one of the most misunderstood aspects of LLMs.

GPT is not:

  • Explicitly reasoning symbolically
  • Searching a database internally
  • Executing logical proofs

Instead, sophisticated reasoning patterns emerge from large-scale next-token prediction.

This distinction matters enormously.


Temperature: Controlling Randomness

One of the most important inference settings is temperature.

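Temperature is a number T that divides the model's raw scores (logits) before softmax. It controls:

  • Randomness of sampling
  • Perceived creativity
  • How sharply peaked (confident) the distribution is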

Low Temperature

Example:

T = 0.2

Effects:

  • More deterministic
  • More conservative
  • Often more consistent on factual tasks
  • Less creative

Probability distributions become sharper.


High Temperature

Example:

T = 1.5

Effects:

  • More exploratory
  • More creative
  • More surprising
  • Higher hallucination risk

Probability distributions become flatter.


Intuition Example

Original probabilities:

Token    Probability
mat      0.60
floor    0.25
moon     0.01

Low temperature:

Token    Probability
mat      0.90
floor    0.09
moon     ~0

High temperature:

Token    Probability
mat      0.40
floor    0.30
moon     0.10
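
The sharpening and flattening effect comes from dividing the logits (the raw scores before softmax) by T. Here is a minimal sketch. The logit values are invented for illustration, so the printed probabilities will not match the tables above exactly, but they move in the same direction.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the tokens: mat, floor, moon
logits = [2.0, 1.0, -2.0]

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

At T = 0.2 nearly all the probability lands on the top token. At T = 1.5 the unlikely token gets a noticeably larger share than it does at T = 1.0.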

Why Temperature Matters

Low temperature is useful for:

  • Coding
  • Structured outputs
  • Factual tasks
  • Deterministic workflows

High temperature is useful for:

  • Brainstorming
  • Storytelling
  • Creativity
  • Divergent exploration

Top-k Sampling

Vocabulary sizes can exceed:

  • 50,000
  • 100,000
  • even larger

Most tokens are irrelevant at any given step.

Top-k sampling keeps only the k most likely tokens.

Everything else is assigned probability 0, and the remaining probabilities are renormalized so they sum to 1.


Example

Before top-k:

Token       Probability
mat         0.50
floor       0.20
table       0.10
elephant    0.0001

If:

top-k = 3

only:

  • mat
  • floor
  • table

remain selectable.
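
A small sketch of the filtering step, using the example probabilities above. The function name `top_k_filter` is made up for this post. Production code typically does this on logit tensors rather than dictionaries.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize them to sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    # Dropped tokens are simply absent, which is the same as probability 0.
    return {token: p / total for token, p in kept}


probs = {"mat": 0.50, "floor": 0.20, "table": 0.10, "elephant": 0.0001}

for token, p in top_k_filter(probs, k=3).items():
    print(token, round(p, 3))
```

After filtering, "elephant" can never be sampled, and the three survivors keep their relative proportions.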


Top-p Sampling (Nucleus Sampling)

Top-p is more adaptive.

Instead of a fixed number of tokens, it keeps the smallest set of most likely tokens whose cumulative probability reaches or exceeds a threshold p.

Example:

p = 0.9
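
Here is a minimal sketch of that rule. The name `top_p_filter` and both example distributions are invented for illustration. The point is how the size of the kept set changes with the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # threshold reached: stop adding tokens
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical distributions: one peaked, one spread out.
confident = {"mat": 0.92, "floor": 0.05, "table": 0.03}
uncertain = {"mat": 0.35, "floor": 0.25, "table": 0.20, "rug": 0.12, "sofa": 0.08}

print(list(top_p_filter(confident, 0.9)))  # ['mat']
print(list(top_p_filter(uncertain, 0.9)))  # ['mat', 'floor', 'table', 'rug']
```

With the same p, the peaked distribution keeps one token while the spread-out one keeps four.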

Why Top-p Became Popular

Easy prompts:

  • small candidate set

Ambiguous prompts:

  • larger candidate set

This creates more natural generation behavior than fixed top-k.

Modern chat systems frequently use:

  • Temperature
  • Top-p
  • Repetition penalties

together.


Hallucinations

Hallucinations are one of the most important realities of working with LLMs.


Critical Truth

LLMs optimize for:

plausible next-token prediction

NOT:

  • Truth
  • Grounded factual retrieval
  • Perfect reasoning

This is why hallucinations occur.


Example

Prompt:

"What did Napoleon say in 2022?"

The model may still confidently produce output because:

  • fluent continuation patterns exist

not because:

  • factual grounding exists.

Why Hallucinations Increase at Higher Temperature

Higher randomness:

  • Increases low-probability token selection
  • Weakens factual precision
  • Encourages creative continuation

This can:

  • Improve brainstorming
  • Worsen reliability

Context Conditioning

LLMs are extremely sensitive to context.

Small prompt changes can dramatically alter:

  • Probability distributions
  • Tone
  • Reasoning style
  • Formatting
  • Verbosity
  • Behavior

Example

Prompt A:

"Write Python code"

Prompt B:

"You are a senior Python engineer. Write production-grade code."

The second prompt shifts:

  • Output distributions
  • Structure expectations
  • Stylistic patterns

This is why prompt engineering matters.


Why System Prompts Matter So Much

System prompts strongly bias:

  • Behavior priors
  • Assistant personality
  • Formatting rules
  • Safety behavior
  • Response style

They shape generation before user input even arrives.


Beam Search

Beam search is another generation strategy.

Mostly used in:

  • Translation
  • Structured generation
  • Seq2seq systems

instead of chat models.

Beam search explores multiple candidate continuations simultaneously and keeps the highest-scoring sequences.
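
A simplified sketch of the idea follows. `TOY_MODEL` is a hypothetical lookup table keyed on the last token only, and sequences are scored by summed log-probability. Real implementations add details such as length normalization that are left out here.

```python
import math

# Hypothetical toy model: last token -> next-token probabilities.
TOY_MODEL = {
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.5, "ran": 0.5},
    "dog": {"barked": 0.9, "sat": 0.1},
}


def beam_search(start, steps, beam_width):
    """Return up to beam_width sequences with the highest total log-probability."""
    beams = [([start], 0.0)]  # each beam: (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            options = TOY_MODEL.get(tokens[-1], {})
            if not options:
                candidates.append((tokens, score))  # no continuation: carry forward
                continue
            for token, prob in options.items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Prune: keep only the best beam_width candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


for tokens, score in beam_search("The", steps=2, beam_width=2):
    print(" ".join(tokens), round(math.exp(score), 2))
```

In this toy setup, picking the single best token at each step would choose "cat" first (0.6) and finish with a sequence probability of 0.30. Keeping two beams finds "The dog barked" at 0.36, which is the kind of case beam search is designed for.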


Why Beam Search Is Less Common in Chatbots

Beam search often produces:

  • Repetitive outputs
  • Generic phrasing
  • Lower creativity

Modern chat systems usually prefer sampling-based approaches.


Deterministic vs Probabilistic Generation

Even temperature 0 (greedy decoding, where the most likely token is always picked) is not perfectly deterministic in practice.

Tiny differences in:

  • Floating point math
  • GPU execution
  • Implementation details

can still slightly alter outputs.
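
A tiny illustration of the floating-point part. Addition of floats is not associative, so summing the same numbers in a different order can give a slightly different result. The `greedy_pick` helper and the two-token "logits" are contrived so that the difference is enough to flip the winner. Real systems hit this far more rarely, only when top candidates are nearly tied.

```python
# Same three numbers, two different summation orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False


def greedy_pick(logits):
    """Temperature-0 style selection: return the token with the highest score."""
    # On an exact tie, max() returns the first token in the dict.
    return max(logits, key=logits.get)


print(greedy_pick({"rug": 0.6, "mat": a}))  # mat
print(greedy_pick({"rug": 0.6, "mat": b}))  # rug
```

Parallel hardware may reduce sums in different orders from run to run, which is one way identical inputs can occasionally produce different tokens.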


The Bigger Picture

Generation is fundamentally controlled probabilistic sampling.

The art of LLM engineering largely involves shaping probability distributions effectively.


Why GPT Scaled So Well

Autoregressive Transformers:

  • Parallelized training efficiently
  • Scaled with data and compute
  • Learned rich statistical structures
  • Generalized surprisingly well

This eventually led to:

  • GPT-3
  • GPT-4
  • Claude
  • Gemini
  • modern frontier models

One Massive Practical Problem Still Remains

Autoregressive generation has a computational challenge.

As conversations become longer:

  • Attention cost grows with every token added to the context
  • Inference slows down
  • Memory usage grows rapidly

How do production LLMs remain efficient while generating long outputs?

That is where:

  • KV cache
  • Inference optimization
  • Memory reuse

enter the picture.

We’ll unpack that in the next post.


Final Thought

GPT-style generation is fundamentally iterative probabilistic sequence continuation.

But when scaled across:

  • Massive datasets
  • Enormous parameter counts
  • Deep Transformer architectures

surprisingly sophisticated behaviors emerge.

That combination became one of the most important breakthroughs in modern AI.


Next

The Full Transformer Block: Residuals, FFNs, and LayerNorm