Implementing a Tiny Transformer in PyTorch
Part 11 of the Attention & Transformers Deep Dive Series
Introduction
In the previous post, we conceptually assembled a miniature GPT architecture.
We walked through:
- Tokenization
- Embeddings
- Positional encoding
- Self-attention
- Transformer blocks
- Autoregressive generation
But conceptual understanding and implementation understanding are different levels of mastery.
This post bridges that gap.
We are going to walk through:
- How a tiny Transformer is implemented in PyTorch
- What each major component looks like in code
- How tensors flow through the network
- How attention becomes actual matrix operations
This is NOT production-grade GPT engineering.
The goal is:
- Architectural clarity
- Implementation intuition
- Understanding tensor flow
By the end, Transformer code should feel dramatically less intimidating.
The Big Picture
A minimal GPT implementation usually contains:
Tokenizer
↓
Embedding Layer
↓
Positional Embeddings
↓
Transformer Blocks
├── Attention
├── FFN
├── Residuals
└── LayerNorm
↓
Output Projection
↓
Softmax
The implementation maps surprisingly cleanly onto the architecture diagrams we studied earlier.
Step 1 — Imports
Typical minimal setup:
import torch
import torch.nn as nn
import torch.nn.functional as F
PyTorch provides:
- Tensors
- Automatic differentiation
- GPU acceleration
- Neural network modules
Step 2 — Hyperparameters
A tiny GPT usually starts with small dimensions.
vocab_size = 5000
embed_dim = 128
num_heads = 4
num_layers = 4
context_length = 128
These are intentionally tiny compared to frontier models.
For comparison:
| Model | Hidden Size |
|---|---|
| Tiny GPT | 128 |
| GPT-2 (small) | 768 |
| GPT-4-class systems | thousands |
Step 3 — Token Embeddings
Embedding layer:
self.token_embedding = nn.Embedding(
    vocab_size,
    embed_dim
)
This creates a learnable lookup table with one row of size embed_dim per token ID.
Input:
[10, 25, 81]
Output:
A (3 × embed_dim) tensor.
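To make the lookup concrete, here is a minimal sketch using the hyperparameters from Step 2. The standalone variable names (`token_embedding`, `token_ids`) are illustrative assumptions; in the model the layer lives on `self`.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 5000, 128  # values from Step 2

token_embedding = nn.Embedding(vocab_size, embed_dim)

# Token IDs must be an integer tensor
token_ids = torch.tensor([10, 25, 81])
vectors = token_embedding(token_ids)  # (3, embed_dim)

# The lookup just selects rows 10, 25 and 81 of the weight matrix
same_rows = torch.equal(vectors, token_embedding.weight[token_ids])
print(vectors.shape, same_rows)
```

Because the layer is only a row lookup, its weight matrix is trained like any other parameter: gradients flow back into the rows that were selected.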
Step 4 — Positional Embeddings
Transformers need sequence awareness.
Simplified learned positional embeddings:
self.position_embedding = nn.Embedding(
    context_length,
    embed_dim
)
During the forward pass:
positions = torch.arange(seq_len)
position_embeddings = self.position_embedding(positions)
Then:
x = token_embeddings + position_embeddings
Each token vector now carries both semantic meaning and positional information.
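The addition relies on broadcasting: the positional embeddings have no batch dimension and are shared across every sequence in the batch. A small sketch, assuming a batch of random token IDs (the batch size and sequence length here are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, context_length = 5000, 128, 128

token_embedding = nn.Embedding(vocab_size, embed_dim)
position_embedding = nn.Embedding(context_length, embed_dim)

token_ids = torch.randint(0, vocab_size, (2, 16))  # (batch, seq_len)
batch, seq_len = token_ids.shape

# Create positions on the same device as the input to avoid device mismatches
positions = torch.arange(seq_len, device=token_ids.device)  # (seq_len,)

tok = token_embedding(token_ids)     # (batch, seq_len, embed_dim)
pos = position_embedding(positions)  # (seq_len, embed_dim)
x = tok + pos                        # pos is broadcast over the batch dimension
print(x.shape)
```

Note that `seq_len` must not exceed `context_length`, since the positional table has only `context_length` rows.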
Step 5 — Understanding Tensor Shapes
This is one of the most important Transformer skills.
Suppose:
- batch size = 32
- sequence length = 128
- embedding dim = 128
Tensor shape becomes:
(32, 128, 128)
Meaning:
(batch, sequence, embedding)
Understanding these dimensions is critical for debugging Transformer code.
Step 6 — Q/K/V Projections
Attention begins by projecting embeddings into:
- Queries
- Keys
- Values
Typical implementation:
self.query = nn.Linear(embed_dim, embed_dim)
self.key = nn.Linear(embed_dim, embed_dim)
self.value = nn.Linear(embed_dim, embed_dim)
Forward pass:
Q = self.query(x)
K = self.key(x)
V = self.value(x)
Now every token has Q/K/V representations.
Step 7 — Attention Scores
Attention similarity computation:
scores = Q @ K.transpose(-2, -1)
This computes:
$QK^\top$
Result shape:
(batch, seq_len, seq_len)
Each entry scores[b, i, j] measures how relevant token j is to token i.
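The following sketch projects a small random input into queries and keys and checks that one entry of the score matrix really is a single query-key dot product. The tiny dimensions are assumptions chosen to keep the example readable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, embed_dim = 2, 5, 8

x = torch.randn(batch, seq_len, embed_dim)
query = nn.Linear(embed_dim, embed_dim)
key = nn.Linear(embed_dim, embed_dim)

Q, K = query(x), key(x)           # each (batch, seq_len, embed_dim)
scores = Q @ K.transpose(-2, -1)  # (batch, seq_len, seq_len)

# scores[b, i, j] is the dot product of query i with key j in batch b
manual = torch.dot(Q[0, 1], K[0, 3])
print(scores.shape, torch.allclose(scores[0, 1, 3], manual, atol=1e-5))
```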
Step 8 — Scaling
Transformer scaling step:
scores = scores / (K.size(-1) ** 0.5)
Equivalent to:
$\frac{QK^\top}{\sqrt{d_k}}$
where $d_k$ is the key dimension, K.size(-1).
This stabilizes:
- Gradients
- Softmax behavior
- Training dynamics
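One way to see why the scaling helps: for vectors with roughly unit-variance entries, the dot product has a standard deviation of about the square root of the key dimension, so unscaled scores grow with the dimension and push softmax toward saturation. The sketch below checks this numerically on random data; the sample count and dimension are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
d_k = 64

# Random stand-ins for queries and keys with unit-variance entries
Q = torch.randn(1000, d_k)
K = torch.randn(1000, d_k)

raw = Q @ K.T                # each entry is a sum of d_k products
scaled = raw / (d_k ** 0.5)  # the scaling step from above

raw_std = raw.std().item()        # roughly sqrt(d_k) = 8
scaled_std = scaled.std().item()  # roughly 1
print(raw_std, scaled_std)
```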
Step 9 — Causal Masking
GPT must hide future tokens.
Typical mask:
mask = torch.tril(torch.ones(seq_len, seq_len))
Example:
1 0 0
1 1 0
1 1 1
Apply mask:
scores = scores.masked_fill(mask == 0, float('-inf'))
Future positions now receive zero attention probability after softmax.
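A short check of that claim, using random scores in place of real attention scores. The boolean mask is a stylistic assumption; the `mask == 0` form above behaves the same way.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len = 2, 4
scores = torch.randn(batch, seq_len, seq_len)

# True on and below the diagonal: positions a token is allowed to see
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device))

# The (seq_len, seq_len) mask broadcasts over the batch dimension
masked = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(masked, dim=-1)

print(weights[0])  # upper triangle is exactly zero, each row sums to 1
```

The first row always ends up as `[1, 0, 0, 0]`: the first token can only attend to itself.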
Step 10 — Softmax
Attention probabilities:
weights = F.softmax(scores, dim=-1)
Now attention weights sum to 1 for every token.
Step 11 — Weighted Aggregation
Final attention output:
attention_output = weights @ V
Each output row is a weighted average of the value vectors, with the attention weights deciding how much each visible token contributes.
Now each token becomes context-aware.
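Steps 7 through 11 can be collected into one function. This is a minimal single-head sketch under the assumptions of this post, not an optimized implementation; the function name `causal_attention` is hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    seq_len, d_k = Q.size(-2), Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=Q.device))
    scores = scores.masked_fill(~mask, float("-inf"))  # hide future tokens
    weights = F.softmax(scores, dim=-1)                # rows sum to 1
    return weights @ V, weights                        # (batch, seq, d_k)

torch.manual_seed(0)
Q = torch.randn(2, 5, 16)
K = torch.randn(2, 5, 16)
V = torch.randn(2, 5, 16)
out, weights = causal_attention(Q, K, V)
print(out.shape)
```

A useful sanity check: the first token can only see itself, so its output equals its own value vector.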
Step 12 — Multi-Head Attention
Instead of one attention mechanism, Transformers split the embedding dimension across multiple heads, with head_dim = embed_dim // num_heads.
Typical reshape:
x.view(batch, seq_len, num_heads, head_dim)
Then transpose:
x.transpose(1, 2)
Result:
(batch, heads, seq_len, head_dim)
Each head can now learn different relationships between tokens.
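The split and the later merge are pure reshapes, which the sketch below verifies by round-tripping a random tensor. The merge step (`transpose`, `contiguous`, `view`) is not shown in the text above and is included here as the usual counterpart, under that assumption.

```python
import torch

batch, seq_len, embed_dim, num_heads = 2, 6, 128, 4
head_dim = embed_dim // num_heads  # 32; embed_dim must be divisible by num_heads

x = torch.randn(batch, seq_len, embed_dim)

# Split: (batch, seq, embed) -> (batch, heads, seq, head_dim)
heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

# ... attention would run independently per head here ...

# Merge: transpose back, make memory contiguous, then flatten the heads
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)

print(heads.shape, torch.equal(merged, x))
```

Head `h` simply sees the slice `x[..., h * head_dim : (h + 1) * head_dim]`. The `contiguous()` call matters: `view` raises an error on the transposed tensor without it.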
Step 13 — Feed Forward Network
Typical FFN:
self.ffn = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.GELU(),
    nn.Linear(4 * embed_dim, embed_dim)
)
Why expand first?
Because wider hidden spaces improve representational capacity.
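A quick sketch of the FFN on its own, showing that it preserves the input shape and acts on each position independently. The parameter count is included to give a sense of where a block's weights live; the input sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 128

ffn = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),  # expand to 512
    nn.GELU(),
    nn.Linear(4 * embed_dim, embed_dim),  # project back to 128
)

x = torch.randn(2, 10, embed_dim)  # (batch, seq_len, embed_dim)
y = ffn(x)                         # same shape as x

# Weights plus biases of both linear layers
n_params = sum(p.numel() for p in ffn.parameters())
print(y.shape, n_params)

# Position-wise: feeding one position alone gives the same result
single = ffn(x[:, 3])
```

Unlike attention, the FFN never mixes information between positions; it transforms each token vector separately.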
Step 14 — Residual Connections
Residuals stabilize deep learning.
Implementation:
x = x + attention_output
and later:
x = x + ffn_output
Residuals:
- Preserve information
- Improve gradient flow
- Enable deep scaling
Step 15 — Layer Normalization
Typical implementation:
self.ln = nn.LayerNorm(embed_dim)
Applied as:
x = self.ln(x)
This stabilizes:
- Activations
- Gradients
- Optimization
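To see what the normalization does, the sketch below feeds deliberately shifted and scaled random data through a freshly initialized `nn.LayerNorm` and measures the statistics of each token vector afterwards. The shift and scale values are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 128
ln = nn.LayerNorm(embed_dim)  # starts with weight = 1 and bias = 0

# Inputs with mean around 3 and standard deviation around 5
x = torch.randn(2, 10, embed_dim) * 5 + 3
y = ln(x)

# Statistics are computed per token, over the embedding dimension
mean = y.mean(dim=-1)                # close to 0 everywhere
var = y.var(dim=-1, unbiased=False)  # close to 1 everywhere
print(mean.abs().max().item(), var.mean().item())
```

During training the learnable weight and bias let the model move away from exactly zero mean and unit variance where that helps.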
Step 16 — Transformer Block
A minimal Transformer block:
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        # Attention and FFN stand for the modules built in the earlier steps
        self.attn = Attention()
        self.ffn = FFN()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Pre-norm layout: normalize, transform, then add the residual
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
This compact structure powers modern LLMs.
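For reference, here is one way the pieces could fit into a block that runs end to end. It is a sketch under assumptions: the class name `CausalSelfAttention`, the constructor arguments, and the output projection `proj` are illustrative choices, not the only way to fill in `Attention` and `FFN`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)  # mixes the heads back together

    def split_heads(self, t):
        B, T, _ = t.shape
        return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        B, T, C = x.shape
        q = self.split_heads(self.query(x))  # (B, heads, T, head_dim)
        k = self.split_heads(self.key(x))
        v = self.split_heads(self.value(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v  # (B, heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

class Block(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

torch.manual_seed(0)
block = Block(embed_dim=128, num_heads=4).eval()
x = torch.randn(2, 16, 128)
with torch.no_grad():
    y = block(x)
print(y.shape)  # same shape as the input
```

Because input and output shapes match, blocks can be stacked freely. The causal mask also means that changing a later token never changes the outputs at earlier positions.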
Step 17 — Stacking Blocks
GPT stacks many Transformer blocks:
self.blocks = nn.Sequential(
    *[Block() for _ in range(num_layers)]
)
Depth enables:
- Progressively richer representations
- Hierarchical abstraction building
- Complex reasoning patterns
Step 18 — Final Output Layer
Eventually hidden states project into vocabulary logits.
Implementation:
self.lm_head = nn.Linear(embed_dim, vocab_size)
Output shape:
(batch, seq_len, vocab_size)
Each token position now holds a vector of logits over the vocabulary; a softmax turns these into next-token probabilities.
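The distinction between logits and probabilities is easy to check in code. In this sketch, random hidden states stand in for the output of the Transformer blocks (an assumption made to keep the example self-contained).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 5000, 128

lm_head = nn.Linear(embed_dim, vocab_size)

# Stand-in for the final hidden states: (batch, seq_len, embed_dim)
hidden = torch.randn(2, 16, embed_dim)
logits = lm_head(hidden)  # (batch, seq_len, vocab_size), unnormalized

# Probabilities for the token that follows the last position
probs = F.softmax(logits[:, -1, :], dim=-1)  # (batch, vocab_size)
print(logits.shape, probs.sum(dim=-1))
```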
Step 19 — Cross Entropy Loss
Training objective:
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    targets.view(-1)
)
This teaches next-token prediction.
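The targets are the input tokens shifted left by one position, so position t is trained to predict token t + 1. The sketch below builds that pair from one token tensor and uses all-zero logits as a stand-in for a model output (an assumption), which gives the loss of a uniform guess. It uses `reshape` rather than `view` because the sliced targets are not contiguous in memory.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 5000

tokens = torch.randint(0, vocab_size, (4, 17))  # (batch, seq_len + 1)
inputs = tokens[:, :-1]   # what the model sees
targets = tokens[:, 1:]   # the same sequence shifted by one token

# Stand-in for model(inputs): all-zero logits mean a uniform prediction
logits = torch.zeros(inputs.size(0), inputs.size(1), vocab_size)

loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),  # (batch * seq_len, vocab_size)
    targets.reshape(-1),             # (batch * seq_len,)
)
print(loss.item(), math.log(vocab_size))
```

An untrained model should start near ln(vocab_size), about 8.5 here. A much larger initial loss usually points to a bug in initialization or in the shift.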
Step 20 — Generation Loop
Inference becomes iterative.
Simplified generation:
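for _ in range(max_tokens):
    logits = model(tokens)              # (batch, seq_len, vocab_size)
    next_token = sample(logits[:, -1])  # sample from the last position: (batch, 1)
    tokens = torch.cat([tokens, next_token], dim=1)
Here tokens is a (batch, seq_len) tensor, so each new token is concatenated rather than appended.
This loop powers GPT-style generation.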
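A runnable version of the loop might look like the sketch below. The `model` here is a hypothetical stand-in (an embedding followed by a linear head, with no attention) used only so the example executes; the `generate` function, its `temperature` parameter, and the cropping to `context_length` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, context_length = 5000, 128, 128

# Hypothetical stand-in for a trained GPT: maps token IDs to logits
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)

@torch.no_grad()
def generate(model, tokens, max_tokens, temperature=1.0):
    for _ in range(max_tokens):
        context = tokens[:, -context_length:]    # never exceed the context window
        logits = model(context)                  # (batch, seq_len, vocab_size)
        logits = logits[:, -1, :] / temperature  # only the last position matters
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # (batch, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

torch.manual_seed(0)
prompt = torch.tensor([[10, 25, 81]])  # (batch=1, seq_len=3)
out = generate(model, prompt, max_tokens=5)
print(out.shape)  # the prompt plus 5 new tokens
```

Lower temperatures sharpen the distribution toward the most likely token, while higher ones make sampling more random. Replacing `torch.multinomial` with an `argmax` over the logits gives greedy decoding.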
Why Tensor Shapes Matter So Much
Most Transformer debugging problems involve:
- Incorrect dimensions
- Reshaping mistakes
- Masking issues
- Head-splitting errors
Understanding tensor flow is one of the biggest implementation skills.
What Production Systems Add
Real-world systems add:
- KV cache
- Flash Attention
- Mixed precision
- Distributed training
- Tensor parallelism
- Quantization
- Speculative decoding
Production Transformer engineering becomes significantly more complex.
Why Building Tiny Transformers Is Valuable
Implementing even a tiny GPT teaches:
- Architecture flow
- Tensor reasoning
- Attention mechanics
- Optimization bottlenecks
- Inference behavior
This dramatically improves:
- Research comprehension
- Debugging ability
- Systems intuition
One Important Realization
Modern LLMs ultimately reduce to:
- Matrix operations
- Probabilistic optimization
- Representation learning
The sophistication emerges from:
- Scale
- Optimization
- Training
- Systems engineering
not hidden symbolic magic.
Suggested Next Steps
After implementing a tiny Transformer, strong next projects include:
Practical Projects
- Character-level GPT
- Toy chatbot
- Attention visualization tool
- Mini RAG pipeline
- Prompt experimentation
Advanced Topics
- LoRA fine-tuning
- Quantization
- Distributed training
- Inference serving
- Flash Attention internals
Final Thought
One of the most powerful moments in learning Transformers is realizing:
the architecture is elegant underneath the complexity.
The intimidating diagrams eventually collapse into:
- Embeddings
- Matrix multiplications
- Normalization
- Attention
- Probabilistic generation
repeated at massive scale. And that combination became the foundation of modern AI.
Critical Mental Model
Transformer implementations are fundamentally:
large-scale tensor manipulation systems performing iterative contextual representation refinement.