Implementing a Tiny Transformer in PyTorch

Part 11 of the Attention & Transformers Deep Dive Series


Introduction

In the previous post, we conceptually assembled a miniature GPT architecture.

We walked through:

  • Tokenization
  • Embeddings
  • Positional encoding
  • Self-attention
  • Transformer blocks
  • Autoregressive generation

But conceptual understanding and implementation understanding are different levels of mastery.

This post bridges that gap.

We are going to walk through:

  • How a tiny Transformer is implemented in PyTorch
  • What each major component looks like in code
  • How tensors flow through the network
  • How attention becomes actual matrix operations

This is NOT production-grade GPT engineering.

The goal is:

  • Architectural clarity
  • Implementation intuition
  • Understanding tensor flow

By the end, Transformer code should feel dramatically less intimidating.


The Big Picture

A minimal GPT implementation usually contains:

Tokenizer

Embedding Layer

Positional Embeddings

Transformer Blocks
        ├── Attention
        ├── FFN
        ├── Residuals
        └── LayerNorm

Output Projection

Softmax

The implementation maps surprisingly cleanly to the architecture diagrams we studied earlier.


Step 1 — Imports

Typical minimal setup:

import torch
import torch.nn as nn
import torch.nn.functional as F

PyTorch provides:

  • Tensors
  • Automatic differentiation
  • GPU acceleration
  • Neural network modules

Step 2 — Hyperparameters

A tiny GPT usually starts with small dimensions.

vocab_size = 5000
embed_dim = 128
num_heads = 4
num_layers = 4
context_length = 128

These are intentionally tiny compared to frontier models.

For comparison:

Model                  Hidden Size
Tiny GPT               128
GPT-2                  768
GPT-4-class systems    thousands

Step 3 — Token Embeddings

Embedding layer:

self.token_embedding = nn.Embedding(
    vocab_size,
    embed_dim
)

This creates a learnable lookup table.

Input:

[10, 25, 81]

Output:

a tensor of shape (3, embed_dim), one embedding vector per token ID.
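To make the lookup concrete, here is a minimal sketch using the hyperparameters from Step 2. The standalone variable names are illustrative (in the model the layer lives on self), and the check at the end only confirms that the embedding call is plain row indexing into the weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 5000
embed_dim = 128

# Learnable lookup table: one row of size embed_dim per token ID.
token_embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([10, 25, 81])    # shape: (3,)
vectors = token_embedding(token_ids)      # shape: (3, embed_dim)

# The lookup is equivalent to indexing rows of the weight matrix.
same_as_indexing = torch.equal(vectors, token_embedding.weight[token_ids])
print(vectors.shape, same_as_indexing)
```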


Step 4 — Positional Embeddings

Transformers need sequence awareness.

Simplified learned positional embeddings:

self.position_embedding = nn.Embedding(
    context_length,
    embed_dim
)

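During the forward pass, build the position indices and look up their embeddings:

positions = torch.arange(seq_len)
position_embeddings = self.position_embedding(positions)

Then add them to the token embeddings:

x = token_embeddings + position_embeddings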

Now each token contains:

  • Semantic meaning
  • Positional information

simultaneously.
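The two embedding steps can be sketched together as follows. The batch of random token IDs is a hypothetical input, and the layers are standalone variables rather than attributes on a model. Note that the position embeddings have no batch dimension; broadcasting adds them to every sequence.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, context_length = 5000, 128, 128

token_embedding = nn.Embedding(vocab_size, embed_dim)
position_embedding = nn.Embedding(context_length, embed_dim)

# Hypothetical batch: 2 sequences of 6 token IDs each.
idx = torch.randint(0, vocab_size, (2, 6))
batch, seq_len = idx.shape

positions = torch.arange(seq_len, device=idx.device)   # (seq_len,)
token_embeddings = token_embedding(idx)                 # (batch, seq_len, embed_dim)
position_embeddings = position_embedding(positions)     # (seq_len, embed_dim)

# Broadcasting adds the same position vectors to every sequence in the batch.
x = token_embeddings + position_embeddings              # (batch, seq_len, embed_dim)
print(x.shape)
```

Creating positions on idx.device avoids a device mismatch when the model runs on a GPU.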


Step 5 — Understanding Tensor Shapes

This is one of the most important Transformer skills.

Suppose:

  • batch size = 32
  • sequence length = 128
  • embedding dim = 128

Tensor shape becomes:

(32, 128, 128)

Meaning:

(batch, sequence, embedding)

Understanding these dimensions is critical for debugging Transformer code.


Step 6 — Q/K/V Projections

Attention begins by projecting embeddings into:

  • Queries
  • Keys
  • Values

Typical implementation:

self.query = nn.Linear(embed_dim, embed_dim)
self.key = nn.Linear(embed_dim, embed_dim)
self.value = nn.Linear(embed_dim, embed_dim)

Forward pass:

Q = self.query(x)
K = self.key(x)
V = self.value(x)

Now every token has Q/K/V representations.
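A small sketch of the projections with the shapes from Step 5. The layers are standalone variables here for illustration. The final check shows that nn.Linear acts on the last dimension only, so each token is projected independently of its neighbors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 128

query = nn.Linear(embed_dim, embed_dim)
key = nn.Linear(embed_dim, embed_dim)
value = nn.Linear(embed_dim, embed_dim)

x = torch.randn(32, 128, embed_dim)      # (batch, seq_len, embed_dim)

Q, K, V = query(x), key(x), value(x)     # each: (batch, seq_len, embed_dim)

# Projecting a single token on its own gives the same vector as the batched call.
single = query(x[0, 5])
print(Q.shape, torch.allclose(Q[0, 5], single, atol=1e-5))
```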


Step 7 — Attention Scores

Attention similarity computation:

scores = Q @ K.transpose(-2, -1)

This computes $QK^{\top}$, the dot product of every query with every key.

Result shape:

(batch, seq_len, seq_len)

Each entry represents token-to-token attention relevance.


Step 8 — Scaling

Transformer scaling step:

scores = scores / (K.size(-1) ** 0.5)

Equivalent to:

$\text{scores} = \frac{QK^{\top}}{\sqrt{d_k}}$

where $d_k$ is the key dimension, K.size(-1).

This stabilizes:

  • Gradients
  • Softmax behavior
  • Training dynamics
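The effect of the scaling step can be checked numerically. The sketch below assumes Q and K have roughly unit-variance entries (random normal tensors stand in for real projections). Under that assumption, raw dot products have a standard deviation near sqrt(d_k), and dividing by sqrt(d_k) brings it back to roughly 1.

```python
import torch

torch.manual_seed(0)
batch, seq_len, d_k = 4, 16, 128

# Stand-ins for projected queries and keys with unit-variance entries.
Q = torch.randn(batch, seq_len, d_k)
K = torch.randn(batch, seq_len, d_k)

scores = Q @ K.transpose(-2, -1)     # (batch, seq_len, seq_len)
scaled = scores / (d_k ** 0.5)

raw_std = scores.std().item()        # near sqrt(128), about 11
scaled_std = scaled.std().item()     # near 1
print(scores.shape, raw_std, scaled_std)
```

Without the division, large scores push softmax toward near one-hot outputs, where gradients become very small.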

Step 9 — Causal Masking

GPT must hide future tokens.

Typical mask:

mask = torch.tril(torch.ones(seq_len, seq_len))

Example:

1 0 0
1 1 0
1 1 1

Apply mask:

scores = scores.masked_fill(mask == 0, float('-inf'))

Because exp(-inf) = 0, future positions receive zero attention probability once softmax is applied.
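A quick way to see the mask at work is to use all-zero dummy scores, so the softmax output depends only on which positions are visible. This sketch uses a boolean mask, which is an equivalent variation on the mask == 0 comparison.

```python
import torch
import torch.nn.functional as F

seq_len = 3
scores = torch.zeros(1, seq_len, seq_len)    # dummy scores: (batch, seq_len, seq_len)

# Lower-triangular boolean mask: True where attention is allowed.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# The (seq_len, seq_len) mask broadcasts over the batch dimension.
masked_scores = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)
print(weights[0])
```

With equal scores, row 0 puts all weight on the first token, row 1 splits it evenly over two tokens, and row 2 over three. Entries above the diagonal are exactly zero.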


Step 10 — Softmax

Attention probabilities:

weights = F.softmax(scores, dim=-1)

Each row of weights now sums to 1: every token distributes a total attention of 1 across the positions it is allowed to see.


Step 11 — Weighted Aggregation

Final attention output:

attention_output = weights @ V

This retrieves weighted contextual information.

Now each token becomes context-aware.
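Steps 7 through 11 fit into one short function. This is a single-head sketch under the assumptions used so far; the name causal_attention is illustrative. The check at the end perturbs the last token's key and value and confirms that the outputs of earlier positions do not change, which is what causal masking should guarantee.

```python
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    """Single-head causal attention; Q, K, V are (batch, seq_len, d_k)."""
    seq_len = Q.size(-2)
    scores = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=Q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
Q, K, V = (torch.randn(2, 5, 16) for _ in range(3))
out, weights = causal_attention(Q, K, V)

# Perturb only the last position's key and value.
K2, V2 = K.clone(), V.clone()
K2[:, -1] += 1.0
V2[:, -1] += 1.0
out2, _ = causal_attention(Q, K2, V2)

earlier_unchanged = torch.allclose(out[:, :-1], out2[:, :-1])
print(out.shape, earlier_unchanged)
```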


Step 12 — Multi-Head Attention

Instead of running one attention mechanism over the full embedding, Transformers split the embedding dimension across multiple heads, each of size head_dim = embed_dim // num_heads.

Typical reshape:

x.view(batch, seq_len, num_heads, head_dim)

Then transpose:

x.transpose(1, 2)

Result:

(batch, heads, seq_len, head_dim)

Each head attends independently, so different heads can learn different relationships between tokens.
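The reshape is easy to get wrong, so here is a round-trip sketch: split into heads, then merge back and confirm nothing was scrambled. In a full attention layer the split is applied to Q, K and V rather than to x directly; a random tensor stands in for any of them here.

```python
import torch

batch, seq_len, embed_dim, num_heads = 2, 8, 128, 4
head_dim = embed_dim // num_heads    # 32; embed_dim must divide evenly by num_heads

x = torch.randn(batch, seq_len, embed_dim)

# Split: (batch, seq_len, embed_dim) -> (batch, heads, seq_len, head_dim)
heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

# ... per-head attention would operate on the last two dimensions here ...

# Merge: transpose back, then make memory contiguous before calling view().
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)
round_trip_ok = torch.equal(merged, x)
print(heads.shape, round_trip_ok)
```

The contiguous() call matters: view() raises an error on a transposed tensor whose memory layout no longer matches the requested shape.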


Step 13 — Feed Forward Network

Typical FFN:

self.ffn = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.GELU(),
    nn.Linear(4 * embed_dim, embed_dim)
)

Why expand first?

Because the wider hidden layer gives the network more capacity to transform each token's representation before projecting it back down to embed_dim.
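As a rough illustration of what the 4x expansion costs, the sketch below runs the FFN on a batch and counts its parameters for embed_dim = 128. The input and output shapes match, since the FFN is applied to each token independently.

```python
import torch
import torch.nn as nn

embed_dim = 128
ffn = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),   # expand: 128 -> 512
    nn.GELU(),
    nn.Linear(4 * embed_dim, embed_dim),   # project back: 512 -> 128
)

x = torch.randn(32, 128, embed_dim)        # (batch, seq_len, embed_dim)
y = ffn(x)                                 # same shape as x

# Weights plus biases: (128*512 + 512) + (512*128 + 128) = 131712
num_params = sum(p.numel() for p in ffn.parameters())
print(y.shape, num_params)
```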


Step 14 — Residual Connections

Residual connections stabilize the training of deep networks.

Implementation:

x = x + attention_output

and later:

x = x + ffn_output

Residuals:

  • Preserve information
  • Improve gradient flow
  • Enable deep scaling

Step 15 — Layer Normalization

Typical implementation:

self.ln = nn.LayerNorm(embed_dim)

Applied as:

x = self.ln(x)

This stabilizes:

  • Activations
  • Gradients
  • Optimization
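To see what the normalization does, the sketch below feeds LayerNorm activations with a deliberately shifted mean and inflated scale (hypothetical values), then measures the statistics of the output. The normalization runs over the embedding dimension, separately for every token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 128
ln = nn.LayerNorm(embed_dim)

# Activations with mean near 5 and standard deviation near 3.
x = 5.0 + 3.0 * torch.randn(4, 10, embed_dim)
y = ln(x)

# Statistics over the last (embedding) dimension, one value per token.
per_token_mean = y.mean(dim=-1)                  # close to 0
per_token_var = y.var(dim=-1, unbiased=False)    # close to 1
print(per_token_mean.abs().max().item(), per_token_var.mean().item())
```

At initialization the learnable scale is 1 and the shift is 0, so the output is the plain normalized input. Training can move both.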

Step 16 — Transformer Block

A minimal Transformer block:

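class Block(nn.Module):
    # Attention and FFN stand for the modules assembled in Steps 6-13.

    def __init__(self):
        super().__init__()
        self.attn = Attention()
        self.ffn = FFN()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Pre-norm layout: normalize, apply the sub-layer, then add the residual.
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x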

This compact structure powers modern LLMs.
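For a version that runs on its own, here is one possible way to fill in the attention and FFN pieces. The class name CausalSelfAttention, the constructor arguments, and the output projection proj are assumptions of this sketch, not part of the snippets above. Dropout is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)   # assumed output projection

    def split_heads(self, t):
        B, T, C = t.shape
        # (B, T, C) -> (B, heads, T, head_dim)
        return t.view(B, T, self.num_heads, C // self.num_heads).transpose(1, 2)

    def forward(self, x):
        B, T, C = x.shape
        q = self.split_heads(self.query(x))
        k = self.split_heads(self.key(x))
        v = self.split_heads(self.value(x))
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)   # (B, heads, T, T)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v                      # (B, heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)     # merge heads
        return self.proj(out)

class Block(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around attention
        x = x + self.ffn(self.ln2(x))    # residual around the FFN
        return x

torch.manual_seed(0)
block = Block(embed_dim=128, num_heads=4)
x = torch.randn(2, 16, 128)
y = block(x)
print(y.shape)
```

Passing embed_dim and num_heads to the constructor, instead of reading module-level globals, makes the block easier to test at different sizes.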


Step 17 — Stacking Blocks

GPT stacks many Transformer blocks:

self.blocks = nn.Sequential(
    *[Block() for _ in range(num_layers)]
)

Depth enables:

  • Progressively richer representations
  • Hierarchical abstraction building
  • Complex reasoning patterns

Step 18 — Final Output Layer

After the final block, the hidden states are projected to vocabulary logits.

Implementation:

self.lm_head = nn.Linear(embed_dim, vocab_size)

Output shape:

(batch, seq_len, vocab_size)

Each position now holds one logit per vocabulary entry; applying softmax turns these logits into next-token probabilities.
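The step from logits to probabilities can be sketched as follows. A random tensor stands in for the output of the final Transformer block, and lm_head is a standalone variable for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 5000, 128
lm_head = nn.Linear(embed_dim, vocab_size)

hidden = torch.randn(2, 16, embed_dim)     # stand-in for the final block output
logits = lm_head(hidden)                   # (batch, seq_len, vocab_size)

# Position t scores the token at position t + 1.
# Generation only needs the distribution at the last position.
last_logits = logits[:, -1, :]             # (batch, vocab_size)
probs = F.softmax(last_logits, dim=-1)     # each row sums to 1
print(logits.shape, probs.shape)
```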


Step 19 — Cross Entropy Loss

Training objective:

loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    targets.view(-1)
)

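Here targets are the input tokens shifted one position to the left, so the loss at each position measures how well the model predicts the token that comes next.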
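A minimal sketch of how inputs and targets can be built from one batch of token IDs. The all-zero logits are a stand-in for model(inputs) and represent a model that predicts every token with equal probability, in which case the loss equals ln(vocab_size). This sketch uses reshape instead of view because the sliced targets tensor is not contiguous.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 5000
tokens = torch.randint(0, vocab_size, (4, 9))    # hypothetical batch of token IDs

# Shift by one: the model reads tokens[:, :-1] and is scored against tokens[:, 1:].
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# Stand-in for model(inputs): all-zero logits, i.e. a uniform prediction.
logits = torch.zeros(inputs.size(0), inputs.size(1), vocab_size)

loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),   # (batch * seq_len, vocab_size)
    targets.reshape(-1),              # (batch * seq_len,)
)
print(loss.item(), math.log(vocab_size))
```

An untrained model with small random weights usually starts near this value, which makes ln(vocab_size) a handy sanity check for the first training steps.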


Step 20 — Generation Loop

Inference becomes iterative.

Simplified generation:

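for _ in range(max_tokens):
    # Keep at most context_length tokens so positions stay in range.
    logits = model(tokens[:, -context_length:])    # (batch, seq_len, vocab_size)
    # Only the last position predicts the next token.
    probs = F.softmax(logits[:, -1, :], dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    tokens = torch.cat([tokens, next_token], dim=1)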

This loop powers GPT-style generation.
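Here is a self-contained sketch of the loop as a function. The generate helper and the toy_model are hypothetical: toy_model is a stand-in that looks at each token in isolation and is not a Transformer. Any module that maps (batch, seq_len) token IDs to (batch, seq_len, vocab_size) logits could take its place.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def generate(model, tokens, max_new_tokens, context_length):
    """Append max_new_tokens sampled IDs to tokens of shape (batch, seq_len)."""
    for _ in range(max_new_tokens):
        window = tokens[:, -context_length:]                     # crop to the context window
        logits = model(window)                                   # (batch, seq_len, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)              # next-token distribution
        next_token = torch.multinomial(probs, num_samples=1)     # (batch, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

torch.manual_seed(0)
vocab_size, embed_dim = 50, 16

# Hypothetical stand-in model: embedding followed by a linear layer.
toy_model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)

prompt = torch.tensor([[1, 2, 3]])
out = generate(toy_model, prompt, max_new_tokens=5, context_length=8)
print(out.shape)
```

Sampling with torch.multinomial is one choice; taking the argmax of probs instead gives greedy decoding. The loop recomputes attention over the whole window at every step, which is the cost a KV cache removes in production systems.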


Why Tensor Shapes Matter So Much

Most Transformer debugging problems involve:

  • Incorrect dimensions
  • Reshaping mistakes
  • Masking issues
  • Head-splitting errors

Understanding tensor flow is one of the biggest implementation skills.


What Production Systems Add

Real-world systems add:

  • KV cache
  • Flash Attention
  • Mixed precision
  • Distributed training
  • Tensor parallelism
  • Quantization
  • Speculative decoding

Production Transformer engineering becomes significantly more complex.


Why Building Tiny Transformers Is Valuable

Implementing even a tiny GPT teaches:

  • Architecture flow
  • Tensor reasoning
  • Attention mechanics
  • Optimization bottlenecks
  • Inference behavior

This dramatically improves:

  • Research comprehension
  • Debugging ability
  • Systems intuition

One Important Realization

Modern LLMs ultimately reduce to:

  • Matrix operations
  • Probabilistic optimization
  • Representation learning

The sophistication emerges from:

  • Scale
  • Optimization
  • Training
  • Systems engineering

not hidden symbolic magic.


Suggested Next Steps

After implementing a tiny Transformer, strong next projects include:

Practical Projects

  • Character-level GPT
  • Toy chatbot
  • Attention visualization tool
  • Mini RAG pipeline
  • Prompt experimentation

Advanced Topics

  • LoRA fine-tuning
  • Quantization
  • Distributed training
  • Inference serving
  • Flash Attention internals

Final Thought

One of the most powerful moments in learning Transformers is realizing:

the architecture is elegant underneath the complexity.

The intimidating diagrams eventually collapse into:

  • Embeddings
  • Matrix multiplications
  • Normalization
  • Attention
  • Probabilistic generation

repeated at massive scale. And that combination became the foundation of modern AI.


Critical Mental Model

Transformer implementations are fundamentally:

large-scale tensor manipulation systems performing iterative contextual representation refinement.