The transformer architecture, introduced in 2017, is the foundation of virtually every major language model in use today. Knowing how it works, not just that it "uses attention", gives useful intuition about what language models can and cannot do.
The problem transformers solved
Before transformers, the dominant architecture for sequential data (text, audio, time series) was the recurrent neural network (RNN). RNNs process sequences one element at a time, maintaining a hidden state that carries information from previous elements.
This worked but had two significant problems:
Sequential processing: Because each step depends on the result of the previous step, the computation cannot be parallelized across the positions of a sequence, so training on long sequences is slow.
Long-range dependencies: Information from early in a sequence degrades as it passes through many time steps. RNNs struggle to use context from 100 tokens ago when processing the current token.
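To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass in NumPy. The function name `rnn_forward`, the weight names, and the toy sizes are assumptions chosen for illustration, not taken from any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence, one element at a time."""
    hidden = np.zeros(W_hh.shape[0])  # initial hidden state
    states = []
    for x_t in inputs:
        # Step t needs the hidden state from step t-1, so this loop
        # cannot be spread across the positions of the sequence.
        hidden = np.tanh(x_t @ W_xh + hidden @ W_hh + b_h)
        states.append(hidden)
    return np.stack(states)

# Toy sizes and random weights, purely illustrative.
rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 6, 4, 8
inputs = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)  # (6, 8)
```

Information from the first element reaches the last hidden state only by passing through every intermediate update, which is one way to picture why early context tends to fade.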
Transformers solve both problems by abandoning sequential processing and using attention to relate every token to every other token in a single step.
Tokens and embeddings
Text is first converted to tokens — sub-word units from a learned vocabulary. Each token is mapped to a dense vector (embedding) that represents its meaning.
The input to a transformer is a matrix of embeddings — one row per token in the input sequence. All subsequent operations work on this matrix.
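As a rough sketch of this step, the snippet below builds the input matrix from a toy vocabulary. The six-entry vocabulary, the way the sentence is split, and the random embedding table are all hypothetical; real tokenizers learn much larger vocabularies, and the embedding values are learned during training.

```python
import numpy as np

# Hypothetical toy vocabulary of sub-word units.
vocab = {"the": 0, "bank": 1, "by": 2, "river": 3, "flood": 4, "ed": 5}
d_model = 8  # embedding size, chosen arbitrarily for the example

rng = np.random.default_rng(0)
# Random stand-in for an embedding table that would normally be learned.
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "bank", "by", "the", "river", "flood", "ed"]
token_ids = np.array([vocab[t] for t in tokens])
X = embedding_table[token_ids]  # one row per token

print(token_ids)  # [0 1 2 0 3 4 5]
print(X.shape)    # (7, 8)
```

Both occurrences of "the" map to the same row of the table, so at this stage the matrix carries no information about context or position.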
Self-attention: the core mechanism
The key innovation in transformers is self-attention, which allows every token to look at every other token and decide how much to "attend" to each one.
For each token, three vectors are computed:
- Query (Q): What this token is looking for
- Key (K): What this token contains (how it describes itself to others)
- Value (V): The information this token contributes when attended to
The attention score between two tokens is computed as the dot product of the first token's Query with the second token's Key, divided by the square root of the Key dimension so that scores do not grow with vector size. Higher scores mean more attention.
These scores are softmaxed to produce weights that sum to 1, then multiplied by the Value vectors to produce a weighted sum. The result is a new representation of each token that incorporates information from the tokens most relevant to it.
In the sentence "The bank by the river flooded", the attention mechanism lets "bank" attend strongly to "river" and "flooded", so its updated representation reflects a riverbank rather than a financial institution.
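The computation described in this section can be sketched in a few lines of NumPy. This is a minimal single-head version under stated assumptions: the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned weights, the sizes are arbitrary, and no masking is applied (language models that generate text left to right additionally mask out future tokens).

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no masking)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of Value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)   # (6, 8)
print(weights.shape)  # (6, 6)
```

Row i of `weights` shows how much token i attends to each token in the sequence, including itself.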
Multi-head attention
Self-attention is performed multiple times in parallel, each with different learned Q, K, V projection matrices. These are called attention heads.
Each head can attend to different aspects of the relationships between tokens. One head might capture syntactic structure, another semantic relationships, another coreference. The outputs of all heads are concatenated and projected back to the original dimension.
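A common way to implement this, sketched below, is to compute full-width Q, K, V projections once and then split them into heads. The function name, the even split of the hidden dimension across heads, and the random weights are assumptions made for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    per_head = weights @ V                               # (n_heads, seq_len, d_head)
    # Concatenate the heads side by side, then project back to d_model.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

output = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(output.shape)  # (6, 16)
```

Each head works on its own slice of the projected vectors, so the heads can learn different attention patterns at roughly the cost of one full-width attention.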
The feed-forward sublayer
After each attention sublayer, each token's representation passes through a position-wise feed-forward network. This network applies two linear transformations with a nonlinear activation in between, usually expanding to a larger intermediate size and projecting back.
The feed-forward layer processes each token independently, unlike attention, which mixes information across tokens. In standard configurations it holds most of each layer's parameters, and it is thought to be important for storing factual knowledge.
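The sketch below illustrates the feed-forward sublayer and checks that it really is applied to each token independently. The ReLU activation, the 4x expansion to `d_ff`, and the random weights are assumptions for the example; many models use GELU or gated variants instead.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Two linear maps with a nonlinearity in between, applied row by row."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU, assumed for simplicity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64  # d_ff = 4 * d_model is a common choice
X = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

out = feed_forward(X, W1, b1, W2, b2)
print(out.shape)  # (5, 16)

# Feeding one token on its own gives the same result as feeding it
# as part of the full sequence: no information crosses token boundaries.
alone = feed_forward(X[2:3], W1, b1, W2, b2)
print(np.allclose(out[2], alone[0]))  # True
```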
Layer normalization and residual connections
Each sublayer (attention and feed-forward) is wrapped with a residual connection (the input is added back to the output) and layer normalization. These techniques make training more stable and allow very deep networks to be trained.
Without residual connections, gradients tend to vanish or explode as they pass back through a deep stack of layers. With them, each sublayer only adds an update to its input, so information and gradients can flow directly between distant layers.
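A minimal sketch of the wrapping looks like this. It uses the ordering from the original transformer, where normalization is applied after the residual addition; many later models instead normalize the input before the sublayer. The helper names and the stand-in sublayer (a single small matrix multiply) are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def residual_sublayer(x, sublayer, gain, bias):
    """Post-norm wrapping: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x), gain, bias)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
W = rng.normal(size=(d_model, d_model)) * 0.1  # stand-in for attention or feed-forward
gain, bias = np.ones(d_model), np.zeros(d_model)

out = residual_sublayer(x, lambda h: h @ W, gain, bias)
print(out.shape)  # (4, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))  # True
```

Because the input `x` is added back unchanged, a sublayer that outputs zeros leaves the signal intact apart from normalization, which is what lets information skip past layers.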
Positional encoding
Self-attention treats all tokens symmetrically — without additional information, it cannot tell whether a token appears at position 3 or position 300. Positional encodings add position information to the token embeddings before processing.
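The original transformer used fixed sinusoidal encodings. Some later models replaced these with learned absolute positional embeddings, which cannot represent positions beyond the training length. Most recent models use relative schemes such as RoPE or ALiBi, which are applied inside the attention computation and tend to generalize better to sequences longer than those seen during training.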
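For reference, the fixed sinusoidal scheme can be sketched as follows. The function name is hypothetical, and the code assumes an even `d_model`; even dimensions hold sines and odd dimensions hold cosines, with wavelengths that grow geometrically up to a base of 10000.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings (assumes d_model is even)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

pe = sinusoidal_positions(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)

# The encodings are added to the token embeddings before the first layer:
# X_with_position = X + pe[:X.shape[0]]
```

Every position gets a distinct pattern, so two otherwise identical tokens at different positions no longer look the same to the attention layers.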
The full stack
A transformer model consists of a stack of layers that are identical in structure but have their own weights, each containing a multi-head self-attention sublayer and a feed-forward sublayer. Large models stack dozens of these layers, and the largest around a hundred.
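The size of a model (its number of parameters) is determined primarily by:
- The hidden dimension (embedding size)
- The number of layers
- The intermediate size of the feed-forward layers
- The vocabulary size, through the embedding table
The number of attention heads usually does not change the parameter count, because the heads split the hidden dimension between them.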
GPT-3, for example, has 96 layers with a hidden dimension of 12,288.
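As a back-of-the-envelope check, the sketch below estimates a parameter count from these hyperparameters. The layer count and hidden dimension come from the text above; the feed-forward size of four times the hidden dimension and the vocabulary size of 50,257 are assumptions based on the published GPT-3 configuration. The estimate ignores biases, layer normalization, and positional embeddings, and the function name is hypothetical.

```python
def approx_transformer_params(n_layers, d_model, d_ff, vocab_size):
    """Rough weight count: attention + feed-forward per layer, plus embeddings."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff  # two linear transformations
    embeddings = vocab_size * d_model  # token embedding table
    return n_layers * (attention + feed_forward) + embeddings

total = approx_transformer_params(
    n_layers=96,
    d_model=12288,
    d_ff=4 * 12288,    # assumed: the common 4x expansion
    vocab_size=50257,  # assumed: GPT-3's reported vocabulary size
)
print(f"{total / 1e9:.1f}B")  # 174.6B
```

The result lands close to the 175 billion parameters usually quoted for GPT-3. The number of attention heads does not appear in the formula because, in this standard design, the heads divide the hidden dimension among themselves rather than adding to it.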
Why transformers dominate
Transformers replaced RNNs for almost all sequence modeling tasks because:
- Parallelization: The attention computation over an entire sequence can be done in parallel on GPUs, enabling training on much larger datasets
- Long-range dependencies: Every token can directly attend to every other token in the context window, regardless of distance
- Scalability: The architecture scales predictably with model size and compute
The architecture has also extended beyond text — transformers (or variants of them) are now used for images, audio, video, proteins, and more.
Summary
Transformers use self-attention to compute relationships between all tokens simultaneously, overcoming the sequential limitations of RNNs. Multi-head attention lets different heads capture different types of relationships. Feed-forward layers process tokens independently and are thought to store much of the model's factual knowledge. Residual connections and layer normalization enable training very deep networks. The architecture's parallelizability and scalability are the core reasons it now dominates sequence modeling.