What Is a Large Language Model?

A clear explanation of how large language models work — from tokens and transformers to training and inference — without the hype.

Explicor
5 min read

Large language models (LLMs) are the technology behind systems like ChatGPT, Claude, and Gemini. Understanding what they actually are — rather than what they are marketed as — requires looking at a few key concepts: how text is represented, how models learn from data, and what happens when you ask one a question.

What is a language model?

A language model is a system that assigns probabilities to sequences of text. Given the words "The cat sat on the", a language model can estimate how likely different words are to come next: "mat" might be very probable, "quantum" much less so.
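To make "assigning probabilities to the next word" concrete, here is a minimal sketch of the simplest possible language model: one that counts which word follows which. It is nothing like an LLM internally; the toy corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_counts(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the raw follower counts for `prev` into probabilities."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # this word was never seen with a follower
    return {word: n / total for word, n in followers.items()}

# Hypothetical toy corpus, made up for this example.
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat sat on the sofa",
]

counts = train_bigram_counts(toy_corpus)
probs = next_word_probs(counts, "the")
print(probs)
```

In this toy corpus "mat" gets a third of the probability after "the", while "quantum" gets none because it never appears. An LLM answers the same question, but conditions on the whole preceding context rather than on a single word.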

This is not knowledge in any deep sense. It is pattern matching across vast amounts of text. But it turns out that doing this at scale, with enough data and a powerful enough architecture, produces systems that can do a surprising range of useful things.

Tokens, not words

LLMs do not process words directly. They process tokens: chunks of text that may be whole words, parts of words, or punctuation. The word "running" might be one token; "unprecedented" might be split into "un", "prece", "dented", with the exact split depending on the tokenizer. Working at a sub-word level lets a model represent words it has never seen as combinations of known pieces, which helps with new words and multiple languages.
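The splitting described above can be sketched with a greedy longest-match tokenizer. This is an assumption made for brevity: production tokenizers typically learn their vocabulary from data (for example with byte-pair encoding), and the tiny vocabulary here is hypothetical.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing matched: emit a single character so we always advance.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary, chosen to reproduce the example in the text.
toy_vocab = {"running", "run", "ning", "un", "prece", "dented"}

print(tokenize("unprecedented", toy_vocab))  # ['un', 'prece', 'dented']
print(tokenize("running", toy_vocab))        # ['running']
```

Because unknown text falls back to smaller pieces, the tokenizer never fails on a new word; it just produces more tokens for it.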

A typical LLM has a vocabulary somewhere between about 30,000 and 250,000 tokens. The model learns a numerical representation (called an embedding) for each token: a vector whose position in a high-dimensional space captures the token's relationship to other tokens.
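An embedding table is, mechanically, just a lookup from token ID to vector. The sketch below uses random vectors and a hypothetical four-dimensional space; in a real model the vectors have hundreds or thousands of dimensions, and it is training that moves related tokens close together.

```python
import random

def build_embedding_table(vocab, dim, seed=0):
    """Assign each token an integer ID and a random starting vector."""
    rng = random.Random(seed)
    token_to_id = {tok: i for i, tok in enumerate(sorted(vocab))}
    table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in token_to_id]
    return token_to_id, table

def embed(tokens, token_to_id, table):
    """Replace each token with its vector from the table."""
    return [table[token_to_id[tok]] for tok in tokens]

# Hypothetical vocabulary and dimension, for illustration only.
token_to_id, table = build_embedding_table({"un", "prece", "dented"}, dim=4)
vectors = embed(["un", "prece", "dented"], token_to_id, table)
print(len(vectors), len(vectors[0]))
```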

The transformer architecture

Modern LLMs are built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need." The key innovation was building the whole model around the attention mechanism: rather than processing text one token at a time, as earlier recurrent networks did, transformers can look at all tokens in a context simultaneously and learn which tokens are most relevant to each other.

When the model processes a sentence like "The bank by the river is muddy", attention allows it to connect "bank" with "river" and "muddy", resolving a meaning that would otherwise be ambiguous.
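The core computation can be sketched as scaled dot-product attention, the form used in the 2017 paper. This is a single head in plain Python with hand-picked toy vectors; in a real transformer the queries, keys, and values are learned projections of the token embeddings, and text-generating models also mask out later tokens, both of which are omitted here.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How relevant is each key to this query?
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the relevance-weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors, invented for illustration: the query aligns with the first key.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = attention(queries, keys, values)
print(weights[0], outputs[0])
```

The query points in the same direction as the first key, so the first value vector dominates the output. That weighting is the mechanism that would let "bank" draw more heavily on "river" than on "the".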

Transformers consist of many layers, each combining this attention operation with a feed-forward network and passing its results to the next layer. Large models have dozens to over a hundred layers and billions of parameters, the numerical weights that encode everything the model has learned.

Training: learning from text

Training a large language model is conceptually simple: show the model text, ask it to predict the next token, compare its prediction to the actual token, and adjust the parameters to reduce the error. Repeat this, in batches, across trillions of tokens.
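The loop described above can be sketched end to end on a toy problem. The "model" here is an assumption made to keep the example short: a table of scores indexed by the previous token, not a transformer. The predict, compare, and adjust steps are the same in spirit, using cross-entropy loss and plain gradient descent.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(token_ids, vocab_size, steps=200, lr=0.5):
    """Fit a table of scores W[prev][next] by gradient descent."""
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(token_ids, token_ids[1:]))
    losses = []
    for _ in range(steps):
        total_loss = 0.0
        for prev, target in pairs:
            probs = softmax(W[prev])                # predict the next token
            total_loss += -math.log(probs[target])  # compare with the actual token
            for j in range(vocab_size):             # adjust the parameters
                # Gradient of cross-entropy with respect to each score.
                grad = probs[j] - (1.0 if j == target else 0.0)
                W[prev][j] -= lr * grad
        losses.append(total_loss / len(pairs))
    return W, losses

# Hypothetical token IDs for "the cat sat on the mat":
# the=0, cat=1, sat=2, on=3, mat=4
token_ids = [0, 1, 2, 3, 0, 4]
W, losses = train(token_ids, vocab_size=5)
print(losses[0], losses[-1])
```

The average loss falls as training proceeds. After "cat" the model becomes nearly certain of "sat", while after "the" it stays split between "cat" and "mat", because the data itself is ambiguous there.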

This process is usually called self-supervised learning, because the "labels" are simply the next tokens in the text itself rather than annotations added by people. It is why LLMs are good at generating coherent text: they have been trained to do exactly this, at massive scale.

The training data typically includes books, websites, code repositories, scientific papers, and other text corpora. The choice of training data significantly shapes what the model knows and how it behaves.

Inference: generating text

When you ask an LLM a question, it generates a response token by token. At each step, it calculates a probability distribution over all possible next tokens, selects one (using some sampling strategy), and adds it to the context before predicting the next token.
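The generation loop itself is short. In the sketch below, `toy_next_token_logits` is a hypothetical stand-in for a trained model, and temperature sampling is one common sampling strategy among several; neither is specified by the article.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Probabilities from scores; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_token_logits, prompt_ids, max_new_tokens,
             temperature=1.0, eos_id=None, rng=None):
    """Generate tokens one at a time, feeding each back into the context."""
    rng = rng or random.Random(0)
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(context)       # score every possible next token
        probs = softmax(logits, temperature)      # scores -> probability distribution
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        context.append(next_id)                   # extend the context
        if next_id == eos_id:                     # optional end-of-sequence token
            break
    return context

def toy_next_token_logits(context, vocab_size=5):
    """Hypothetical stand-in for a model: strongly prefers last token + 1."""
    preferred = (context[-1] + 1) % vocab_size
    return [5.0 if i == preferred else 0.0 for i in range(vocab_size)]

print(generate(toy_next_token_logits, [0], 3, temperature=0.01))
```

A low temperature makes the most likely token win almost every time; a higher one spreads the choice across less likely tokens. Nothing in the loop checks whether the output is true, only whether it is probable.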

This is why LLMs can occasionally produce plausible-sounding but incorrect information — they are generating statistically likely sequences of tokens, not retrieving facts from a database. The model has no separate memory of "true" and "false" — it has weights that encode patterns from training data.

What LLMs are not

It is worth being precise about what LLMs are not:

  • They are not reasoning engines in any formal sense. They can produce reasoning-like outputs, but this is pattern matching, not logical deduction.
  • They do not have access to real-time information (unless given tools to retrieve it).
  • They do not "know" things the way humans know things. They encode statistical patterns from text.

This does not make them useless — quite the opposite. But understanding their actual nature helps set appropriate expectations for what they can and cannot do reliably.

Scale and capability

One of the most striking findings in LLM research is that capability improves predictably with scale — more parameters, more training data, and more compute tend to produce better models. This empirical relationship, described by "scaling laws," has driven the rapid increase in model size over the past several years.
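One common way to write such a relationship is as a power law. The symbols below are assumptions introduced here, not taken from this article: L is the model's loss on held-out text, N the parameter count, D the number of training tokens, and N_c, D_c, α_N, α_D are constants fitted to experiments.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
```

Each expression is meant to hold when the other factor is not the bottleneck. Under this form, multiplying N or D by a fixed factor shrinks the loss by a fixed ratio, which is what makes the improvement predictable.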

Capabilities that seemed far off a few years ago — coherent long-form writing, code generation, complex question answering — have emerged as models scaled up. This has led to genuine surprises about what statistical pattern matching at scale can achieve.

Summary

Large language models are neural networks, built on the transformer architecture, trained to predict the next token in a sequence. They work with tokenized text, learn statistical patterns from enormous datasets, and generate text by sampling from probability distributions. They are not reasoning systems or knowledge retrieval systems, but their scale gives them capabilities that overlap significantly with both.
