Skip to content
Article Issue #5165

Transformer Architecture

What to know

Transformer Architecture is a neural network design introduced in the 2017 paper 'Attention Is All You Need' that processes sequences by computing attention weights between every pair of tokens in parallel; Each transformer layer contains a multi-head self-attention block and a feed-forward network; Builders rarely implement transformers directly; they consume pretrained checkpoints via APIs or frameworks like Hugging Face Transformers

Transformer Architecture, WikiWalls Glossary illustration

« Back to Glossary Index

Transformer Architecture is a neural network design introduced in the 2017 paper ‘Attention Is All You Need’ that processes sequences by computing attention weights between every pair of tokens in parallel. This design enabled much faster training than recurrent networks and scales efficiently with additional compute, making it the dominant architecture for large language models.

How it works

Each transformer layer contains a multi-head self-attention block and a feed-forward network. Self-attention lets every token attend to every other token in the context window, capturing syntax, coreference, and long-range dependencies in a single pass. Stacking many such layers produces increasingly abstract representations that the final layers use to predict output tokens.

Key facts

  • Origin paper: Vaswani et al., Google Brain, NeurIPS 2017.
  • Key operation: Scaled dot-product attention, computed as softmax(QKT / sqrt(d_k))V.
  • Variants: Encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) are the three main families.
  • Scaling: Transformer performance follows predictable scaling laws with respect to parameters, compute, and data.

For builders

Builders rarely implement transformers directly; they consume pretrained checkpoints via APIs or frameworks like Hugging Face Transformers. Understanding the architecture helps diagnose context window limits, attention cost, and why long-context inference is disproportionately expensive compared to short-context calls.

Sources

« Back to Definition Index
Administrator · 41 published guides · Joined 2016

Welcome to wikiwalls

The WikiWalls Journal · Free, weekly

One careful fix in your inbox each Wednesday.

No affiliate links inside the diagnosis. No sponsored "top 10". One careful fix per week — unsubscribe in one click.

No tracking pixels · No spam · Edited by a human.