How LLMs Work 5 min read

What Is Attention in AI?

You just learned that Transformers use something called “attention” to understand which parts of a sentence matter most. But what does that actually mean?

Think of it like highlighting while you read

When you read a complex sentence, you do not give every word equal weight. Some words are more important for understanding the meaning than others.

Consider: “The manager told the employee that she needed to submit the report before she left.”

Who does “she” refer to? You automatically focus on the nearby nouns and the structure of the sentence to figure that out. You highlight the relevant parts mentally.

Attention in AI works similarly. For each token the model is processing, it asks: which other tokens in this sequence should I focus on most?

How it works (without the maths)

Imagine every word in a sentence holding up a small sign saying what it is about. Attention lets each word scan all the other signs and decide how much to pay attention to each one.

For the word “it” in “The cat sat on the mat because it was comfortable,” attention would assign high weight to “mat” (what “it” refers to) and lower weight to words like “because” or “sat.”

These weights are learned during training. The model does not have rules programmed in. It figures out which connections matter by seeing billions of examples of language.

Attention: which words matter most?

The cat sat on the mat because it was comfortable

itmathigh attention
itcatmedium attention
itsatlow attention

Multi-head attention

Modern Transformers use something called multi-head attention. Instead of one set of attention weights, the model runs several attention processes in parallel, each one looking at the text from a different angle.

One head might focus on grammatical relationships. Another might track what pronouns refer to. Another might pick up on topic continuity.

The outputs are then combined. This gives the model a richer, more nuanced understanding of the text than any single view could provide.

Why this matters

Attention is what lets language models:

  • Resolve pronouns correctly across long sentences
  • Keep track of topics across paragraphs
  • Understand that “Apple” means the company in a business article and the fruit in a recipe
  • Follow complex instructions that refer back to earlier parts of a prompt

Before attention, models struggled with all of these. Attention is the core reason Transformers outperformed everything that came before.

You now understand this

Attention is the mechanism that lets a Transformer figure out which parts of the input are most relevant to each other. For every token it processes, the model learns to assign importance weights to every other token. This lets it resolve ambiguous references, track topics, and understand context across long stretches of text. Multi-head attention runs several of these processes at once for a richer understanding.