
A jargon-free explanation of how AI large language models work

Created time: Aug 20, 2023 06:55 PM
Author: arstechnica.com
Book Name: A jargon-free explanation of how AI large language models work
Modified: Last updated December 26, 2023

✏️ Highlights

no one on Earth fully understands the inner workings of LLMs. Researchers are working to gain a better understanding, but this is a slow process that will take years—perhaps decades—to complete.
unusual way these systems were developed. Conventional software is created by human programmers, who give computers explicit, step-by-step instructions. By contrast, ChatGPT is built on a neural network that was trained using billions of words of ordinary language.
you’ve probably heard that LLMs are trained to “predict the next word” and that they require huge amounts of text to do this. But that tends to be where the explanation stops.
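To make "predict the next word" a bit more concrete, here is a minimal sketch in Python. The context string, the tiny vocabulary, and the probabilities are all invented for illustration; a real LLM computes a distribution like this over its entire token vocabulary, learned from huge amounts of text.

```python
# Toy illustration of next-word prediction (all values are made up).
# A real LLM produces a probability for every token in its vocabulary.
toy_model = {
    "the cat sat on the": {"mat": 0.55, "sofa": 0.25, "roof": 0.15, "banana": 0.05},
}

def predict_next_word(context: str) -> str:
    """Return the highest-probability continuation for a known context."""
    distribution = toy_model[context]
    return max(distribution, key=distribution.get)

print(predict_next_word("the cat sat on the"))  # -> "mat"
```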
We’ll aim to explain what’s known about the inner workings of these models without resorting to technical jargon or advanced math.
word vectors, the surprising way language models represent and reason about language. Then we’ll dive deep into the transformer,
Language models use a long list of numbers called a “word vector.” For example, here’s one way to represent cat as a vector: [0.0074, 0.0030, –0.0105, 0.0742, 0.0765, –0.0011, 0.0265, 0.0106, 0.0191, 0.0038, –0.0468, –0.0212, 0.0091, 0.0030, –0.0563, –0.0396, –0.0998, –0.0796, …, 0.0002] (The full vector is 300 numbers long—to see it all, click here and then click “show the raw vector.”)
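A rough sketch of how word vectors let a model compare words: two words are judged similar when their vectors point in similar directions (cosine similarity). The 4-number vectors below are invented stand-ins for the 300-number vectors quoted above.

```python
import math

# Invented, low-dimensional stand-ins for real learned word vectors.
word_vectors = {
    "cat": [0.8, 0.1, 0.9, 0.3],
    "dog": [0.7, 0.2, 0.8, 0.4],
    "car": [0.1, 0.9, 0.2, 0.8],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: values near 1.0 mean they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine_similarity(word_vectors["cat"], word_vectors["dog"]))  # high: related words
print(cosine_similarity(word_vectors["cat"], word_vectors["car"]))  # lower: unrelated words
```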
Humans represent English words with a sequence of letters, like C-A-T for “cat.”
Here’s an analogy. Washington, DC, is located at 38.9 degrees north and 77 degrees west. We can represent this using a vector notation:
Washington, DC, is at [38.9, 77]
New York is at [40.7, 74]
London is at [51.5, 0.1]
Paris is at [48.9, –2.4]
This is useful for reasoning about spatial relationships. You can tell New York is close to Washington, DC, because 38.9 is close to 40.7 and 77 is close to 74.
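The same idea in code: each city becomes a 2-number vector, and “close in meaning” becomes “close in space.” The coordinates are the ones quoted above; the straight-line (Euclidean) distance is a simplification, since real latitude/longitude distances need more care.

```python
import math

# City coordinates as quoted in the highlight above: [degrees north, degrees west].
cities = {
    "Washington, DC": [38.9, 77.0],
    "New York": [40.7, 74.0],
    "London": [51.5, 0.1],
    "Paris": [48.9, -2.4],
}

reference = cities["Washington, DC"]
for city, coords in cities.items():
    # math.dist gives the straight-line distance between two vectors.
    print(f"{city}: {math.dist(reference, coords):.1f}")
# New York comes out closest to Washington, DC; London and Paris are much farther away.
```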