A jargon-free explanation of how AI large language models work
Timothy B. Lee and Sean Trott, writing for Ars Technica (Apple News)
Why use such a baroque notation? Here’s an analogy. Washington, DC, is located at 38.9 degrees north and 77 degrees west. We can represent this using a vector notation:
- Washington, DC, is at [38.9, 77]
- New York is at [40.7, 74]
- London is at [51.5, 0.1]
- Paris is at [48.9, -2.4]
This is useful for reasoning about spatial relationships. You can tell New York is close to Washington, DC, because 38.9 is close to 40.7 and 77 is close to 74. By the same token, Paris is close to London. But Paris is far from Washington, DC.
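The "close to" reasoning above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original article: it treats each [degrees north, degrees west] pair as a point on a flat plane and measures straight-line distance. That is only a rough stand-in for real geographic distance, and the helper names `distance` and `closest` are made up for illustration.

```python
import math

# Coordinates from the list above, as [degrees north, degrees west].
# A negative second number means the city lies east of Greenwich.
cities = {
    "Washington, DC": [38.9, 77],
    "New York": [40.7, 74],
    "London": [51.5, 0.1],
    "Paris": [48.9, -2.4],
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return math.dist(a, b)

def closest(name):
    """Return the other city whose vector is nearest to the named city."""
    others = (c for c in cities if c != name)
    return min(others, key=lambda c: distance(cities[name], cities[c]))

if __name__ == "__main__":
    for city in cities:
        print(f"{city} is closest to {closest(city)}")
```

Under these assumptions, Washington, DC, pairs with New York and Paris pairs with London, matching the intuition in the text.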
Language models take a similar approach: Each word vector represents a point in an imaginary “word space,” and words with more similar meanings are placed closer together (technically, LLMs operate on fragments of words called tokens, but we’ll ignore this implementation detail to keep this article a manageable length). For example, the words closest to cat in vector space include dog, kitten, and pet. A key advantage of representing words with vectors of real numbers (as opposed to a string of letters, like C-A-T) is that numbers enable operations that letters don’t.
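The same idea carries over to words. Here is a toy sketch, assuming invented three-number vectors; real language models learn vectors with hundreds or thousands of dimensions, and none of the values below come from an actual model. The function name `nearest` is likewise hypothetical. It illustrates one operation that numbers allow and letters do not: measuring how far apart two words are.

```python
import math

# Made-up word vectors, chosen by hand purely for illustration.
# They are NOT taken from any real language model.
word_vectors = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.90, 0.05],
    "dog":    [0.80, 0.60, 0.20],
    "pet":    [0.70, 0.70, 0.30],
    "car":    [0.10, 0.00, 0.90],
    "banana": [0.00, 0.30, 0.10],
}

def nearest(word, k=3):
    """Return the k words whose vectors lie closest to the given word."""
    target = word_vectors[word]
    others = [w for w in word_vectors if w != word]
    # Sort the remaining words by Euclidean distance to the target vector.
    others.sort(key=lambda w: math.dist(target, word_vectors[w]))
    return others[:k]

if __name__ == "__main__":
    print(nearest("cat"))
```

With these hand-picked numbers, the three nearest neighbors of cat are kitten, dog, and pet, while car and banana land much farther away. Straight-line distance is used here to mirror the city example; practical systems often compare vectors with cosine similarity instead.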
This article is great for understanding how large language models work. The excerpt above, in particular, put a lot of pieces into place for me.
If you have any interest in machine learning, AI, ChatGPT, or LLMs, this is a must-read.