Continuing the series of non-technical explainers, let's figure out how ChatGPT (and in general Large Language Models, or LLMs) work. This is part one of the ChatGPT explainer, with part two coming soon.

It's helpful but not necessary to read the non-technical explanation of AI first. If you feel like it, read that and come back.

Fill In The Blank

Let's play a quick game of fill in the blank:

to be or not to _____

Why did you immediately think of the word be to fill in that blank, instead of the word banana or fish? Because you've seen the phrase many times and your brain has learned the most likely next word is be.

Let's do another one:

rock and ____

Did you think of the word roll? Why? Because that's the word you see most often following the words rock and.

How about this one:

I am very _____

This is less clear - the next word depends on the context.

I just ran a marathon. I am very _____

Perhaps now you would say tired. Or happy, or proud.

Your brain is picking the most likely next word based on the context of the sentence and based on what words it's seen most frequently in that context.

Aliens and CatGPT

In an astonishing turn of events, a race of alien cats has landed on Earth. Since you are the world's foremost cat expert, you have been chosen to communicate with them.

Your boss walks in and drops a massive printout of all the alien cat communications on your desk.

"Speak cat!" she commands.

Unfortunately you don't speak alien cat.

You page through the printouts - it looks like pages and pages of gibberish.

zoog zeeg zag. zoog zeeg kaz. zoog zeeg bah. rag zoog zeeg. rag zoog ko. kaz rag. zap zoog zeeg. ...

Markov to the Rescue

What to do? You remember your friend Markov who's always talking about languages, words, and their relationships. Maybe he can help. You show him the alien cat communications and ask him if he can help you say something in cat.

"Yes!" he exclaims - "We can do this. We will create our response one word at a time, just by picking the best word to say next, and keep going until we have a sentence"

"But how do we know what the best word to say next is? Don't we need to know what the words mean?"

"Nope, we just need to know what word to say next"

"That... doesn't seem like it would work"

"Let me show you", says Markov, and grabs the book on his desk, Alice in Wonderland.

"We just need to know what word to say next. The best word to say next is simply the word that shows up most frequently after the word we're looking at. All we have to do is make a table of how often each word follows another word. We'll call this a frequency table"

He shows you how to create the frequency table - you just note down how many times each word follows another. It takes a while to go through all of Alice in Wonderland and do the counting. You end up with:

Next word after "Alice" → was: 17 times,  and: 16 times,  thought: 12 times,  had: 11 times, ...
Next word after "sat" → down: 9 times,  silent: 2 times,  still: 2 times,  for: 1 time, ...
Next word after "was" → a: 31 times,  the: 20 times,  not: 12 times,  going: 11 times, ...
Next word after "and" → the: 80 times,  she: 52 times,  then: 31 times,  was: 19 times, ...
Next word after "a" → little: 59 times,  very: 25 times,  large: 20 times,  great: 17 times
Next word after "very" → much: 10 times,  soon: 7 times,  curious: 6 times,  glad: 5 times

This table gives you a lot of help in forming sentences in the style of Alice in Wonderland. For example, if you start with the word Alice, then you'd look that up in the above table and see that the most frequent next word is was. And after was you would select a. After a you would get little. You'd end up with Alice was a little.
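If you're curious what the counting and the look-up might look like in code, here is a minimal Python sketch. The function names and the tiny sample text are made up for illustration (it is not the code behind the examples in this post), and it splits words on spaces only, so punctuation stays glued to the words.

```python
from collections import Counter, defaultdict

def build_frequency_table(text):
    """Count how many times each word follows each other word."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        table[current_word][next_word] += 1
    return table

def most_frequent_chain(table, start, max_words):
    """Start from one word and keep appending its most frequent follower."""
    words = [start]
    while len(words) < max_words and words[-1] in table:
        followers = table[words[-1]]
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# A tiny made-up text, standing in for the whole book.
sample = "alice was a little tired and alice was a little late and alice sat down"
table = build_frequency_table(sample)

print(table["alice"].most_common())            # [('was', 2), ('sat', 1)]
print(most_frequent_chain(table, "alice", 4))  # alice was a little
```

Feeding in the full text of a book instead of the sample sentence would give counts like the ones in the table above.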

How well does this work? Let's create some sentences:

the door, when i beg your verdict," it was quite plainly through the bottle, i'm afraid that they had its nest.

You can try this out for yourself on almeopedia.

That's not great. Why is it so nonsensical? Think back to our fill in the blank examples at the start of this post: in order to pick a good next word you need context. If someone asked you what word should appear after rock, you'd have a hard time picking something reasonable, but if they gave you rock and, you'd fairly quickly think of roll.

What would happen if you considered two words instead of a single word for your context? For one thing, creating the frequency table would become harder - instead of a single line for each word, you'd have a line for each word pair. You'd have to gather stats for every two-word sequence that appears in the text, and there are a lot more of those than there are single words.

Next word after "Alice was" → not: 3 times,  beginning: 2 times,  very: 2 times, ...
Next word after "was not" → a: 3 times,  here: 1 time,  going: 1 time, ...
Next word after "not here" → before: 1 time
Next word after "not a" → moment: 2 times,  bit: 2 times,  serpent: 2 times
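In code, moving from one word of context to two is a small change: the thing you look up becomes a pair of words instead of a single word. The sketch below is one possible way to do it, with a hypothetical `build_table` function and a made-up sample sentence rather than the real book.

```python
from collections import Counter, defaultdict

def build_table(text, context_size):
    """Count which word follows each run of `context_size` words."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        table[context][words[i + context_size]] += 1
    return table

# A tiny made-up text, standing in for the whole book.
sample = "alice was not here and alice was not going and alice was very late"
pairs = build_table(sample, 2)

print(pairs[("alice", "was")].most_common())  # [('not', 2), ('very', 1)]
```

Calling `build_table(sample, 1)` gives the single-word table from before, and `build_table(sample, 3)` uses three words of context. The longer the context, the more rows the table needs.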

Let's create a sentence with this and see how it looks:

the hatter, and, burning with curiosity, she decided on going into the garden.

You can try this out for yourself on almeopedia.

That's looking better. It almost makes sense. Let's keep going - instead of two words of context, how about three?

The Hatter’s remark seemed to have no sort of chance of her ever getting out of the water, and seemed to quiver all over with diamonds, and walked two and two, as the soldiers did.


So she swallowed one of the cakes, and was delighted to find that she knew the name of nearly everything there.

Hmm. That last one is a little too good. It turns out a very similar sentence exists in the original Alice in Wonderland text:

So she swallowed one of the cakes, and was delighted to find that she began shrinking directly.

Our simple method of picking the most likely next word can end up memorizing snippets of the text - given a long enough context, the most likely next word is exactly the word that followed that context in the original text.

We can fix this by introducing some randomness - instead of always picking the most likely next word, we can pick somewhat likely next words. This way we're less likely to regurgitate the original text.
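One simple way to pick "somewhat likely" words is to treat the counts as odds: a word seen 10 times is picked twice as often as a word seen 5 times, but the less common word still gets its turn. Here is a small Python sketch of that idea. The function names are made up for illustration, and the counts are the ones for "very" from the Alice in Wonderland table earlier in this post.

```python
import random

# Words seen after "very", with how many times each was seen.
followers = {"much": 10, "soon": 7, "curious": 6, "glad": 5}

def pick_most_likely(followers):
    """Always return the single most frequent next word."""
    return max(followers, key=followers.get)

def pick_somewhat_likely(followers, rng):
    """Return a random next word, favoring the more frequent ones."""
    words = list(followers)
    counts = list(followers.values())
    return rng.choices(words, weights=counts, k=1)[0]

rng = random.Random()  # pass a number, e.g. random.Random(42), for repeatable picks

print(pick_most_likely(followers))  # much
print([pick_somewhat_likely(followers, rng) for _ in range(5)])
```

The first function gives the same answer every time. The second gives a different mix on each run, which is what keeps the generated text from simply replaying the book.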

The good news is our method seems to work for English, so it'll probably work for alien cat language as well.

Aside: You might run into the term "stochastic" as you look into language models - this just means randomly determined. For example, you might hear people argue whether these systems are Stochastic Parrots, implying the systems are simply parroting back the original text that they saw, with some randomness thrown in.

Large Language Models

ChatGPT and many other Large Language Models (LLMs) do essentially what you did above: they examine a very large amount of human communication, gather stats (or probabilities) on what words are most likely to follow other words, and play a continuous game of fill-in-the-blank. You give them some context (your prompt or question), and they create the reply one word at a time by selecting the word most likely to appear next. They respond with a word, then look at their internal stats to pick a word to follow their first word, then a word to follow their second word, and so forth, one word at a time, until they've formed a response.

To be Continued...

In part two of this series we'll look at the problems you'll run into with this method and how deep learning helps you overcome those problems.

Further Reading

If you're interested in this topic you might also enjoy the other posts in the Non-Technical Explainer series.