How does ChatGPT know when to stop?
You might’ve heard that LLMs (large language models) work by predicting the next word in a sentence.* That sounds great, but chatbots produce paragraph-long responses, not just single words. How does the model get from a single word to a paragraph, and why does it stop at two paragraphs instead of twenty?
The answer: autoregressive generation. The LLM produces the next word, then the original prompt plus that new word is fed back into the model. This process repeats, one word at a time, until the model decides to stop. That decision comes down to the model selecting a special end-of-sequence (EOS) token.
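To make the loop concrete, here’s a toy sketch in Python. Everything in it is invented for illustration: a real LLM scores tens of thousands of tokens with a neural network that reads the entire text so far, while this toy only looks up the last word in a tiny table. But the stop-when-EOS loop is the same idea.

```python
import random

# Toy stand-in for an LLM: for each word, made-up likelihood scores for
# what might come next. "<eos>" plays the role of the end-of-sequence token.
TOY_SCORES = {
    "the":  {"cat": 0.6, "dog": 0.4},
    "cat":  {"sat": 0.5, "ran": 0.3, "<eos>": 0.2},
    "dog":  {"ran": 0.6, "<eos>": 0.4},
    "sat":  {"down": 0.3, "<eos>": 0.7},
    "ran":  {"away": 0.2, "<eos>": 0.8},
    "down": {"<eos>": 1.0},
    "away": {"<eos>": 1.0},
}

def generate(prompt, max_steps=10):
    words = prompt.split()
    for _ in range(max_steps):
        scores = TOY_SCORES[words[-1]]                   # score every option
        options, weights = zip(*scores.items())
        next_word = random.choices(options, weights)[0]  # pick one
        if next_word == "<eos>":                         # the model decided to stop
            break
        words.append(next_word)                          # feed it back in, repeat
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat down" or "the dog ran"
```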
LLMs don’t choose words directly. The model has a huge list of every word and punctuation mark it can choose from, including the end-of-sequence token. At each step, it produces a likelihood score for every option on that list; higher scores mean an option fits well with the text so far. The model then selects the next word based on these scores. Sometimes it picks the highest-scoring option; sometimes it samples a little more randomly to keep from being repetitive. Whenever it picks EOS, it stops writing.
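Here’s roughly what one step of that choice looks like, again with invented numbers. Turning raw scores into probabilities with a softmax is standard practice; the specific words and scores here are made up.

```python
import math
import random

def softmax(scores):
    # Standard trick: exponentiate each raw score, then normalize so
    # everything sums to 1 and can be treated as probabilities.
    exps = {word: math.exp(s) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Invented scores the model might assign at one step.
scores = {"twenty": 2.1, "<eos>": 1.8, ".": 1.5, "paragraphs": 0.3}
probs = softmax(scores)

greedy = max(probs, key=probs.get)  # always take the top option ("twenty")
sampled = random.choices(list(probs), weights=list(probs.values()))[0]  # weighted dice roll

print(greedy, sampled)
```

If "<eos>" ever comes out of that dice roll, generation ends.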
Think of it like the most socially anxious machine in the world, constantly assessing whether it’s ok to keep talking. Sometimes machine learning engineers add extra rules, like “don’t select the end-of-sequence token until the answer is at least 20 words long.” Together, the likelihood scores and these rules help LLMs produce adequately long answers to any prompt.
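A minimum-length rule like that can be as simple as hiding the EOS option until the answer is long enough. This is one plausible way to write it, using the 20-word threshold from the example above:

```python
def pick_next_word(scores, words_so_far, min_words=20):
    # If the answer is still too short, drop "<eos>" from the options
    # so the model literally cannot choose to stop yet.
    if len(words_so_far) < min_words:
        scores = {w: s for w, s in scores.items() if w != "<eos>"}
    return max(scores, key=scores.get)  # greedy pick, for simplicity

# "<eos>" has the top score, but a two-word answer is too short:
print(pick_next_word({"<eos>": 3.0, "more": 1.0}, ["short", "answer"]))  # -> "more"
```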
*Technical disclosure: LLMs actually work with tokens, and a token can be a whole word, part of a word, a punctuation mark, or a space. Everywhere this post says “word,” you can read “token.” There are lots of different ways to tokenize a sentence, but that’s for a later post.