From the course: Hands-On AI: Building LLM-Powered Apps

Language models and tokenization

- [Instructor] Before we get our hands dirty building applications, let's first establish a ground-level understanding of what large language models are. But before we go large, let's visit the ideas of language models and tokenization. A language model is a probability distribution over a sequence of words. In plain English, that means a language model takes a context and predicts the next word in the sequence. As an example, when we ask ChatGPT a question with the context "Shaping tomorrow," the language model behind it predicts the next word, "with." We concatenate the predicted word onto the context and send "Shaping tomorrow with" to the language model again, and we get back the word "AI." We concatenate once more, send the text back to the model, and get the word "applications." We repeat the same step and get back a special token called endoftext. This concludes the generation process, and we end up with the sentence "Shaping tomorrow with AI applications." You might wonder what endoftext is. It is a special token that denotes the conclusion of text generation.

This brings us to tokenization, or how we chunk up text for language models to understand. Tokenization means we split text into chunks called tokens and map each token to a number so we can process it with computers. OpenAI uses a very efficient algorithm called Byte-Pair Encoding to tokenize text. It was first invented in the '90s as a text compression algorithm that used a count-based method as its heuristic. Because of Byte-Pair Encoding, the tokenization results might not make linguistic sense.

To see the tokenizer in action, let's go to platform.openai.com/tokenizer. As an example, let's check the word "translation." It is tokenized as one single token, translation. The reason is likely that the word translation appears frequently enough in the training text to warrant its own token. This is very different from splitting the word into trans, lat, and ion, so the results can sometimes be counterintuitive. Another example is the word "Samantha." It is tokenized into three separate tokens: S, aman, and tha. This is, again, a byproduct of the tokenization algorithm: either the word Samantha did not appear frequently enough in the training text, or the sub-word aman appears with enough frequency that Samantha gets broken into three separate tokens.

Going back to our special endoftext token, there are a number of special tokens that OpenAI uses to process text. For example, three pound signs or four pound signs are both equally useful for delimiting text in prompts. This is important to know because OpenAI charges fees by token consumption. The rule of thumb is that a prompt takes roughly 30% more tokens than words, so if we have 100 words in a prompt, we should expect about 130 tokens. To get the most accurate count, we can use the tiktoken package. With this information, we will go through what a large language model is in the next video.
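To make the prediction loop concrete, here is a minimal sketch in Python. The `predict_next_token` function is a hypothetical stand-in for the real language model, hard-coded to reproduce the "Shaping tomorrow" example above; a real model would return a probability distribution over tokens and sample or pick the most likely one.

```python
# Minimal sketch of the next-word prediction loop described above.
# `predict_next_token` is a hypothetical stand-in for the language model.

END_OF_TEXT = "<|endoftext|>"  # special token marking the end of generation


def predict_next_token(context: str) -> str:
    """Hypothetical model call, hard-coded to replay the example above."""
    canned = {
        "Shaping tomorrow": " with",
        "Shaping tomorrow with": " AI",
        "Shaping tomorrow with AI": " applications",
        "Shaping tomorrow with AI applications": END_OF_TEXT,
    }
    return canned[context]


def generate(prompt: str) -> str:
    text = prompt
    while True:
        next_token = predict_next_token(text)
        if next_token == END_OF_TEXT:  # model signals that generation is done
            return text
        text += next_token             # append the prediction and feed it back in


print(generate("Shaping tomorrow"))    # -> Shaping tomorrow with AI applications
```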
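You can also inspect tokenization yourself with the tiktoken package instead of the web tokenizer. This is a small sketch; the choice of the cl100k_base encoding is an assumption, and the exact splits depend on which encoding you load, so they may differ from the examples shown on platform.openai.com/tokenizer.

```python
import tiktoken

# Load an encoding (cl100k_base is used by gpt-3.5/gpt-4 era models).
enc = tiktoken.get_encoding("cl100k_base")

for word in ["translation", "Samantha"]:
    token_ids = enc.encode(word)
    # Decode each token id individually to see how the word was split.
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(word, "->", token_ids, pieces)

# Special tokens such as <|endoftext|> must be explicitly allowed when encoding.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```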
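Finally, here is a rough sketch of comparing word count to token count for a prompt, since OpenAI bills by tokens rather than words. The prompt and the model name passed to encoding_for_model are only examples.

```python
import tiktoken

prompt = "Shaping tomorrow with AI applications is the theme of this course."

# Look up the encoding that matches a given model name (example model).
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

num_words = len(prompt.split())
num_tokens = len(enc.encode(prompt))

print(f"words:  {num_words}")
print(f"tokens: {num_tokens}")  # typically around 30% higher than the word count
```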
