Language models and tokenization - Python Tutorial
From the course: Hands-On AI: Building LLM-Powered Apps
- [Instructor] Before we get our hands dirty building applications, let's first establish a ground-level understanding of what large language models are. But before we go large, let's visit the idea of language models and tokenization.

A language model is a probability distribution over a sequence of words. In plain English, that means a language model takes a context and predicts the next word in the sequence. As an example, when we give ChatGPT the context "Shaping tomorrow," the language model behind it predicts the next word, "with." We concatenate the predicted word onto the context and send "Shaping tomorrow with" to the language model again, and we get back the word "AI." We concatenate them together again, send the result back to the large language model, and get back the word "applications." We repeat the same step once more and get back a word called endoftext. This concludes the sentence generation process, and we end up with the sentence "Shaping tomorrow with AI applications." You might wonder what endoftext is. It is a special token that denotes the conclusion of text generation. (A minimal sketch of this loop appears after the transcript.)

This brings us to tokenization, or how we chunk up text for language models to understand. Tokenization means we split text into chunks called tokens and map each token to a number so we can process them using computers. OpenAI chose to use a very efficient algorithm called Byte-Pair Encoding (BPE) to tokenize text. It was first invented in the 1990s as a text compression algorithm that uses a count-based method as its heuristic. Because of Byte-Pair Encoding, the tokenization results might not make linguistic sense.

To see tokenizers in action, let's go to platform.openai.com/tokenizer. As an example of tokenization, let's check the word "translation." The word translation is tokenized as one single token, translation. The reason is likely that the word translation appeared frequently enough in the training text to warrant its own token. This is very different from splitting the word translation into trans, lat, and ion, so the results can sometimes be counterintuitive. Another example is the word "Samantha." It is tokenized into three separate tokens: S, aman, and tha. This is, again, a byproduct of the tokenization algorithm: either the word Samantha did not occur frequently enough in the training text, or the sub-word aman occurs with enough frequency that Samantha gets broken up into three separate tokens.

Going back to our special endoftext token: there are a number of special tokens that OpenAI uses to process text. For example, three pound signs or four pound signs are both equally useful for delimiting text in prompts. This is important to know because OpenAI charges fees by token consumption. The rule of thumb is that text takes roughly 30% more tokens than words, so if we have 100 words in a prompt, we should expect about 130 tokens. To get the most accurate count, we can use the tiktoken package (see the counting example below).

With this information, we will go through what a large language model is in the next video.
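The generation loop described above can be sketched in a few lines of Python. This is a toy illustration, not a real model call: predict_next_token is a hypothetical stand-in that simply replays the transcript's "Shaping tomorrow" example.

```python
# A minimal sketch of the next-word prediction loop, assuming a
# hypothetical predict_next_token function in place of a real model.
def predict_next_token(context: str) -> str:
    # Toy lookup that mimics the transcript's example sequence.
    continuations = {
        "Shaping tomorrow": "with",
        "Shaping tomorrow with": "AI",
        "Shaping tomorrow with AI": "applications",
        "Shaping tomorrow with AI applications": "<|endoftext|>",
    }
    return continuations[context]

def generate(context: str) -> str:
    while True:
        token = predict_next_token(context)
        if token == "<|endoftext|>":  # special token: stop generating
            return context
        context = f"{context} {token}"  # concatenate and feed back in

print(generate("Shaping tomorrow"))  # Shaping tomorrow with AI applications
```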
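Byte-Pair Encoding works by repeatedly counting adjacent symbol pairs and merging the most frequent pair into a new token. Here is a minimal sketch of a single merge step on a toy string; real tokenizers operate on bytes and apply thousands of merges learned from a large corpus.

```python
from collections import Counter

# One BPE merge step: count adjacent pairs, then merge the most
# frequent pair into a single new symbol.
def most_frequent_pair(symbols: list[str]) -> tuple[str, str]:
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)

def merge(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("abababc")
pair = most_frequent_pair(symbols)  # ('a', 'b') occurs most often
print(merge(symbols, pair))         # ['ab', 'ab', 'ab', 'c']
```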
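We can also reproduce the tokenizer-page examples locally with OpenAI's tiktoken package. Note that the exact splits depend on which encoding you load; using r50k_base here is an assumption, and a different encoding may split Samantha differently than the S, aman, tha shown in the transcript.

```python
import tiktoken  # pip install tiktoken

# Load an encoding; splits vary across OpenAI's encodings.
enc = tiktoken.get_encoding("r50k_base")

for word in ["translation", "Samantha"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]  # text of each token
    print(word, "->", token_ids, pieces)
```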
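Finally, here is a short sketch for getting an exact token count instead of relying on the 30%-more-than-words rule of thumb. encoding_for_model and encode are real tiktoken calls; the model name is just an example.

```python
import tiktoken  # pip install tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Return the exact number of tokens `text` uses for `model`."""
    enc = tiktoken.encoding_for_model(model)  # tokenizer matching the model
    return len(enc.encode(text))

prompt = "Shaping tomorrow with AI applications."
print(count_tokens(prompt))  # exact count; ~1.3x the word count is only a heuristic
```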
Contents
- Language models and tokenization (4m 53s)
- Large language model capabilities (1m 48s)
- Challenge: Introduction to Chainlit (2m 28s)
- Solution: Introduction to Chainlit solution (1m 18s)
- Prompts and prompt templates (3m)
- Obtaining an OpenAI token (1m 20s)
- Challenge: Adding an LLM to the Chainlit app (1m 31s)
- Solution: Adding an LLM to the Chainlit app (3m 20s)
- Large language model limitations (3m 43s)