Understanding tokenizers - Hugging Face Tutorial

From the course: AI Text Summarization with Hugging Face

Previously, we generated summaries using the T5 model with zero-shot learning; that is, we didn't fine-tune the model before using it to generate summaries. We just used the pre-trained model directly. We'll now continue working with the T5 model, except that we'll fine-tune the model on the CNN dataset that we have. Fine-tuning is an approach to transfer learning where we start with the weights of the pre-trained model and then train it on our own dataset. We'll see if the fine-tuned model produces better summaries than just using the pre-trained model.

In order to fine-tune this model, we need to understand in a little more detail the steps involved when we run a Hugging Face pipeline. The steps are: first, preparing and processing the inputs that you pass in; then running the model on those inputs to get predictions; and finally post-processing those predictions so that they're returned to you in a form you can understand. When we fine-tune our model, we'll have to perform each of these steps individually, starting with the first step, tokenization. This is where we pre-process and prepare the input text so that it's converted to a form that can be fed into our model for predictions.

We access the tokenizer for our T5 Small model using the AutoTokenizer class. Calling AutoTokenizer.from_pretrained and passing in the model name will give you the tokenizer for your model. Every NLP model in the Hugging Face library will have a different tokenizer. In order to see what kind of data we feed into our model, let's apply this tokenizer to the text "four-time defending champion". Observe that the result is in the form of input IDs. The tokenizer breaks the input sentence into sub-words. For example, four-time is split into multiple sub-words, and a word like defending might be split into sub-words as well. Every sub-word has a unique ID, and those are the input IDs that you see. The attention mask tells the model which tokens to pay attention to. For example, if you had some text and then added a bunch of padding after that text, the attention mask would tell the model to ignore the padding by having 0s for all the padded tokens.

When you invoke the tokenizer in this form, it actually performs two separate operations. The first is tokenizer.tokenize, and this is what generates the sub-words, or tokens, from the input text. You can see that four is a separate token, the hyphen is a separate token, then time, and that's why we get the six tokens. The second operation is to convert these tokens to IDs. You can call tokenizer.convert_tokens_to_ids, pass in the input tokens, and we get the exact same result that we got earlier: the same IDs we got for the input text when we just invoked the tokenizer directly on the text.
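
As a rough sketch, the steps described above look something like the following in code, assuming the transformers library is installed; the checkpoint name "t5-small" is used here for illustration and is not taken from the transcript itself.

    from transformers import AutoTokenizer

    # Load the tokenizer that matches the pre-trained T5 Small model.
    tokenizer = AutoTokenizer.from_pretrained("t5-small")

    text = "four-time defending champion"

    # Calling the tokenizer directly performs both operations at once:
    # it splits the text into sub-word tokens and maps each token to its unique ID.
    encoded = tokenizer(text)
    print(encoded["input_ids"])       # the integer IDs for the sub-word tokens
    print(encoded["attention_mask"])  # 1 for real tokens; padded positions would be 0

    # The same two operations, performed separately:
    tokens = tokenizer.tokenize(text)              # sub-word tokens: four, the hyphen, time, and so on
    ids = tokenizer.convert_tokens_to_ids(tokens)  # the matching input IDs
    print(tokens)
    print(ids)

Depending on the tokenizer, the direct call may also append special tokens such as an end-of-sequence marker to the input IDs, which is worth keeping in mind when comparing the two outputs.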