From the course: Generative AI: Working with Large Language Models
Transformers: History
- [Instructor] The models based on the original transformer paper from 2017 have evolved over the years. One of the challenges with training large language models in 2017 was that you needed labeled data, which required a lot of time and effort. The ULMFiT model, proposed by Jeremy Howard and Sebastian Ruder, provided a framework where you didn't need labeled data, and this meant large corpora of text, such as Wikipedia, could now be used to train models. In June of 2018, GPT, or Generative Pre-trained Transformer, developed by OpenAI, was the first pre-trained transformer model. It was fine-tuned on various NLP tasks and obtained state-of-the-art results. A couple of months later, researchers at Google came up with BERT, or Bidirectional Encoder Representations from Transformers. We saw a couple of examples of BERT being used in production at Google. In February 2019, OpenAI released a bigger and better version of GPT called GPT-2. This made headlines because the OpenAI team initially didn't want to release the full model because of ethical concerns about misuse. Later that year, Facebook's AI research team released BART and Google released T5. Both of these are large pre-trained models that use the same encoder-decoder architecture as the original transformer. At the same time, the team at Hugging Face bucked the trend: while everyone else was moving to bigger models, they released DistilBERT, a smaller, faster, and lighter version of BERT that retained about 97% of BERT's performance while reducing the size of the BERT model by 40%. In May 2020, OpenAI released the third revision of their GPT models, GPT-3, which is excellent at generating high-quality English sentences. Although OpenAI provided a lot of details in their GPT-3 paper, they didn't release the dataset they used or their model weights. So EleutherAI, a group of volunteer researchers focused on the open-source release of language models and the datasets used to train them, released GPT-Neo, with 2.7 billion parameters, in March of 2021, GPT-J, with 6 billion parameters, a couple of months later, and GPT-NeoX, with 20 billion parameters, in February of 2022. This graph shows the years on the x-axis and the number of parameters on the y-axis. Because the graph almost looks like a straight line, you might think that the number of parameters has increased linearly over the years. But the number of parameters, in billions, is plotted on a log scale on the y-axis, so the scale increases by 10 times each time you move up one unit, and a roughly straight line on a log scale actually means the growth is exponential. BERT has around 110 million parameters, BERT Large has 340 million parameters, and the largest GPT-2 model has 1.5 billion parameters. The biggest GPT-3 model that OpenAI created has 175 billion parameters. And as you can see, over the years the trend has been for language models to get larger.
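The log-scale point is easy to see if you plot the numbers yourself. Below is a minimal sketch, not part of the course, that plots the approximate parameter counts mentioned above against their release years on a logarithmic y-axis. It assumes matplotlib is installed, and the counts are the rounded figures from this lesson rather than exact official sizes.

```python
# Minimal sketch: model release year on the x-axis, parameter count on a
# log-scale y-axis. Counts are the approximate figures from the lesson.
import matplotlib.pyplot as plt

models = {
    "BERT Base":  (2018, 110e6),
    "BERT Large": (2018, 340e6),
    "GPT-2":      (2019, 1.5e9),
    "GPT-3":      (2020, 175e9),
    "GPT-Neo":    (2021, 2.7e9),
    "GPT-J":      (2021, 6e9),
    "GPT-NeoX":   (2022, 20e9),
}

years = [year for year, _ in models.values()]
params = [count for _, count in models.values()]

fig, ax = plt.subplots()
ax.scatter(years, params)
for name, (year, count) in models.items():
    ax.annotate(name, (year, count), fontsize=8)

ax.set_yscale("log")  # each step up the y-axis is a 10x increase
ax.set_xlabel("Year")
ax.set_ylabel("Parameters (log scale)")
ax.set_title("Language model size over time")
plt.show()
```

On this log scale the points fall close to a straight line, which is exactly the near-linear shape described above even though the parameter counts are growing by orders of magnitude.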