From the course: Generative AI: Working with Large Language Models
Megatron-Turing NLG Model
- [Instructor] A lot of the research after GPT-3 was released seemed to indicate that scaling up models improved performance. So Microsoft and Nvidia partnered to create the Megatron-Turing NLG model, with a massive three times more parameters than GPT-3. Architecturally, the model uses the transformer decoder just like GPT-3, but it has more layers and more attention heads. For example, GPT-3 has 96 layers, whereas Megatron-Turing NLG has 105. GPT-3 has 96 attention heads, while Megatron-Turing NLG has 128. And finally, Megatron-Turing NLG has 530 billion parameters versus GPT-3's 175 billion. Now, the researchers identified a couple of challenges when working with large language models. Big models are hard to train because their parameters no longer fit in the memory of a single GPU, and even if they did, the sheer number of compute operations required would make training take impractically long. Efficient parallel techniques that scale in both memory and compute are needed to use the full potential of thousands of GPUs. Although the researchers achieved superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and established some new state-of-the-art results, much of their success probably comes down to the supercomputing hardware infrastructure they built, an enormous cluster of 600 Nvidia DGX A100 nodes. To wrap this video up, let's add the Megatron-Turing NLG model to the list so that we can compare it with the other models. The focus of the Megatron-Turing work was largely on hardware infrastructure, and the model was one of the largest dense decoder models, coming in at 530 billion parameters.
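To make the memory argument concrete, here is a minimal back-of-the-envelope sketch (not from the course) comparing the two configurations mentioned above and estimating how far each model's memory footprint exceeds a single 80 GB A100. The 2-bytes-per-parameter figure for fp16 weights and the roughly 16-bytes-per-parameter figure for mixed-precision training state are common rules of thumb, not numbers quoted in the video.

```python
# Rough memory estimate showing why a 530B-parameter model cannot be
# trained on one GPU. Byte counts are common rules of thumb:
#   fp16 weights              ~ 2 bytes per parameter
#   mixed-precision training  ~ 16 bytes per parameter (weights, grads,
#                               optimizer state); an assumption, not a
#                               figure from the course.

GPU_MEMORY_GB = 80  # one NVIDIA A100 80 GB, as used in DGX A100 nodes

models = {
    # name: (layers, attention heads, parameters in billions)
    "GPT-3":               (96, 96, 175),
    "Megatron-Turing NLG": (105, 128, 530),
}

for name, (layers, heads, params_b) in models.items():
    weights_gb = params_b * 2     # fp16 weights only
    training_gb = params_b * 16   # weights + gradients + optimizer state
    print(f"{name}: {layers} layers, {heads} heads, {params_b}B parameters")
    print(f"  fp16 weights alone:    ~{weights_gb:,.0f} GB "
          f"(~{weights_gb / GPU_MEMORY_GB:.0f}x one A100)")
    print(f"  rough training state:  ~{training_gb:,.0f} GB "
          f"(~{training_gb / GPU_MEMORY_GB:.0f}x one A100)")
```

Even the fp16 weights of the 530-billion-parameter model come to roughly a terabyte, far beyond a single GPU, which is why the efficient, scalable parallel techniques mentioned above are needed to spread training across thousands of GPUs.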
Contents
- GPT-3 (4m 32s)
- GPT-3 use cases (5m 27s)
- Challenges and shortcomings of GPT-3 (4m 17s)
- GLaM (3m 6s)
- Megatron-Turing NLG Model (1m 59s)
- Gopher (5m 23s)
- Scaling laws (3m 14s)
- Chinchilla (7m 53s)
- BIG-bench (4m 24s)
- PaLM (5m 49s)
- OPT and BLOOM (2m 51s)
- GitHub models (2m 43s)
- Accessing Large Language Models using an API (6m 25s)
- Inference time vs. pre-training (4m 5s)