From the course: Generative AI: Working with Large Language Models

Megatron-Turing NLG Model

- [Instructor] A lot of the research after GPT-3 was released seemed to indicate that scaling up models improved performance. So Microsoft and Nvidia partnered to create the Megatron-Turing NLG model, with roughly three times as many parameters as GPT-3. Architecture-wise, the model uses the transformer decoder just like GPT-3, but it has more layers and more attention heads. For example, GPT-3 has 96 layers, whereas Megatron-Turing NLG has 105. GPT-3 has 96 attention heads, while Megatron-Turing NLG has 128. And finally, Megatron-Turing NLG has 530 billion parameters versus GPT-3's 175 billion. Now, the researchers identified a couple of challenges with working with large language models. Big models are hard to train because their parameters no longer fit in the memory of a single GPU, and even if they did, the sheer number of compute operations required would make training times unrealistically long. Efficient parallelism techniques that scale in both memory and compute are needed to exploit the full potential of thousands of GPUs. Although the researchers achieved superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and established some new state-of-the-art results, a lot of their success probably comes down to the supercomputing infrastructure they built, an enormous 560 Nvidia DGX A100 nodes. To wrap this video up, let's add Megatron-Turing NLG to the list so that we can compare it with the other models. The contribution of this model is mostly around hardware infrastructure, and it was one of the largest dense decoder models, coming in at 530 billion parameters.
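To see where the 530 billion figure comes from, and why a model this size can't fit on a single GPU, here is a minimal back-of-the-envelope sketch in Python. It isn't from the course; it assumes the standard ~12 × layers × hidden² parameter estimate for a decoder-only transformer, together with the hidden sizes reported in the models' papers (12,288 for GPT-3 and 20,480 for Megatron-Turing NLG).

```python
# A rough sketch (not from the course): approximate parameter counts using
# the standard ~12 * layers * hidden_size^2 estimate for a decoder-only
# transformer, ignoring embeddings. Hidden sizes are from the papers.

def approx_params(layers: int, hidden_size: int) -> int:
    """Approximate weight count for a decoder-only transformer."""
    return 12 * layers * hidden_size ** 2

for name, layers, hidden in [("GPT-3", 96, 12288),
                             ("Megatron-Turing NLG", 105, 20480)]:
    params = approx_params(layers, hidden)
    fp16_gb = params * 2 / 1e9  # fp16 weights take 2 bytes each
    print(f"{name}: ~{params / 1e9:.0f}B params, ~{fp16_gb:.0f} GB in fp16")

# Prints roughly:
#   GPT-3: ~174B params, ~348 GB in fp16
#   Megatron-Turing NLG: ~528B params, ~1057 GB in fp16
```

Even the weights alone exceed the 80 GB of a single A100 many times over, before counting optimizer state and activations, which is why the paper combines tensor, pipeline, and data parallelism across thousands of GPUs.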
