Large language models (LLMs) are deep learning algorithms that can recognize, summarize, translate, predict, and generate content using very large datasets.
Most large language models are built on a class of deep learning architectures called transformer networks. A transformer model is a neural network that learns context and meaning by tracking relationships in sequential data, like the words in this sentence.
A transformer is made up of multiple transformer blocks, also known as layers. For example, a transformer has self-attention layers, feed-forward layers, and normalization layers, all working together to decipher input and predict streams of output at inference. Layers can be stacked to build deeper transformers and more powerful language models. Transformers were first introduced by Google in the 2017 paper “Attention Is All You Need.”
Figure 1. How transformer models work.
There are two key innovations that make transformers particularly adept for large language models: positional encodings and self-attention.
Positional encoding embeds the order in which tokens occur within a given sequence. Essentially, instead of feeding the words of a sentence into the neural network one after another, positional encoding allows the words to be fed in non-sequentially.
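As a rough sketch, the sinusoidal positional encoding described in “Attention Is All You Need” can be computed in a few lines of NumPy. The function name and the dimensions used below are illustrative, not any particular framework's implementation:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Each pair of dimensions oscillates at a different frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])     # even dims use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])     # odd dims use cosine
    return encoding

# These encodings are added to token embeddings, so even when words are
# processed in parallel the model can still recover word order.
pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
```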
Self-attention assigns a weight to each part of the input data while processing it. This weight signifies the importance of that input in the context of the rest of the input. In other words, models no longer have to dedicate the same attention to all inputs and can focus on the parts of the input that actually matter. This representation of which parts of the input the network needs to pay attention to is learned over time as the model sifts through and analyzes mountains of data.
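A minimal NumPy sketch of scaled dot-product self-attention makes these weights concrete. The weight matrices here are random stand-ins for parameters a real model would learn, and the shapes are illustrative:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a (seq_len, d_model) input.

    Returns the attended output and the attention weights themselves.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # Each row of `weights` sums to 1: how much one token attends to every other.
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 tokens, 8-dim embeddings
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, wq, wk, wv)
```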
In conjunction, these two techniques allow the model to analyze the subtle ways in which distinct elements influence and relate to each other over long distances, non-sequentially.
The ability to process data non-sequentially enables the decomposition of a complex problem into multiple, smaller, simultaneous computations. Naturally, GPUs are well suited to solving these types of problems in parallel, allowing for large-scale processing of large-scale unlabelled datasets and enormous transformer networks.
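A toy illustration of this decomposition: the attention scores for every token can be computed in one matrix multiply rather than a token-by-token loop, and the two approaches give identical results. The arrays below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=(6, 8))  # queries for 6 tokens
k = rng.normal(size=(6, 8))  # keys for 6 tokens

# Sequential: score one token at a time, as a recurrent model is forced to.
sequential = np.stack([q[i] @ k.T for i in range(q.shape[0])])

# Parallel: the same scores as a single matrix multiply, which a GPU
# can evaluate for every position simultaneously.
parallel = q @ k.T
```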
Historically, AI models had been focused on perception and understanding.
However, large language models, which are trained on internet-scale datasets with hundreds of billions of parameters, have now unlocked an AI model’s ability to generate human-like content.
Models can read, write, code, draw, and create in a credible fashion and augment human creativity and improve productivity across industries to solve the world’s toughest problems.
The applications of LLMs span a plethora of use cases. For example, an AI system can learn the language of protein sequences to propose viable compounds that help scientists develop groundbreaking, life-saving vaccines.
Or computers can help humans do what they do best—be creative, communicate, and create. A writer suffering from writer’s block can use a large language model to help spark their creativity.
Or a software programmer can be more productive, leveraging LLMs to generate code based on natural language descriptions.
Advancements across the entire compute stack have allowed for the development of increasingly sophisticated LLMs. In June 2020, OpenAI released GPT-3, a 175 billion-parameter model that generated text and code with short written prompts. In 2021, NVIDIA and Microsoft developed Megatron-Turing Natural Language Generation 530B, one of the world’s largest models for reading comprehension and natural language inference, with 530 billion parameters.
As LLMs have grown in size, so have their capabilities. Broadly, LLM use cases for text-based content can be divided as follows:
Generation (e.g., story writing, marketing content creation)
Summarization (e.g., legal paraphrasing, meeting notes summarization)
Translation (e.g., between languages, text-to-code)
Classification (e.g., toxicity classification, sentiment analysis)
Chatbot (e.g., open-domain Q+A, virtual assistants)
Enterprises across the world are starting to leverage LLMs to unlock new possibilities.
Large language models are still in their early days, and their promise is enormous; a single model with zero-shot learning capabilities can tackle a wide range of problems by understanding and generating human-like text on demand. The use cases span every company, every business transaction, and every industry, allowing for immense value-creation opportunities.
Large language models are trained using unsupervised learning. With unsupervised learning, models can find previously unknown patterns in data using unlabelled datasets. This also eliminates the need for extensive data labeling, which is one of the biggest challenges in building AI models.
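A sketch of the self-supervised objective behind this: the training labels are simply the input sequence shifted by one token, so no human labeling is required. The token IDs, vocabulary size, and random logits below are made up for illustration:

```python
import numpy as np

# Self-supervised language modeling: the "labels" are just the input
# shifted by one position, so no human annotation is needed.
tokens = np.array([11, 42, 7, 99, 3])      # a toy token-ID sequence
inputs, targets = tokens[:-1], tokens[1:]  # predict each next token

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Average negative log-likelihood of the target tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(2)
logits = rng.normal(size=(len(inputs), 100))  # scores over a 100-token vocab
loss = cross_entropy(logits, targets)          # what training minimizes
```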
Thanks to the extensive training process that LLMs undergo, the models don’t need to be trained for any specific task and can instead serve multiple use cases. These types of models are known as foundation models.
The ability of a foundation model to generate text for a wide variety of purposes without much instruction or training is called zero-shot learning. Variations of this capability include one-shot and few-shot learning, wherein the foundation model is fed one or a few examples illustrating how a task can be accomplished, helping it perform better on select use cases.
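These variations differ only in how the prompt is assembled. A minimal, hypothetical helper (not any vendor's API) makes the distinction concrete: zero in-context examples yields a zero-shot prompt, one example yields one-shot, and so on:

```python
def build_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a zero-, one-, or few-shot prompt from in-context examples."""
    lines = [task]
    for text, label in examples:  # an empty list makes this zero-shot
        lines.append(f"Input: {text}\nOutput: {label}")
    lines.append(f"Input: {query}\nOutput:")  # model completes from here
    return "\n\n".join(lines)

# Few-shot sentiment classification: two examples, then the real query.
prompt = build_prompt(
    "Classify the sentiment of each input as positive or negative.",
    [("I loved this movie!", "positive"), ("Terrible service.", "negative")],
    "The keynote was inspiring.",
)
```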
Despite the tremendous capabilities of zero-shot learning with large language models, developers and enterprises often need these systems to behave in a particular, predictable manner. To deploy large language models for specific use cases, the models can be customized using several techniques to achieve higher accuracy. Some techniques include prompt tuning, fine-tuning, and adapters.
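As one sketch of the adapter approach: a small bottleneck module is inserted alongside the frozen base model, and only its two projection matrices are trained. The shapes and zero initialization below are illustrative assumptions, not a specific library's implementation:

```python
import numpy as np

def adapter(x: np.ndarray, w_down: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.

    Only w_down and w_up are trained; the frozen base model's weights are
    untouched, which makes per-task customization cheap.
    """
    h = np.maximum(0.0, x @ w_down)  # ReLU bottleneck
    return x + h @ w_up              # residual connection preserves base behavior

d_model, bottleneck = 16, 4
rng = np.random.default_rng(3)
x = rng.normal(size=(5, d_model))                # 5 token activations
w_down = rng.normal(size=(d_model, bottleneck)) * 0.01
w_up = np.zeros((bottleneck, d_model))           # zero-init: starts as identity
y = adapter(x, w_down, w_up)                     # a no-op until trained
```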
Figure 2. Image shows the structure of encoder-decoder language models.
There are several classes of large language models, each suited to different types of use cases.
The significant capital investment, large datasets, technical expertise, and large-scale compute infrastructure necessary to develop and maintain large language models have been a barrier to entry for most enterprises.
Figure 3. Compute required for training transformer models.
NVIDIA offers tools to ease the building and deployment of large language models.
Despite the challenges, the promise of large language models is enormous. NVIDIA and its ecosystem are committed to enabling consumers, developers, and enterprises to reap the benefits of large language models.