

What is a Training Objective?
The Birth of a Foundation Model
Have you ever wondered how a massive AI like GPT-3 learns to be so versatile? It's not magic! It all comes down to its training objective. Think of a training objective as the "rules of the game" for a foundation model: a mathematical function that scores the model's predictions on training data, providing the signal the model uses to learn from a huge amount of data. For instance, GPT-3's objective was to predict the next word in a sentence. It's a simple task, but by doing it over and over again on a colossal dataset from the internet, the model learned the underlying patterns of language so well that it can now perform a mind-boggling variety of tasks. Let's dig into the main goals of these training approaches, the trade-offs involved, and where we're headed next.
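To make this concrete, here is a minimal sketch (in PyTorch) of what a next-word objective looks like in code. The tiny embedding-plus-linear "model" is a stand-in for a real transformer and the token IDs are random; only the loss computation reflects the actual objective.

```python
# Minimal sketch of the next-token objective (illustrative, not GPT-3's actual
# code): the model is scored on how well it predicts token t+1 from tokens 1..t.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 8, 2
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for real text IDs

# A stand-in "language model": any network mapping tokens -> per-position logits.
embed = torch.nn.Embedding(vocab_size, 32)
head = torch.nn.Linear(32, vocab_size)
logits = head(embed(tokens))                             # (batch, seq_len, vocab_size)

# The objective: cross-entropy between the prediction at position t and token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),              # predictions at positions 1..T-1
    tokens[:, 1:].reshape(-1),                           # targets: the *next* token
)
print(loss)  # minimizing this over internet-scale text is the whole "game"
```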
Key Goals for Foundation Model Training
Leveraging Broad, Unlabeled Data
One of the biggest game-changers in AI has been self-supervised learning (SSL). Before this, training a model meant you had to manually label and annotate every single piece of data, which is an impossible task for internet-scale datasets. SSL, however, is a brilliant workaround. It designs training tasks that use the data's own inherent structure to generate a learning signal. So, instead of needing a human to label a picture as "cat," the model might be asked to predict a missing part of the image. This unlocks the power of a vast amount of unlabeled data—from images and videos to text and robotic sensor data—making it the essential first step in creating a foundation model.
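Here is a toy sketch of how a self-supervised task manufactures its own labels from raw text. The masking scheme is deliberately simplified (real systems like BERT mask subword tokens with special rules), but the key point holds: the target comes from the data itself, with no human annotator involved.

```python
# How SSL creates its own labels (a toy sketch): hide a word and make the
# model predict it. No human annotation needed -- the text *is* the label.
import random

def make_masked_example(sentence: str):
    words = sentence.split()
    i = random.randrange(len(words))
    target = words[i]              # the "label" comes straight from the data
    words[i] = "[MASK]"
    return " ".join(words), target

x, y = make_masked_example("the cat sat on the mat")
print(x, "->", y)  # e.g. "the cat [MASK] on the mat" -> "sat"
```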
Domain Completeness: The "Generalist" Mindset
The goal of a foundation model is to be a generalist, not a specialist. We want a model that can handle a wide variety of tasks in a domain, like language. This is where domain completeness comes in. A good training objective should be so comprehensive that solving it requires the model to acquire a broad set of capabilities. For example, to predict the next word accurately, a model needs to understand things like grammar, sentiment, and even context. This is what makes a model versatile enough to be a foundation for many different downstream applications. In contrast, a simple supervised task like classifying sentiment might only teach the model a very narrow set of skills.
Scaling and Compute Efficiency
In the world of foundation models, size matters. As models and datasets get bigger, we have to make sure our training procedures can keep up. This means they must be reliable and efficient at converting data and compute into a capable model. This has led to a fascinating shift in how we evaluate models: it's not just about how good they are, but also how much compute it takes to get them there. Researchers are always looking for ways to make training objectives more efficient; the ELECTRA model, for example, was reported to be roughly 4x more compute-efficient than BERT. The surprising predictability of how a model's capabilities scale with data and compute is a huge help here; it allows developers to make smarter choices instead of just guessing.
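To see why that predictability matters in practice, here is an illustrative sketch of a scaling-law extrapolation. The power-law form mirrors what the scaling-laws literature reports, but the constants below are made up; in practice you would fit them to a series of cheap, small-scale training runs before committing to an expensive one.

```python
# Illustrative sketch of scaling predictability: empirical scaling laws fit
# loss as a power law in compute, L(C) = a * C**(-b) + c.
# NOTE: a, b, c here are invented for demonstration, not real fitted values.
a, b, c = 10.0, 0.05, 1.7

def predicted_loss(compute_flops: float) -> float:
    return a * compute_flops ** (-b) + c

# Fit on cheap runs, then extrapolate before launching the big run.
for flops in (1e18, 1e20, 1e22):
    print(f"{flops:.0e} FLOPs -> predicted loss {predicted_loss(flops):.3f}")
```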
Design Choices and Trade-offs in Self-Supervised Learning
Current SSL methods for training foundation models are incredibly diverse, but they all create prediction problems from unlabeled data. The specific "constraints" they put on the data or the model itself determine what kind of skills the model learns. Let's look at some key design choices and their trade-offs.
Abstraction Level: Raw Bytes vs. Tokens
A core question is what the model's input should be. Should it look at the raw bytes of an image or text? Or should it use a higher-level representation, like "tokens," which are predefined words or sub-words? Raw bytes give the model everything, but they can be computationally expensive and might distract the model with low-level details (like audio compression artifacts). Using tokens, on the other hand, can simplify the task and make it more manageable for models like transformers, but you might lose some potentially useful information, like the character-level nuances needed for things like rhyming.
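A toy comparison makes the trade-off tangible. Whitespace splitting stands in for a real learned subword tokenizer (such as BPE), but the contrast is the same: bytes keep character-level structure visible at the cost of much longer sequences.

```python
# Toy illustration of the bytes-vs-tokens trade-off.
text = "time and rhyme"

byte_seq = list(text.encode("utf-8"))  # lowest abstraction: every byte visible
token_seq = text.split()               # higher abstraction: 3 opaque units

print(len(byte_seq), "bytes vs", len(token_seq), "tokens")
# Bytes preserve that "time" and "rhyme" share an ending; if each token is
# just an ID in a vocabulary, that character-level overlap is invisible to
# the model unless it memorizes it.
```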
Generative vs. Discriminative Models
This is a big one! Generative models are trained to create new data that looks like the original. Autoregressive models (like GPT-3) generate text word by word, while denoising models (like BERT) restore corrupted or masked data. These methods are powerful because they enable flexible interaction with the model—you can ask them to complete a sentence or fix a typo. On the other hand, discriminative models don't create new data. They are trained to classify or compare inputs. For example, they might be trained to determine if two different views of an image are of the same object. While these models are often more efficient for tasks like classification, they can't generate new content. The future likely lies in finding a way to get the best of both worlds.
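For a feel of the discriminative side, here is a minimal sketch of a contrastive (InfoNCE-style) objective, the kind used to decide whether two views belong together. The random vectors stand in for real encoder outputs, and 0.07 is just a commonly used temperature value.

```python
# Minimal sketch of a discriminative (contrastive) objective: given two
# "views" of the same batch of images, classify which pairs match.
import torch
import torch.nn.functional as F

batch, dim = 4, 16
view_a = F.normalize(torch.randn(batch, dim), dim=1)  # stand-in encoder outputs
view_b = F.normalize(torch.randn(batch, dim), dim=1)

logits = view_a @ view_b.T / 0.07        # pairwise similarities / temperature
targets = torch.arange(batch)            # the matching pair sits on the diagonal
loss = F.cross_entropy(logits, targets)  # no pixels generated, only comparisons
print(loss)
```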
Capturing Multimodal Relationships
A truly intelligent model needs to understand the relationships between different types of data, like text and images. But how should it do this? Models like CLIP and ViLBERT offer two different approaches. CLIP trains separate encoders for images and text and then compares their outputs at the end. This is great for tasks like image retrieval. ViLBERT, in contrast, processes images and text together from the very beginning. This is better for tasks like visual question answering where the model needs to reason about how the image and text relate to each other. Multimodality is still a young field, and we have a lot to learn about the best ways to connect these different data types.
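Here is a schematic sketch of the two wiring choices. The random tensors and linear layer are stand-ins for real encoders and fusion networks; what matters is *where* the two modalities first interact.

```python
# Sketch of two multimodal designs: CLIP-style "late fusion" vs
# ViLBERT-style "early fusion". All components are stand-ins.
import torch

img_feats = torch.randn(4, 64)  # pretend image-encoder outputs
txt_feats = torch.randn(4, 64)  # pretend text-encoder outputs

# CLIP-style: encode separately; modalities interact only via a final score.
similarity = img_feats @ txt_feats.T               # (4, 4) image-text match scores
print(similarity.shape)

# ViLBERT-style: combine early so a joint network can reason across both
# modalities (a single linear layer stands in for the joint transformer).
joint = torch.cat([img_feats, txt_feats], dim=1)   # (4, 128)
answer_logits = torch.nn.Linear(128, 10)(joint)    # e.g. VQA answer scores
print(answer_logits.shape)
```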
Paving the Way Forward for Training Algorithms
The Quest for an "Out-of-the-Box" Training Objective
Right now, the training methods for foundation models are very specific to the domain—what works for language doesn't necessarily work for images. This makes it difficult to understand the underlying principles and requires a lot of new research every time we want to apply these models to a new field. A significant milestone would be a single, general training objective that works effectively on any kind of data, from text to medical scans to scientific data. Imagine the possibilities!
Obtaining a Richer Training Signal
It's clear that some training objectives are far more efficient than others. Are there "super-efficient" training methods we don't know about yet? We also need to think about data and training as a partnership. Instead of just passively learning from the data, what if a model could actively seek out or create more informative training examples as it gets better? This could accelerate learning to an incredible degree.
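As a purely hypothetical sketch, active data selection could be as simple as scoring a candidate pool with the current model and training on what it finds hardest. The `compute_loss` scorer here is an assumed placeholder, not a real API; the demo uses sentence length as a stand-in for difficulty.

```python
# Hypothetical sketch of "actively seeking informative examples": instead of
# sampling the next batch uniformly, pick the candidates with the highest loss.
def select_batch(candidates, compute_loss, batch_size=2):
    """Pick the examples the current model finds hardest (highest loss)."""
    return sorted(candidates, key=compute_loss, reverse=True)[:batch_size]

# Toy demo: pretend longer sentences are "harder" for the current model.
pool = ["the cat sat", "a", "colorless green ideas sleep furiously"]
print(select_batch(pool, compute_loss=len))
```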
Goal-Directed Training: Beyond the Basics
The current training of foundation models gives rise to amazing "emergent properties" almost by accident. But what if we could make these models' ability to achieve goals a core part of their training objective? Instead of just predicting the next word, a model could be trained to understand and reliably carry out tasks in a complex world. This could lead to models that not only "know" a lot but can also "do" a lot, like a robotic arm that learns to manipulate objects just from watching videos and interacting with its environment. This is one of the most exciting frontiers in AI.
The Future of Training is Now
The training of foundation models is a dynamic field that's at the heart of the AI revolution. By focusing on leveraging broad data, achieving domain completeness, and maximizing efficiency, we're building models that are more powerful and versatile than ever before. While challenges remain in balancing trade-offs between different approaches and finding more general training objectives, the path forward is clear: to create models that are not just knowledgeable, but truly intelligent.
FAQs on Foundation Model Training
1. What is the main difference between self-supervised learning and supervised learning?
Self-supervised learning (SSL) uses the inherent structure of unlabeled data to create a training signal, while supervised learning requires data to be manually labeled by humans. SSL is a key factor in training large foundation models because it doesn't require a human to annotate massive amounts of data.

2. How does the "abstraction level" affect a model's training?
The abstraction level refers to how the model sees the input data. Using a higher-level abstraction (like tokens for text or patches for images) can make training more efficient by reducing the input size and focusing on more semantic features, but it might also cause the model to miss out on valuable low-level information.

3. Can a foundation model be both generative and discriminative?
While most models are primarily one or the other, researchers are actively exploring ways to combine the benefits of both. The goal is to create a model that is both highly efficient at learning from data (a discriminative trait) and capable of generating new content and interacting flexibly (a generative trait).

4. Why is "domain completeness" so important for a foundation model's training objective?
A training objective that is "domain complete" forces the model to learn a wide range of skills to solve the task. This is crucial for creating a general-purpose foundation model that can be easily adapted to many different tasks, rather than being limited to a single function.

5. What does "goal-directed training" mean for the future of AI?
Goal-directed training means a model would be trained not just to mimic or predict data, but to achieve specific objectives in the real world. This could lead to the development of highly capable AI that can perform complex, multi-step tasks in interactive environments, bringing us closer to truly intelligent agents.
Citation: Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2022). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. Available at: https://arxiv.org/abs/2108.07258