Neo closes his eyes. A second later he opens them and says:
"I Know Kung Fu!"
The famous scene from The Matrix: the knowledge is uploaded straight into his brain, after which he spars with Morpheus. Before that moment, he had no knowledge of Kung Fu whatsoever.
Introduction
Various research papers suggest this "I Know Kung Fu!" moment also applies to AI — specifically Large Language Models (LLMs). Out of seemingly nowhere, they appear to acquire new capabilities.
Yet in recent discussions, I notice growing scepticism toward researchers who claim that AI learns independently and acquires new skills on its own.
Since the preview of Sora, the debate on X and Reddit has flared up again: "Emergent learning — that's impossible!"
Is it?
"Are AIs really 'learning', or are they running sophisticated tricks we don't yet understand?"
In this article I want to explore that question more deeply. Do LLMs truly have "emergent" capabilities? Or are they running clever tricks that we simply don't understand yet?
Let's start at the beginning.
Part 1: Training — How Do You Teach an LLM?
To answer that, it helps to understand the fundamentals. Hopefully that makes this seemingly magical technology a little more understandable.
So: how does an LLM actually learn?
It all starts with training. Training an LLM involves, simplified, three phases:
- Phase 1: Pre-Training
- Phase 2: Fine-Tuning
- Phase 3: Reinforcement Learning from Human Feedback
Phase 1: Pre-Training
The first phase focuses on giving the model a broad understanding of language. It is about building a foundation of knowledge to draw from. Without basic skills, complex tasks are hard to perform — just like with humans.
This is achieved by training the model on large datasets, such as Common Crawl or Wikipedia, which contain a wide range of text.
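The core of pre-training is a simple objective: given the text so far, predict the next token. A real LLM learns this with a transformer over enormous corpora; the toy sketch below makes the same point with nothing but bigram counts over a ten-word "corpus".

```python
from collections import Counter, defaultdict

# Toy illustration of the pre-training objective: predict the next token.
# Real LLMs learn this with a transformer over trillions of tokens;
# here we simply count which word tends to follow which.
corpus = "the cat sat on the mat the cat saw the dog".split()

next_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_counts[current][nxt] += 1

def predict_next(token):
    # Return the most frequent continuation seen during "training".
    return next_counts[token].most_common(1)[0][0]

print(predict_next("the"))  # → "cat" (seen twice, vs "mat" and "dog" once each)
```

The same principle, scaled up by roughly twelve orders of magnitude in data and parameters, is what pre-training amounts to.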
During pre-training, LLMs use a transformer architecture: a type of deep learning model designed to handle sequential data such as text. The architecture is relatively new; it was introduced in 2017 in the paper "Attention Is All You Need".
Transformers are particularly suited for processing text because they can handle entire sequences of data simultaneously (parallel processing) and identify dependencies within text. There are three key discoveries that together play a crucial role in the success of the Transformer Model:
- Attention Mechanisms
- Positional Encoding
- Advanced Activation Functions
Part 1: Attention Mechanisms
The attention mechanism was introduced by Bahdanau et al. (2014), before the transformer itself. It allows the model to focus on different parts of the input text, mimicking the human ability to pay attention to the important details in a sentence.
This component is crucial for understanding the context and relationships between words. Consider the sentence:
"The cat sat on the mat and watched the dog in the garden."
Attention Mechanisms help the model identify relationships between "cat", "mat", "dog" and "garden". The model understands that the cat is not playing in the garden, but watching the dog who is. Crucial for tasks like summarising, translating or answering questions.
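A rough sketch of what attention computes, based on the scaled dot-product attention from "Attention Is All You Need". The matrices here are made-up toy values, not learned projections:

```python
import numpy as np

# Minimal sketch of scaled dot-product attention (Vaswani et al., 2017).
# Q (queries), K (keys) and V (values) are tiny illustrative matrices;
# in a real transformer they are produced by learned projections.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # each output is a weighted mix of the values

# Three "tokens" with embedding size 4 (values chosen for illustration only).
Q = np.array([[1., 0., 1., 0.], [0., 1., 0., 1.], [1., 1., 0., 0.]])
K = Q.copy()
V = np.array([[1., 0.], [0., 1.], [1., 1.]])
out = attention(Q, K, V)
print(out.shape)  # (3, 2): one context-mixed vector per token
```

Each row of the output blends information from every other token, weighted by relevance. That blending is how "cat" ends up carrying information about "mat", "dog" and "garden".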
Part 2: Positional Encoding
In language, the meaning of a sentence depends on word order. Transformers process the entire input in parallel rather than sequentially — which offers a significant performance improvement. But how do they remember the order?
This is where Positional Encoding comes in. It ensures that the transformer remembers the order of input. Without positional information, a sentence can mean something entirely different:
- "I did not win, but I was happy."
- "I won, but I was not happy."
With positional encoding, models can understand the order — crucial for tasks like translating languages.
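The original transformer paper used fixed sinusoidal positional encodings, which fit in a few lines. The formula follows Vaswani et al.; the resulting vector is added to the token embedding at that position:

```python
import math

# Sinusoidal positional encoding from "Attention Is All You Need":
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
def positional_encoding(pos, d_model):
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Each position gets a unique vector, so the model can tell
# "win" before "not" apart from "not" before "win".
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Because every position maps to a distinct pattern of sines and cosines, the model can recover word order even though it processes all tokens in parallel.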
Part 3: Advanced Activation Functions
A neural network consists of an input layer, an output layer and multiple hidden layers that perform complex calculations. Each layer consists of multiple "neurons" that can be activated depending on the input.
The activation function determines whether a neuron in a hidden layer gets activated. Without an activation function, neural networks could only model linear relationships — while reality is often complex and non-linear.
Take house prices as an example. These are based on location, size, garden, condition, macroeconomic situation — no linear calculation. Thanks to activation functions, neural networks can generate complex outputs: from house price predictions to medical diagnoses based on MRI scans.
Various types of activation functions exist. OpenAI's GPT models use GELU (the Gaussian Error Linear Unit), while Meta's LLaMA-2 uses SwiGLU (the Swish-Gated Linear Unit), a GLU variant proposed in 2020.
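To make the idea concrete, here is a scalar sketch of these functions: GELU (in its common tanh approximation), Swish, and SwiGLU, which gates one projection with a Swish of another. The weights `w` and `v` are toy scalars invented for illustration; in a real transformer they are learned matrices:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # Tanh approximation of GELU, used in GPT-style models.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    # Swish (a.k.a. SiLU): x * sigmoid(x). Smooth and non-linear.
    return x * sigmoid(x)

def swiglu(x, w, v):
    # SwiGLU(x) = Swish(x·W) * (x·V); scalar toy weights stand in
    # for the learned projection matrices of a real transformer.
    return swish(x * w) * (x * v)

# Non-linearity in action: f(a) + f(b) is not f(a + b).
print(gelu(1.0) + gelu(1.0), "vs", gelu(2.0))
```

That inequality is the whole point: without a non-linear activation, stacking layers would collapse into a single linear function, and house prices, let alone language, would be out of reach.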
The result of pre-training is a model with a general understanding of language that can generate text in the right context. At this point, the model still lacks the ability to understand specific instructions or produce responses for particular tasks.
Phase 2: Fine-Tuning (Supervised)
This phase focuses on refining the model's ability to understand specific prompts and respond to them. It is about transitioning from a broad understanding of language to a more focused, task-oriented application — for example, specific to a certain industry or company.
The model is trained on a new, specific dataset. This process teaches the model to produce the right responses expected in reply to a prompt.
The result is a model that not only understands language, but can also generate answers based on acquired knowledge and specific instructions. This makes the model more useful for practical applications.
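In practice, a supervised fine-tuning dataset is little more than a list of prompt/response pairs. The JSONL layout below is illustrative, not any specific vendor's schema:

```python
import json

# A fine-tuning dataset is essentially prompt/response pairs.
# This JSONL layout is a common pattern, shown here for illustration.
examples = [
    {"prompt": "Summarise: The meeting moved to Tuesday.",
     "response": "Meeting rescheduled to Tuesday."},
    {"prompt": "Translate to French: Good morning.",
     "response": "Bonjour."},
]

with open("finetune_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one training example per line
```

The model is then trained to reproduce each `response` given its `prompt`, which is what turns a raw text predictor into an instruction follower.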
Phase 3: Reinforcement Learning from Human Feedback 🍪
The final phase, Reinforcement Learning from Human Feedback (RLHF), focuses on aligning the model's outputs more closely with human norms and values.
People evaluate outputs based on criteria such as helpfulness and accuracy. These evaluations are used to train the model to recognise which types of responses are preferred.
The evaluation uses a reward model — so called because it works by giving the model "rewards" for answers that closely match human preferences. Sit! Good boy! 🍪🐶
For each action or response, a numerical score — the reward — is assigned. A higher score indicates better alignment with the desired outcome. The model uses these scores to learn which types of responses were positively rated.
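A heavily simplified sketch of that scoring step: a reward model assigns each candidate answer a number, and higher-scoring candidates are preferred. The "reward model" below is a toy heuristic invented for illustration, not a trained model:

```python
# Toy stand-in for an RLHF reward model: score candidate answers,
# then prefer the highest-scoring one. The scoring rule is a made-up
# heuristic for illustration; real reward models are trained networks.
def reward_model(answer: str) -> float:
    score = 0.0
    if "please" in answer.lower():
        score += 1.0           # reward politeness
    score -= 0.01 * len(answer)  # penalise rambling
    return score

candidates = [
    "Please find the summary below.",
    "Here is an extremely long and rambling answer that never quite gets to the point...",
]
best = max(candidates, key=reward_model)
print(best)  # the politely concise candidate wins
```

In real RLHF the policy model's weights are then updated (e.g. with PPO) so that high-reward answers become more likely, rather than simply picking the best of a fixed list.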
Models Are Getting Bigger and Bigger...
Language models learn through pattern recognition. This involves both neurons and parameters. Neurons are the basic units of the neural network; the learning itself happens in two kinds of trainable parameters:
- Weight: Determines the influence of one neuron on another. Positive weight increases the activation of the next neuron; negative weight can reduce it.
- Bias: A parameter added to the input of the activation function, allowing a neuron to be activated even with low input values.
The number of parameters in a neural network is the sum of the number of weights (one per connection between neurons) and the number of biases (one per neuron). GPT-4 is rumoured to have around 1.76 trillion parameters; the largest LLaMA-2 model has 70 billion.
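The count itself is simple arithmetic: in a fully connected network, each layer contributes inputs × outputs weights plus one bias per output neuron. A small sketch:

```python
# Parameter count = weights + biases. For a fully connected network,
# each layer contributes (inputs * outputs) weights plus one bias per output.
def count_parameters(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights on the connections
        total += n_out         # one bias per output neuron
    return total

# A tiny network: 4 inputs, one hidden layer of 8 neurons, 2 outputs.
print(count_parameters([4, 8, 2]))  # (4*8 + 8) + (8*2 + 2) = 58
```

Scale the same arithmetic up to thousands of neurons per layer and dozens of layers, and the billions add up quickly.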
No wonder Sam Altman is looking to raise $7 trillion in investment...
Part 2: Kung Fu, Out of Nowhere! Emergent Abilities
Imagine a child trying to draw a person.
In their first year: nothing special. The second year: a small improvement. The third year: another small improvement. But suddenly, at age four: a masterpiece!
So we train the model and slowly it acquires skills. Like a child who slowly learns to draw better.
But alongside these gradual improvements, scaling LLMs also produces something stranger: behaviour the model was never trained for. As LLMs scale, new capabilities suddenly appear, as if they get "unlocked". Imagine a four-year-old who has never drawn before picking up a pencil and producing... a Rembrandt.
These capabilities appear rapidly and unpredictably, as if emerging from nowhere.
Researchers call these capabilities "emergent". The definition:
Emergent: A capability that is not present in a small model, was not trained for, but is 'unlocked' in a larger model.
Two criteria must be met for something to qualify as "emergent":
- Sharpness: A seemingly instantaneous transition from "capability not present" to "capability present."
- Unpredictability: Seemingly unforeseen and not predictable when scaling the model.
At this point, 137 emergent capabilities have been discovered in larger models. One example is recognising films from emoji:
- 🧛♂️🦇📚 = "Dracula"
- 🕷️👨🦰🏙️ = "Spider-Man"
This scaling not only improves performance and efficiency, but also reveals capabilities that were previously hidden. These capabilities surface when a certain scale is reached.
Or so people think.
A Different Perspective
In the paper "Are Emergent Abilities of Large Language Models a Mirage?" by Rylan Schaeffer, Brando Miranda and Sanmi Koyejo of Stanford University, this is called into question.
The paper challenges the conventional understanding of emergent abilities, proposing that these phenomena are more related to the methods and visualisations used than to new intrinsic properties of the models.
The authors show that what has been observed as emergent could actually be predictable, gradual improvements — masked by the choice of indicators and metrics.
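Their core argument can be reproduced in a few lines. Suppose per-token accuracy grows smoothly with scale, but we score with an all-or-nothing exact-match metric, where the answer only counts if every token is right. The smooth curve then looks like a sudden jump. The numbers below are made up for illustration:

```python
# Sketch of the Schaeffer et al. argument: a capability that improves
# smoothly per token can look "emergent" under an all-or-nothing metric.
# The per-token accuracies below are invented for illustration.
seq_len = 10  # the answer is 10 tokens long
per_token_accuracies = [0.5, 0.7, 0.85, 0.95, 0.99]  # grows smoothly with scale

for p in per_token_accuracies:
    exact_match = p ** seq_len  # whole answer counts only if every token is right
    print(f"per-token {p:.2f} -> exact-match {exact_match:.3f}")
```

Per-token accuracy nearly doubles across these steps, yet exact-match stays near zero until the very end and then shoots up: a "sharp, unpredictable" jump produced entirely by the choice of metric.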
This insight has important implications for LLM research, particularly when evaluating a model's capacity. It also unmasks some of the magical abilities attributed to LLMs. Additionally, it challenges the idea that the parameter race is the only way to create better language models — that probably doesn't have to be the case.
Conclusion: Is It Emergent?
Is the emergent capability of Large Language Models a direct result of their size?
The answer may be: maybe.
Recent research suggests that the emergence of such capabilities cannot be attributed exclusively to model size. It may also lie in factors such as the metrics used to measure performance or the specific visual representations chosen by researchers.
Nevertheless, the size of LLMs plays a crucial role in their applicability and impact on our society. With the current state of technology, larger models show significantly more capabilities than their smaller counterparts.
However, the precise dynamics of emergent capabilities remain unclear for now. What is clear is the need for better standardisation in measuring the performance and capabilities of these models.
A standardised framework for uniformly testing capabilities would be a step in the right direction.
There are several essential considerations regarding the "emergent" capabilities of large language models:
- It is unclear from which scale new capabilities will appear
- The precise level of these capabilities remains unknown until they manifest
- The full spectrum of potential capabilities is currently still unclear
Sources
- Vaswani et al. (2017). Attention Is All You Need. arxiv.org/abs/1706.03762
- Schaeffer et al. (2023). Are Emergent Abilities of Large Language Models a Mirage? arxiv.org/abs/2304.15004
- AssemblyAI. Emergent Abilities of Large Language Models. assemblyai.com
- Wei, J. List of Emergent Capabilities. jasonwei.net
- Deepgram. Activation Functions. deepgram.com
Further reading: Why the AI Act Matters — on the regulation that determines how we deploy AI responsibly.