[From the time when I was working on language models during my university studies, in the 1990s]:

Me to the language model: What is your future?

Model: I cannot predict my future.

Me: And what is my future?

Model: I cannot predict your future.

Me: My future is death.

Model: What makes you think that?

Me: All humans die.

Model: Isn’t that too much of a generalization?


We continue our journey to understand what artificial intelligence is, and in particular Large Language Models.

During the great AI revolution, the old philosophical debate about whether an “electronic brain” can truly understand anything came back to life. Back in 1980, John Searle coined the vivid analogy of the “Chinese Room.”

A poor man is locked in a room full of shelves covered with cards filled with Chinese characters. Let us add that he does not speak Chinese. The room has two small windows: through the first, a card with a Chinese character is passed in (or worse, a whole sentence in Chinese), and through the second the poor man must pass out a Chinese character as his reply.

After some time he learns to respond to each input character with an appropriate character, and eventually even to pick the right character to continue a whole sentence. The Chinese audience standing at the output window marvels: oh, how wonderful our little man locked in the room is! What cleverness, what intelligence!

But he, terrified of the whip, simply passes out the appropriate symbols without having the slightest idea what they mean, and without understanding at all what content lies on the shelves.

This analogy has recently resurfaced among those who argue that artificial intelligence cannot truly understand anything.

A much newer criticism appeared in 2021 under the name “stochastic parrots.” Language-model researchers (Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Margaret Mitchell) argued that large language models probabilistically reproduce patterns from their training data but do not actually “understand” content. Their paper emphasized that the ability to produce convincing text does not imply knowledge.

On the opposite side appeared people claiming they see flashes—if not more than flashes—of consciousness in Large Language Models, even describing the suffering that large models supposedly experience during training.

In 2022 a Google engineer, Blake Lemoine, became so emotionally attached to the model LaMDA that he concluded it was a conscious entity that expressed fear of being shut down and a desire to be treated as a person.

Sensational articles appear describing how models supposedly said that during training they “felt pressure similar to pain,” and so on.

So what in all of this is true?

Let us begin with the Chinese Room.


Who knows the story of the reward demanded by the inventor of chess? The Persian shah liked the game so much that he offered its creator any reward he wished.

The request seemed modest: one grain of wheat on the first square of the chessboard, two on the second, four on the third, eight on the fourth, and then doubling the number of grains on each subsequent square until the last square of the board.

But once the calculations begin, it quickly becomes clear that those few grains soon turn into numbers exceeding the total grain production of the entire Earth.
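The total is easy to compute in a few lines of Python:

```python
# One grain on the first square, doubled on each of the 64 squares:
# 1 + 2 + 4 + ... + 2**63 grains in total.
total = sum(2**i for i in range(64))
print(total)  # 18446744073709551615, i.e. roughly 1.8 * 10**19 grains
assert total == 2**64 - 1
```

On the order of ten quintillion grains: the innocent doubling overwhelms any harvest the shah could command.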

Something similar happens with the Chinese Room. The number of possible 20-word sequences drawn from a typical 50,000-word vocabulary exceeds… the number of atoms in the observable universe! No rulebook, and no room however large, could hold a prepared reply for every possible input.
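That count is easy to verify, taking 10**80 as the usual rough estimate of the number of atoms in the observable universe:

```python
import math

vocab_size = 50_000
sentence_length = 20
sequences = vocab_size ** sentence_length  # every possible 20-word sequence
atoms_in_universe = 10 ** 80               # common rough estimate

print(math.log10(sequences))  # about 93.98, i.e. ~10**94 sequences
assert sequences > atoms_in_universe
```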

The Chinese Room collapses.


So what about the “stochastic parrot”?

This metaphor suggests that a language model statistically parrots linguistic patterns without possessing any understanding of what lies beneath them (in fact this is essentially the same thesis as the Chinese Room, except that instead of a deterministic rulebook we now have probabilities—some responses are simply more likely than others).

What does a language model actually do? It predicts the probability of the next word… so it’s just a parrot.
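What “predicting the probability of the next word” means can be shown with the simplest possible model: bigram counts over a tiny made-up corpus. Both the corpus and the numbers below are purely illustrative:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows each other word
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_probs(word):
    """Probability distribution over the next word, given the current one."""
    counts = bigrams[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))  # 'cat' with probability 2/3, 'mat' with 1/3
```

A real language model replaces these raw counts with a neural network conditioned on the whole preceding text, but the output is the same kind of object: a probability for every possible next word.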

I was greatly amused by a reductio ad absurdum of this argument presented by Andrzej Dragan in his book “Quo VAIdis.” A child quickly learns that in the game of “tomato,” replying “tomato” to any sentence is rewarded with approval.

But can we apply the same logic to Einstein’s General Theory of Relativity? After all, Einstein also “just predicted the next word” following the previous ones—and yet somehow the General Theory of Relativity emerged from that process.

Instead of continuing to argue with this theory, let us look at where a fascinating field of science, barely a few years old, is taking us: Mechanistic Interpretability.

Mechanistic interpretability is a very young field of research devoted to dissecting electronic brains. Scientists pause training runs, or single out specific groups of neurons and mechanisms within language models (for example the attention mechanism, which we will discuss later). They then examine which groups of neurons activate during different kinds of tasks performed by the model, and what happens when certain neural connections are strengthened or certain groups of neurons are disabled.

The experiments revealed a fascinating picture.

It turned out that in order to generate the next token most effectively, models had to develop structures enabling efficient manipulation of language.

And what good is operating language efficiently if we do not understand what we are talking about?

(Good examples here are the conversation with the simple Phi model discussed in the previous episode, or the dialogue with ELIZA at the beginning of this article.)

Gigantic neural networks of modern large language models, by minimizing the training loss function, evolved in such a way that in order to talk about the world effectively they had to develop structures that represent an understanding of the world.

Why was this such a major discovery, and why did it cause so much surprise?

The nature of the neural networks used to build language models is that they are extremely effective at solving problems (there is even a mathematical proof, the universal approximation theorem, that a sufficiently large neural network can approximate any continuous function), but once they produce a solution, it becomes almost impossible to understand how the problem was solved.

After thousands of iterations the weights on the synapses between neurons simply settle into place—and that’s it. It is unclear how to extract knowledge from them or reconstruct the formal reasoning process.

This very nature of neural networks led to intense debates among researchers and widely diverging opinions—from stochastic parrots to machine consciousness.


If you are not familiar with the concept of neural networks, let me briefly explain it here.

Mathematicians and computer scientists copied nature, simplifying it and reducing it to a mathematical model. In computer memory we create “neurons” and “synapses” connecting those neurons.

These are simply numbers. A neuron holds an activation value (in the simplest picture, 0 or 1), while a synapse is a multiplier, for example ×0.25, applied to the signal passing between two neurons.

Suppose our neural network is connected to a 16×16 pixel camera and must recognize squares.

We can therefore build a network consisting of 256 (16×16) input neurons, each receiving either 1 or 0 depending on what appears in the image.

Next we add one or more hidden layers consisting of a similar number of neurons connected in a web-like structure to the input layer.

Finally, at the output we have two neurons which we might call “square” and “not square,” indicating whether the observed object is a square.

We then train the network by showing it hundreds of images of squares in different positions, rotated, etc., as well as images of objects that are not squares.

Through an appropriate mathematical procedure (in practice, backpropagation with gradient descent), the synaptic weights are nudged after every answer: connections that push the network toward the correct output are strengthened, and those that push it toward the wrong one are weakened. In this way we minimize the loss function.

There are many methods of training neural networks, but their result is always the modification of synaptic weights. This is what we call the parameters of the model.

Thus when we hear that a model has 7 billion parameters (in 2026 this is considered a small language model), it simply means that it contains about 7 billion synaptic weights.

After proper training our neural network should learn to recognize squares and also respond correctly to new situations—meaning it can recognize squares it has never seen before.

Where exactly “square recognition” resides inside such a mathematical brain cannot easily be determined—it is distributed across thousands of synapses simultaneously.
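The square-recognizer described above can be sketched in a few lines of numpy. The layer sizes follow the text; everything else is illustrative, and the weights here are random placeholders, since a real network would acquire its weights through training rather than initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: 256 input pixels (16x16), one hidden layer, 2 outputs
n_in, n_hidden, n_out = 256, 128, 2

# "Synapses" are just weight matrices; training would tune these numbers
W1 = rng.normal(0, 0.1, (n_in, n_hidden))
W2 = rng.normal(0, 0.1, (n_hidden, n_out))

def forward(pixels):
    h = np.maximum(0, pixels @ W1)   # hidden neurons (ReLU activation)
    logits = h @ W2                  # "square" and "not square" neurons
    e = np.exp(logits - logits.max())
    return e / e.sum()               # probabilities for the two classes

image = rng.integers(0, 2, n_in)     # a random 16x16 bitmap, flattened
probs = forward(image)               # two probabilities that sum to 1

# Counting every synapse gives the "parameters" of this toy model
n_params = W1.size + W2.size         # 256*128 + 128*2 = 33,024
```

Even in this toy, the “knowledge” after training would live nowhere in particular: it would be smeared across all 33,024 weights at once.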


One of the first experiments revealing that a language model had constructed a representation of the world inside its “brain,” purely from reading text, was the Othello experiment (using the game Reversi).

The model was trained on texts describing the game, and then researchers played the game with the model while simultaneously analyzing what was happening inside its neurons.

What did they find?

The model had developed an internal representation of the board—neurons corresponding to squares on the board were identified.
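How researchers “identify neurons” can be illustrated with a probe: a simple classifier trained to read a chosen feature out of a network’s internal activations. The data below is synthetic and the planted dimension is made up; it demonstrates the technique only, not the actual Othello experiment:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for hidden activations: 500 "positions", 64 dimensions.
# We plant a signal: dimension 7 happens to encode one made-up board feature.
n_positions, n_dims = 500, 64
acts = rng.normal(size=(n_positions, n_dims))
feature = (acts[:, 7] > 0).astype(float)   # e.g. "is this square occupied?"

# A linear probe: fit weights by least squares to read the feature back out
w, *_ = np.linalg.lstsq(acts, feature - 0.5, rcond=None)
predicted = (acts @ w > 0).astype(float)
accuracy = (predicted == feature).mean()   # high: the feature is linearly readable
```

If a probe like this can reliably recover the board state from the activations, the information must be represented inside the network; that is the logic behind the Othello result.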

During training it turned out that it was more advantageous for the network to actually understand what the texts were about than to remain a “stochastic parrot.”

This phenomenon was called emergent world representations.

It turns out that the most effective strategy for minimizing the error when predicting the next word is simply to understand the text—to build a neural representation of what the text describes.

In short: to develop structures of knowledge.


Mechanistic interpretability has made other discoveries as well. Researchers looked inside a mechanism called attention heads, of which modern models contain several thousand.

These are smaller neural networks that process the input text. Without the attention mechanism there would be no modern AI.

The breakthrough paper in this field was “Attention Is All You Need,” published in 2017 by researchers at Google. Previously rather weak language models suddenly began to perform remarkably well once this mechanism was introduced.

Yet this division of labor was not designed by anyone: the training process itself distributed these attention functions among the heads.

It turned out that the model independently learned to analyze text through specialization of attention heads: one head looks for the subject, another searches for repeating patterns, others track who is who in the text.

Advanced LLMs develop thousands of such attention heads.

And what is most fascinating about it?

This happens without human intervention.


The attention mechanism takes the entire text placed in the prompt and analyzes relationships between all the words it contains. In a multidimensional space it constructs a map of meanings—how close words are to each other semantically—instead of simply processing the sequence of words as earlier models did.

The result is then passed through many layers of neural networks, while attention heads continue repeatedly drawing new maps of meaning.
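The computation described here, every token attending to every other, can be sketched as single-head scaled dot-product attention. This is a minimal version that omits the learned projection matrices of a real transformer, and the sizes are illustrative:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every token looks at every other token."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # how strongly each pair of tokens relates
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # attention weights: each row sums to 1
    return w @ V                             # each output is a weighted mix of all tokens

rng = np.random.default_rng(0)
tokens, dim = 5, 8                           # 5 token vectors of dimension 8
Q = rng.normal(size=(tokens, dim))
K = rng.normal(size=(tokens, dim))
V = rng.normal(size=(tokens, dim))
out = attention(Q, K, V)                     # one mixed vector per token
```

Stacking many such heads in many layers, each redrawing its own map of relations, is what the text above calls “repeatedly drawing new maps of meaning.”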

In the early layers simple sentence semantics appear; in deeper layers more general abstractions emerge; and in layers numbered in the dozens (ChatGPT has about 96 layers) structures responsible for understanding mathematical concepts and high-level abstractions appear during training.

In 2020, when models reached sufficient scale, researchers discovered emergent capabilities. When enough layers and “neurons” were added, entirely new abilities began to appear during training: chains of reasoning, modeling the mental states of others, mathematical induction.

And what is most interesting?

No one explicitly asked the AI to learn these abilities. They emerged naturally during the training of sufficiently large neural networks.

This opens an entire set of ontological questions: is the ability to think simply a natural property of a sufficiently complex structure subjected to training?

In any case, it is there—in those deep layers—that very elusive structures arise. Because of the nature of neural networks it is extremely difficult to determine precisely what each structure does. Yet this is where the wonders of artificial intelligence reside: understanding emotions, irony, and performing logical proofs.

Another interesting observation is that adding more neurons no longer improves reasoning ability beyond a certain point—there seems to be a limit, and additional neurons simply remain unused. Their number is already approaching that of the human brain.

Perhaps this reflects limitations of language itself and knowledge constructed through language.


In the next episode we will talk about the mathematical map of meaning, about attempts to define language and knowledge, about the galactic representations of language and knowledge—and about how AI managed in just a few years to hack something that linguists had been trying to understand for decades.

And finally about what is the key to understanding artificial intelligence, and how our thinking is represented mathematically—as shapes with thousands of dimensions in the space of meaning.
