How I Learned What Actually Happens Inside an LLM

Recently I spent some time trying to understand what actually happens inside a large language model (LLM). At first, it looked like magic, but after digging deeper, I realised that the core idea is surprisingly simple.

An LLM is essentially a very large neural network trained on huge amounts of text data. During training, it reads billions or even trillions of words collected from books, articles, websites, code repositories, research papers, and many other sources.

The goal of the model is simple:

Predict the next token given all previous tokens.

Everything else emerges from this objective.

The Basic Idea

Imagine I give the model the sentence:

The capital of France is

The model does not search the internet for the answer. Instead, based on patterns learned during training, it calculates probabilities for possible next tokens.

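The result is a list of candidate tokens, each with a probability, where "Paris" ranks far above less likely continuations.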

The model then selects a token based on these probabilities.
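To give a feel for what such probabilities might look like, here is a minimal Python sketch. The candidate tokens and their raw scores are invented for illustration and do not come from any real model; softmax is the standard way to turn raw scores into probabilities.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates and scores for "The capital of France is"
candidates = ["Paris", "London", "a", "located"]
logits = [6.0, 2.0, 1.5, 1.0]

probabilities = softmax(logits)
for token, p in zip(candidates, probabilities):
    print(f"{token!r}: {p:.3f}")

# Greedy selection: always pick the single most likely token
best_token = candidates[probabilities.index(max(probabilities))]
print(best_token)
```

Greedy selection is only one option. In practice the model often samples from the distribution instead, so the same prompt can produce different outputs.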

After generating "Paris", it repeats the same process for the next token.

Input:

The capital of France is Paris

Generation is simply this prediction process repeated over and over.
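The loop can be sketched in a few lines of Python. The lookup table below is a hypothetical stand-in for a trained model: it only looks at the last token, whereas a real LLM conditions on all previous tokens.

```python
# Hypothetical stand-in for a trained model: maps the last token to
# probabilities for the next token.
TOY_MODEL = {
    "is": {"Paris": 0.9, "London": 0.1},
    "Paris": {".": 0.8, ",": 0.2},
    ".": {"<end>": 1.0},
}

def next_token(tokens):
    """Return the most likely next token given the tokens so far."""
    distribution = TOY_MODEL.get(tokens[-1], {"<end>": 1.0})
    return max(distribution, key=distribution.get)

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<end>":
            break
        tokens.append(token)  # the output becomes part of the next input
    return tokens

result = generate(["The", "capital", "of", "France", "is"])
print(" ".join(result))
```

The important detail is the append inside the loop: each generated token is added to the input before the next prediction is made.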

Why Neural Networks Need Training

An LLM is made up of billions of parameters (often called weights). These weights determine how information flows through the network. Initially, the weights are random, and the model produces nonsense.

During training:

The model predicts the next token.
The prediction is compared with the actual token.
The error is calculated.
The weights are adjusted slightly.
The process repeats billions of times.

Over time, the network gradually learns:

grammar
sentence structure
facts
reasoning patterns
coding syntax
language relationships

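This learning process relies on two pieces working together: backpropagation, which calculates how much each weight contributed to the error, and an optimisation algorithm such as gradient descent, which uses that information to adjust the weights.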
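The training steps can be sketched with a deliberately tiny example. This is not how a real LLM is implemented: the "model" here is just one score per token in a made-up three-token vocabulary, and the gradient is written out by hand instead of being computed by backpropagation through many layers.

```python
import math

def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-token vocabulary; the "weights" are one score per token
vocab = ["Paris", "London", "banana"]
weights = [0.0, 0.0, 0.0]       # untrained: every token is equally likely
target = vocab.index("Paris")   # the actual next token in the training text
learning_rate = 0.5

losses = []
for step in range(20):
    probs = softmax(weights)             # predict the next token
    loss = -math.log(probs[target])      # compare with the actual token (cross-entropy)
    losses.append(loss)
    # Gradient of the loss for each score: probability minus 1 for the target, minus 0 otherwise
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Adjust the weights slightly in the direction that reduces the error
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The loss shrinks with every step because each update nudges the score of the correct token up and the scores of the others down.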

The First Problem: Computers Do Not Understand Words

Computers do not understand English the way humans do. To a computer, words such as 'cat', 'dog', 'apple', or 'happy' are simply sequences of characters with no inherent meaning. It does not naturally know that 'cat' and 'dog' are related animals, that 'happy' and 'joyful' have similar meanings, or that 'good' and 'bad' are opposites. To make language understandable to a neural network, we need a way to represent the meaning and relationships between words mathematically.

Word Embeddings

The solution is to convert words into vectors, which are simply lists of numbers. Words such as "good," "bad," "happy," and "sad" are represented by numerical vectors instead of plain text. These vectors place words in a high-dimensional mathematical space where words with similar meanings end up closer together and unrelated words end up farther apart. (Opposites such as "good" and "bad" can still land fairly close to each other, because they tend to appear in similar contexts.) The model learns these relationships automatically during training, allowing it to capture semantic meaning mathematically. This idea, known as word embeddings, was a major breakthrough because it gave computers a way to represent and reason about relationships between words rather than treating them as unrelated symbols.
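Closeness between vectors is commonly measured with cosine similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return a value near 1 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example
embeddings = {
    "happy":  [0.9, 0.8, 0.1],
    "joyful": [0.8, 0.9, 0.2],
    "table":  [0.1, 0.0, 0.9],
}

similar = cosine_similarity(embeddings["happy"], embeddings["joyful"])
unrelated = cosine_similarity(embeddings["happy"], embeddings["table"])
print(f"happy vs joyful: {similar:.2f}")
print(f"happy vs table:  {unrelated:.2f}")
```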

The Vocabulary Problem

Once words could be represented mathematically, another challenge appeared: vocabulary size. English contains hundreds of thousands of words, and many words have multiple forms, such as "run", "running", "runner", and "runnable". If every word received its own token, the vocabulary would become extremely large, making training and storage much more expensive.

Character-Based Tokenization

One possible solution was to split text into individual characters. This greatly reduces vocabulary size because only letters, numbers, and symbols need to be stored. However, words become much longer sequences of characters, which increases computation and makes learning language relationships more difficult. Researchers needed a solution between full-word tokens and character-level tokens.

Subword Tokenization

The solution was subword tokenization. Instead of treating words as complete units or breaking them into individual characters, words are split into reusable pieces. For example, playing can become play + ing, while unbelievable can become un + believ + able. This approach keeps the vocabulary manageable while still preserving meaning. Most modern LLMs use some form of subword tokenization.
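One simple way to split a word into known pieces is greedy longest-match, sketched below. The vocabulary is hypothetical and hand-written for this example; real tokenizers learn theirs from data and may use a different splitting strategy.

```python
# Hypothetical subword vocabulary; a real one is learned from data
VOCAB = {"play", "ing", "un", "believ", "able", "ed"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece matches: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

print(tokenize("playing"))
print(tokenize("unbelievable"))
```

The single-character fallback means any word can still be tokenized, even one the vocabulary has never seen.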

Byte Pair Encoding (BPE)

One of the most popular tokenization methods is Byte Pair Encoding (BPE). The process starts with individual characters and repeatedly merges the adjacent pair of symbols that appears most frequently in the training text. Over time, common patterns become complete tokens. This allows the vocabulary to be built automatically from real language usage rather than being manually designed.
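The merge procedure can be sketched as follows. This is a simplified version that assumes a tiny made-up corpus of word counts; production tokenizers add details such as byte-level handling and special tokens.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary."""
    # Start with every word split into single characters
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus: word -> number of occurrences
counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, corpus = learn_bpe(counts, num_merges=3)
print(merges)
print(corpus)
```

After three merges the frequent word "hug" has become a single token, while rarer words are still made of smaller pieces.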

From Tokens to Token IDs

After tokenization, each token is assigned a unique numerical identifier called a token ID. For example, a sentence such as "I love pizza" may be converted into a sequence of token IDs. These IDs allow text to be processed by computers efficiently, but by themselves they contain no meaning. The number assigned to a token is simply a label.
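A minimal sketch of the mapping, assuming a made-up four-entry vocabulary (real vocabularies typically contain tens of thousands of tokens, and the actual IDs depend on the tokenizer):

```python
# Hypothetical vocabulary; the IDs are arbitrary labels
vocab = {"I": 0, " love": 1, " pizza": 2, " pasta": 3}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens):
    """Turn a list of tokens into a list of token IDs."""
    return [vocab[token] for token in tokens]

def decode(token_ids):
    """Turn a list of token IDs back into text."""
    return "".join(id_to_token[token_id] for token_id in token_ids)

ids = encode(["I", " love", " pizza"])
print(ids)
print(decode(ids))
```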

The Embedding Layer

Since token IDs do not carry meaning, they are passed through an embedding layer, which converts each token ID into a dense vector of numbers. This is where language is transformed into mathematics. Similar words tend to have similar embeddings, allowing the model to learn relationships between concepts. The transformer never works directly with words; it operates entirely on these embedding vectors.
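Conceptually, an embedding layer is a lookup table with one row of numbers per token ID. The sketch below fills the table with random values only to show the mechanics; in a real model these values are learned during training, and the sizes used here are arbitrary.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
VOCAB_SIZE = 4      # assumed size, far smaller than any real vocabulary
EMBEDDING_DIM = 3   # assumed size, real models use hundreds or thousands

# One row of numbers per token ID
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([0, 1, 2])
for token_id, vector in zip([0, 1, 2], vectors):
    print(token_id, [round(x, 2) for x in vector])
```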

The Transformer

The next stage is the Transformer, introduced in the 2017 paper "Attention Is All You Need". The transformer is built around the attention mechanism, which allows each token to focus on other relevant tokens in the sentence when determining meaning. For example, in the sentence:

The animal didn't cross the street because it was tired.

the model can learn that "it" refers to "animal" rather than "street" by paying attention to the surrounding context. This ability to capture relationships between words at different positions is one of the main reasons transformers became the foundation of modern LLMs.
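The core calculation, scaled dot-product attention, can be sketched in plain Python. This is a simplification: the two-dimensional vectors are invented, the same vectors are used as queries, keys, and values, and a real transformer would first pass them through learned projections and use many attention heads.

```python
import math

def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(dimension)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors
        output = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors: "it" is deliberately placed close to "animal"
tokens = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = attention(vectors, vectors, vectors)

for token, w in zip(tokens, weights[2]):
    print(f"'it' attends to {token!r} with weight {w:.2f}")
```

Because the vector for "it" points in nearly the same direction as the one for "animal", it gives "animal" more weight than "street".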

Complete Text-to-Output Flow

Text → Tokens → Token IDs → Embeddings → Transformer → Next-Token Probabilities → Selected Token

Why LLMs Seem Intelligent

Interestingly, the model was never explicitly taught:

grammar
coding
reasoning
translation
summarization

It only learned to predict the next token.

Yet because it saw enormous amounts of text, many capabilities emerged naturally.

This phenomenon is called 'emergent behaviour'.

Limitations

Even though LLMs are powerful, they have important limitations.

Hallucinations

The model may generate information that sounds correct but is actually false.

Examples:

fake references
fake URLs
invented facts

The model optimises for likely text, not truth.

Bias

The model learns from human-generated data.

If the training data contains biases, the model may reproduce them.

Examples include:

language bias
cultural bias
selection bias
political bias

Common LLM Settings

Modern LLMs expose several settings that control how text is generated.

Temperature

Temperature controls randomness.

A low temperature makes the model choose the most likely next token more often, producing focused and predictable answers.

A high temperature increases randomness, making outputs more creative but sometimes less reliable.

Example: Use a low temperature for factual Q&A and a higher temperature for brainstorming or creative writing.
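Temperature is usually applied by dividing the raw scores by the temperature before converting them to probabilities. A minimal sketch with invented scores:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the scores by the temperature, then apply softmax (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 2.0)
print("low temperature: ", [round(p, 3) for p in low])
print("high temperature:", [round(p, 3) for p in high])
```

At a low temperature almost all of the probability goes to the top candidate, while at a high temperature the distribution flattens and less likely tokens get a real chance of being picked.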

Top-P (Nucleus Sampling)

Top-P controls the pool of candidate tokens the model can choose from.

Instead of considering every possible token, the candidates are ranked from most to least likely and added to the pool one by one until their combined probability reaches the Top-P value. The model then samples from that smaller set.

Example: A lower Top-P produces more focused responses, while a higher Top-P allows more variety.
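A minimal sketch of nucleus sampling, assuming an invented probability distribution over four tokens:

```python
import random

def top_p_filter(probabilities, top_p):
    """Keep the most likely tokens until their combined probability reaches top_p."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1 again
    return {token: p / cumulative for token, p in kept.items()}

def sample(probabilities, rng=random):
    """Draw one token at random, weighted by probability."""
    tokens = list(probabilities)
    return rng.choices(tokens, weights=[probabilities[t] for t in tokens])[0]

# Hypothetical next-token probabilities
probs = {"Paris": 0.6, "London": 0.25, "Rome": 0.1, "banana": 0.05}
nucleus = top_p_filter(probs, top_p=0.8)
print(nucleus)
print(sample(nucleus))
```

With a Top-P of 0.8, only "Paris" and "London" survive here, so the unlikely "banana" can never be chosen.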

Max Length

Max length limits how many tokens the model can generate before stopping.

Example: Useful when generating summaries, short answers, or responses with strict length limits.

Stop Sequences

A stop sequence is a token or string that ends generation as soon as the model produces it. The stop sequence itself is usually left out of the output.

Example: If the stop sequence is "11.", asking for a numbered list will cause generation to stop before item 11, producing only 10 items.
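The effect can be imitated after the fact by cutting the text at the first stop sequence. This is only a sketch: real APIs check for stop sequences while generating, and the raw output below is fabricated for the example.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

# Hypothetical model output: a numbered list that would run to 12 items
raw_output = "".join(f"{n}. item {n}\n" for n in range(1, 13))

result = truncate_at_stop(raw_output, ["11."])
print(result)
```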

Frequency Penalty

Frequency penalty reduces the chance of repeatedly generating words that have already appeared many times.

The more often a token appears, the stronger the penalty becomes.

Example: Useful when the model keeps repeating the same phrases or keywords.

Presence Penalty

A presence penalty discourages reuse of tokens that have already appeared at least once.

Unlike a frequency penalty, the penalty does not increase with repeated occurrences.

Example: Useful for encouraging the model to introduce new topics or ideas instead of revisiting the same ones.
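One common way to apply both penalties is to subtract them from the raw token scores before sampling. The sketch below follows that formulation as an assumption; the exact formula and the valid ranges vary between providers, and the scores and history are invented.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of tokens that have already been generated."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        adjusted[token] = (
            logit
            - count * frequency_penalty                    # grows with every repeat
            - (1 if count > 0 else 0) * presence_penalty   # flat, applied once
        )
    return adjusted

# Hypothetical scores and generation history
logits = {"pizza": 2.0, "pasta": 1.5, "salad": 1.0}
history = ["pizza", "pizza", "pizza", "pasta"]

adjusted = apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.3)
print(adjusted)
```

"pizza" starts with the highest score but ends with the lowest because it has already appeared three times, while "salad" is untouched because it has not appeared at all.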

Prompt Engineering and Common Prompting Techniques

Now that we have gone through the core concepts of LLMs, let's talk about prompting. At its core, prompting is about influencing probabilities.

An LLM generates text by predicting the most likely next token. Every word, instruction, example, or piece of context you add to a prompt changes those probabilities. This is why even small changes in wording can sometimes produce very different outputs.

In general, the more relevant information you provide, the higher the probability that the model will generate the output you are looking for.

General Prompting Tips

A few simple practices usually improve results:

Start with a clear and simple instruction.
Use action words such as 'write', 'summarise', 'classify', 'translate', 'explain', or 'extract'.
Separate instructions, context, input data, and expected output format.
Be specific about what you want.
Avoid adding unnecessary details that may confuse the model.

A common prompt structure looks like:

Instruction

------------

Context

------------

Input Data

------------

Expected Output Format

In many cases, this simple structure is enough to get good results.
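If prompts are assembled in code, the structure can be captured in a small helper. The function name and the example texts below are hypothetical; only the four sections and the separator come from the structure described above.

```python
SEPARATOR = "------------"

def build_prompt(instruction, context, input_data, output_format):
    """Join the four sections with a separator line between them."""
    sections = [instruction, context, input_data, output_format]
    return f"\n\n{SEPARATOR}\n\n".join(section.strip() for section in sections)

prompt = build_prompt(
    instruction="Classify the sentiment of the review.",
    context="Reviews come from a food delivery app.",
    input_data="Review: The pizza arrived cold.",
    output_format="Answer with one word: positive, negative, or neutral.",
)
print(prompt)
```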

Common Prompting Techniques

Zero-Shot Prompting

Zero-shot prompting means giving the model only instructions without examples.

Example uses:

Summarization
Translation
Text classification
Explaining concepts
Generating ideas

This works best when the task is straightforward and the model already understands what is being asked.

Few-Shot Prompting

Few-shot prompting provides a few examples before asking the model to solve a new problem.

The examples show the pattern that should be followed.

Example uses:

Specific output formats
Domain-specific tasks
Unusual question styles
Consistent response formatting

The model learns the pattern from the examples and continues it.
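A few-shot prompt is mostly string assembly. The sketch below uses a hypothetical "Text:" and "Label:" layout and made-up examples; any consistent layout should work. Passing an empty list of examples produces a zero-shot prompt.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Place worked examples before the new input so the model can continue the pattern."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Text: {text}", f"Label: {label}", ""]
    # End with an open "Label:" so the most likely continuation is the answer
    lines += [f"Text: {new_input}", "Label:"]
    return "\n".join(lines)

examples = [
    ("The delivery was fast and the food was hot.", "positive"),
    ("My order never arrived.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each text.",
    examples,
    "The pizza arrived cold.",
)
print(prompt)
```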

Chain of Thought (CoT)

Chain of Thought encourages the model to reason through multiple steps before reaching an answer.

Instead of jumping directly to a conclusion, the model breaks the problem into smaller reasoning steps.

Example uses:

Mathematics
Logic problems
Planning
Multi-step analysis

This technique is useful when solving the problem requires intermediate reasoning.

Self-Consistency

Self-consistency extends Chain of Thought.

Instead of relying on a single reasoning path, the model is sampled several times to produce multiple independent reasoning paths, and the final answer that appears most often across them is selected.

Example uses:

Mathematical reasoning
Competitive problem-solving
Logic puzzles
Scientific reasoning

This often improves accuracy on difficult reasoning tasks.
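The selection step is a simple majority vote. In the sketch below, sample_answer is a hypothetical function standing in for one sampled model run that returns only the final answer; here it replays canned answers so the example runs without a model.

```python
from collections import Counter

def self_consistent_answer(sample_answer, question, num_samples=5):
    """Sample several answers and return the most common one with its vote count."""
    answers = [sample_answer(question) for _ in range(num_samples)]
    most_common_answer, votes = Counter(answers).most_common(1)[0]
    return most_common_answer, votes

# Hypothetical stand-in for a model: replays canned final answers
canned = iter(["42", "41", "42", "42", "40"])

def fake_sample(question):
    return next(canned)

answer, votes = self_consistent_answer(fake_sample, "What is 6 * 7?")
print(answer, votes)
```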

Tree of Thoughts (ToT)

Tree of Thoughts expands the idea further.

Instead of following one reasoning path, the model explores multiple possible paths, evaluates them, discards weak paths, and continues exploring promising ones.

You can think of it as a search algorithm operating over reasoning steps.

Example uses:

Complex planning
Strategy problems
Product design
Business decision-making
Puzzle solving
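The search idea can be sketched as a beam search. In a real Tree of Thoughts setup, expanding a path and scoring it would both be model calls; here they are replaced by hypothetical functions for a toy problem (pick three numbers that add up to a target) so the example runs on its own.

```python
def tree_of_thoughts(initial, expand, score, beam_width=2, depth=3):
    """Level-by-level search that keeps only the best partial paths at each level."""
    frontier = [initial]
    for _ in range(depth):
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        # Evaluate every candidate and discard all but the most promising
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)

# Toy problem: choose three numbers from 1, 3, 5 that sum to TARGET
TARGET = 11

def expand(path):
    """Propose the possible next steps from a partial path."""
    return [path + [n] for n in (1, 3, 5)]

def score(path):
    """Higher is better: how close the running sum is to the target."""
    return -abs(TARGET - sum(path))

best = tree_of_thoughts([], expand, score)
print(best, sum(best))
```

The beam width controls the trade-off: a wider beam explores more paths and costs more, while a width of 1 reduces the search to a single greedy path.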

Prompt Chaining

Prompt chaining breaks a large task into multiple smaller prompts.

The output of one prompt becomes the input of the next prompt.

Example:

Research → Summarize → Generate Report → Review Report

This often produces better results than asking the model to do everything in a single prompt.
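In code, a chain is a loop that feeds each output into the next prompt. The call_model argument below is a hypothetical function that takes a prompt and returns text; a fake version is used here so the flow is visible without a real model.

```python
def run_chain(call_model, prompt_templates, initial_input):
    """Feed each step's output into the next step's prompt."""
    text = initial_input
    for template in prompt_templates:
        text = call_model(template.format(input=text))
    return text

def fake_model(prompt):
    # Hypothetical stand-in for a model call: wraps the prompt in brackets
    return f"[{prompt}]"

steps = [
    "Summarize: {input}",
    "Write a report from: {input}",
    "Review: {input}",
]
result = run_chain(fake_model, steps, "raw research notes")
print(result)
```

Keeping each step separate also makes it easier to inspect or retry one stage without rerunning the whole chain.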

RAG (Retrieval-Augmented Generation)

RAG provides external knowledge to the model before it generates an answer.

Instead of relying only on what the model learned during training, a retrieval step finds relevant information in documents, databases, or knowledge bases and adds it to the prompt.

Example uses:

Company documentation
Internal knowledge bases
PDFs
Product manuals
Frequently changing information

This can significantly reduce hallucinations when accurate source information is available.
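The two steps, retrieve and then generate, can be sketched as follows. The retrieval here is a crude word-overlap ranking used only as a stand-in; real systems typically compare embedding vectors, and the documents are invented.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())

    def overlap(doc):
        return len(question_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:top_k]

def build_rag_prompt(question, documents):
    """Put the retrieved text in front of the question."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical knowledge base
documents = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
]
prompt = build_rag_prompt("How long is the refund window?", documents)
print(prompt)
```

The model never searches anything itself in this setup. It simply receives a prompt that already contains the relevant passage.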

Reflection and Critique Prompting

In this technique, the model reviews its own answer before producing a final response.

The model acts as its own reviewer and attempts to identify mistakes, missing information, or weak reasoning.

Example uses:

Code review
Report generation
Research summaries
Quality checking

Role Prompting

Role prompting assigns a role or perspective to the model.

Example:

"Act as a software architect."

This helps guide the style, vocabulary, and perspective of the response.

Final Thoughts

Prompt engineering is an iterative process rather than a fixed set of rules. Different models, model versions, and providers may respond differently to the same prompt.

A technique that works well today may become less important as models improve. Modern LLMs are becoming increasingly good at understanding user intent, which means many tasks now require much less prompt engineering than they did a few years ago.

However, the core idea remains the same:

Every word added to a prompt changes the probability distribution of the next generated token. Prompt engineering is ultimately the art of shaping those probabilities to guide the model toward the desired output.

My Thoughts on LLMs and Understanding

One debate that comes up often is whether LLMs actually understand things or whether they are simply doing very advanced pattern matching.

Personally, I lean more toward the pattern-matching side.

The interesting thing is that the results often look like understanding. The model can answer questions, explain concepts, write code, summarise documents, and even reason through problems. From the outside, it feels like understanding because the outputs are often correct and useful.

But when I look at how these systems are trained, it still feels like a very sophisticated form of pattern matching. They have seen enormous amounts of data and learned relationships between tokens, concepts, and ideas. When we ask a question, they generate an answer based on those learned patterns.

At the same time, I don't have complete confidence in that opinion.

Human intelligence is the product of billions of years of evolution, an incredibly long process that shaped the human brain. If humans are able to create something with human-like reasoning or learning capabilities in just a few decades, that would be a remarkable achievement. Sometimes it feels difficult to believe that we have already recreated something comparable to human understanding.

I also wonder whether humans themselves are doing something that is not completely different. We are constantly predicting. We predict what people mean, what might happen next, what words to say, and what actions to take. In some sense, human thinking also looks like a prediction process that never completely stops.

If one day we create AI that truly thinks like a human, I suspect it may inherit some of the same limitations that humans have.

Humans need rest. Humans consume energy. Humans cannot think forever without fatigue. Humans forget things. Humans need resources to survive.

If future AI systems develop intelligence that is much closer to human intelligence, I wonder whether similar constraints will appear in some form. Maybe they will need downtime, energy, memory management, or other limitations that resemble human needs.

And if we eventually reproduce every important aspect of human intelligence, then an interesting question appears:

What was the point of building it if we ended up recreating many of the same limitations?

I don't know the answer.

What I do know is that LLMs are one of the most interesting technologies I have ever learned about. Their capabilities are already impressive, and the pace of improvement is surprisingly fast.

The future of AI feels both exciting and uncertain. It is difficult to predict where these systems will be in five, ten, or twenty years. For now, I think the most honest answer is that we are still figuring out what intelligence really is, and that makes the future hard to predict.