Large Language Models (LLMs) are built on the **transformer architecture**, which replaced older recurrent networks because it can process entire sequences in parallel and capture long-range dependencies far more effectively. The core mechanism is **self-attention**: the model computes how much each token (roughly a word or subword) in the input should attend to every other token, producing a weighted representation that emphasizes relevant context. These attention layers are stacked many times (often 30–100+ layers) in an encoder-decoder or decoder-only design (modern LLMs like GPT, Llama, Grok, etc. are decoder-only). The network also uses positional encodings to preserve word order and feed-forward layers for additional non-linear transformations. The result is that after training, every token's final representation carries rich information about its context (in a decoder-only model, everything that precedes it).
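To make the mechanism concrete, here is a minimal sketch of a single attention head as scaled dot-product attention in NumPy. The toy dimensions, weight matrices, and function names are illustrative only; the causal mask mirrors the decoder-only setup described above.

```python
# A minimal, single-head self-attention sketch (toy sizes, random weights).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=True):
    """X: (seq_len, d_model) token embeddings -> contextualized outputs."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # project to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # how much each token attends to each other token
    if causal:                                   # decoder-only models mask future positions
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                          # toy sizes, not a real model
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                                 # (4, 8): each token's output depends on its (earlier) context
```

In a real transformer this runs with many heads in parallel, and the output is followed by the feed-forward layer and repeated dozens of times.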
LLMs are trained in two main stages. The first is **pre-training** on enormous text corpora (trillions of tokens from the internet, books, code, etc.) using next-token prediction (autoregressive language modeling): the model learns to predict the next token in a sequence given all previous tokens, and this simple objective forces it to acquire syntax, semantics, world knowledge, reasoning patterns, and even stylistic nuances. The objective is minimized with gradient descent over billions or trillions of parameters, usually on thousands of GPUs/TPUs for weeks or months. A smaller second stage, **fine-tuning** (often instruction tuning or RLHF, Reinforcement Learning from Human Feedback), aligns the model to follow instructions, be helpful, reduce harmful outputs, and stay coherent in dialogue and other tasks.
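In code, the pre-training objective boils down to cross-entropy on shifted token sequences. Below is a toy sketch in PyTorch; the stand-in model, tiny vocabulary, and random batch are purely illustrative, not a real training setup.

```python
# Toy sketch of next-token prediction with cross-entropy (PyTorch).
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 16
model = nn.Sequential(                                 # stand-in for a full transformer stack
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, vocab_size, (8, seq_len))    # a batch of token IDs (random here)
inputs, targets = tokens[:, :-1], tokens[:, 1:]        # predict token t+1 from tokens up to t

logits = model(inputs)                                 # (batch, seq_len-1, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                                        # one gradient descent step
optimizer.step()
optimizer.zero_grad()
```

Real pre-training repeats this step across trillions of tokens; fine-tuning reuses the same machinery but with curated instruction data or a reward signal instead of raw web text.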
At inference time, the model works autoregressively: you give it a prompt, it converts the text into tokens, passes them through the transformer layers to produce a probability distribution over the next token, samples from that distribution (or greedily picks the most likely token, with temperature, top-p, etc. controlling how much randomness is allowed), appends the result to the input, and repeats until an end-of-sequence token or a maximum length is reached. This process explains both their fluency (a product of massive pre-training) and their occasional hallucinations: they are essentially very sophisticated pattern-matchers that can confidently continue plausible-sounding but factually wrong sequences when the correct continuation was underrepresented in the training data. The larger the model and the better the training data, the more reliable and capable the completions become.
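A rough sketch of that decoding loop, with temperature and top-p (nucleus) sampling; `model`, `eos_id`, and the default parameter values are placeholders rather than any particular implementation.

```python
# Minimal autoregressive decoding loop with temperature and top-p sampling.
import numpy as np

def sample_next(logits, temperature=0.8, top_p=0.9, rng=np.random.default_rng()):
    logits = logits / temperature
    logits = logits - logits.max()                     # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()                               # softmax with temperature
    order = np.argsort(probs)[::-1]                    # most likely tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest set covering top_p mass
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

def generate(model, prompt_tokens, max_new_tokens=100, eos_id=0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                         # distribution over the next token
        next_token = sample_next(logits)
        tokens.append(next_token)                      # append and repeat
        if next_token == eos_id:                       # stop at end-of-sequence
            break
    return tokens
```

Lowering the temperature or top-p makes the loop favor the highest-probability continuations (more deterministic, less creative); raising them spreads probability mass over more candidates.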

