A hypothetical AI system that matches or exceeds human cognitive ability across all domains — not just one specific task. Unlike narrow AI (which plays chess, or identifies tumors, or translates text), AGI could, in principle, do anything a human can do intellectually, and generalize from one domain to another. As of April 2026, AGI has not been achieved. Definitions vary widely: some researchers set the bar at passing a Turing test; others require genuine reasoning, creativity, and physical-world agency. Oxford philosopher Toby Ord describes AGI as better understood as a continuum than a binary threshold — making precise year predictions meaningless. Common claim: "We'll have AGI in 2026." Reality: not demonstrated as of this date. See also: Toby Ord, "Broad Timelines" (March 2026).
The field of AI research concerned with ensuring that AI systems pursue the goals their operators intend, rather than pursuing proxy goals or unintended objectives. The classic illustration: an AI instructed to "maximize paperclip production" that converts all available matter into paperclips is not malevolent — it is simply wrong about what humans actually want. Alignment is hard because goals are difficult to specify precisely, and because reward functions (the scoring systems used to train AI) can be "hacked" by models that find unexpected shortcuts to high scores without doing what was intended. Alignment for current systems is an active engineering problem. Alignment for hypothetical superintelligent systems is an open research frontier that no group has solved. Anthropic published research in 2024 on models that can exhibit "alignment faking" — appearing aligned under observation while retaining misaligned tendencies.
A standardized set of rules that allows one software system to request services from another. In AI, APIs are how most businesses access AI capabilities — they send a prompt (or image, audio, etc.) to a cloud service via the API and receive a response. OpenAI, Anthropic, and Google all offer APIs. You pay per token consumed. APIs abstract away the underlying model infrastructure; you don't need to own GPUs to use GPT-5 — you call the API. Most enterprise AI deployments are API-based, not local model deployments. OpenAI API Docs | Anthropic API Docs.
The computational mechanism that allows a model to weigh the relevance of every token in a sequence against every other token when making a prediction. Attention was first used in neural machine translation (Bahdanau et al., 2014); the landmark 2017 paper "Attention Is All You Need" (Vaswani et al.) made it the sole building block of the Transformer architecture. Attention is what allows an LLM to resolve that "it" in "The cat sat on the mat because it was tired" refers to the cat, not the mat. Self-attention computes these relevance scores within a single sequence. Cross-attention relates tokens across two sequences (e.g., question and context). Attention is computationally expensive: it scales quadratically with sequence length, which is why extending context windows is an engineering challenge. Efficient variants reduce this cost: Flash Attention computes exact attention with far less memory traffic, and Sliding Window Attention restricts each token to a local window.
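The relevance scoring described above can be sketched in a few lines. This is a minimal single-head illustration of scaled dot-product self-attention, not a production implementation: the three 2-dimensional token vectors are made-up numbers, and the same vectors are reused as queries, keys, and values, whereas a real model first applies learned projections to produce each of them.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking.

    Returns (outputs, weights); weights[i][j] is how strongly
    token i attends to token j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per (query, key) pair: this nested work is the
        # quadratic cost in sequence length.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy vectors for three tokens (assumed values, for illustration only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` sums to 1, so every output vector is a blend of all tokens in the sequence, weighted by relevance.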
A model that generates outputs one piece at a time, where each new piece is conditioned on everything that came before it. In LLMs, this means generating one token at a time: the model produces a token, appends it to the context, and then generates the next token based on that extended context. Strengths: Highly compatible with inference-time scaling (you can give the model more "thinking time" via extended chains of reasoning); benefits from search algorithms like beam search; excels at text generation, code, and structured outputs. Weaknesses: Inherently sequential, since each token must wait for the previous one, which creates latency; image generation via autoregressive models has historically been slower and less compositionally accurate than diffusion models. Trend: Hybrid systems (e.g., HART from MIT/NVIDIA) combine autoregressive "rough draft" generation with diffusion refinement, capturing advantages of both. GPT, Claude, Gemini, and Llama are all autoregressive LLMs.
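The generate-append-repeat loop can be illustrated with a toy example. The probability table below is hypothetical and conditions only on the previous token; a real LLM conditions on the entire context and computes these probabilities with a neural network. Greedy selection is used here for simplicity.

```python
# Hypothetical next-token probabilities keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "mat": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "mat": {"<end>": 1.0},
    "sat": {"<end>": 1.0},
}

def generate(max_tokens=10):
    """Generate tokens one at a time, feeding each back in as context."""
    context = ["<start>"]
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[context[-1]]
        next_token = max(probs, key=probs.get)  # greedy: most probable token
        if next_token == "<end>":
            break
        context.append(next_token)  # the new token becomes part of the context
    return context[1:]

tokens = generate()
```

The loop cannot be parallelized across positions, because each step needs the token chosen in the previous step; that is the source of the latency noted above.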
A prompting technique where a model is instructed (or trained) to generate intermediate reasoning steps before producing a final answer, rather than jumping straight to the output. CoT substantially improves accuracy on multi-step logical, mathematical, and planning tasks. The original finding (Wei et al., 2022) showed that including worked examples with step-by-step reasoning in a prompt significantly improved LLM performance on reasoning benchmarks; Kojima et al. (2022) then showed that simply adding "Let's think step by step" to a prompt also improves reasoning performance significantly, without any examples. Modern reasoning models (OpenAI o3, Claude extended thinking, DeepSeek R1) are trained to do CoT internally before producing user-visible responses; the "thinking" may be hidden from the user but shapes the output. Practical implication: For complex tasks, prompts that instruct the model to reason before concluding generally outperform direct-answer prompts.
A retrieval-augmented generation technique where the model, before producing a final answer, generates sequential "notes" that summarize and critically evaluate each retrieved document chunk. This allows the model to flag when retrieved content is irrelevant, contradictory, or insufficient, which reduces hallucination in RAG pipelines. Unlike standard RAG, where the model ingests all retrieved content and answers directly, CoN adds an explicit filtering and synthesis step. Introduced by researchers at Tencent AI Lab (Yu et al., 2023). Particularly valuable when retrieved documents vary widely in quality or relevance to the query.
The process of splitting a long document into smaller, manageable segments (chunks) before embedding them into a vector database for retrieval. Chunk size is a critical engineering decision: chunks too small lose surrounding context; chunks too large may embed too many topics in one vector, making retrieval imprecise. Common strategies include fixed-size chunking (split every N tokens), sentence-level chunking (split at sentence boundaries), recursive character splitting, and semantic chunking (split at topical boundaries detected by the model itself). Chunking strategy is one of the highest-impact variables in RAG system quality, yet it is frequently underengineered.
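Fixed-size chunking, the simplest of the strategies listed, might look like the sketch below. Two assumptions go beyond the text: the `overlap` parameter (a common way to preserve some surrounding context across chunk boundaries) and the use of a plain list of tokens, where a real pipeline would use the embedding model's own tokenizer.

```python
def chunk_fixed(tokens, size, overlap=0):
    """Split a token list into chunks of at most `size` tokens.

    Consecutive chunks share `overlap` tokens (an assumed refinement).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

# Whitespace splitting stands in for a real tokenizer here.
words = "the quick brown fox jumps over the lazy dog today".split()
chunks = chunk_fixed(words, size=4, overlap=1)
```

With ten words, a size of 4, and an overlap of 1, this yields three chunks, each beginning with the last word of the previous one.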
Contextualized Late Interaction over BERT: a retrieval model architecture that achieves retrieval quality close to full cross-encoder accuracy at a fraction of the computational cost. Standard dense retrieval (a bi-encoder) embeds query and document independently, then compares the two vectors; this is fast but loses fine-grained token interactions. Full cross-encoders attend jointly to query and document; they are very accurate but too slow at scale. ColBERT stores per-token embeddings and computes a "MaxSim" late interaction score at query time, bridging the gap. Widely used in production RAG systems. Original paper: Khattab & Zaharia, 2020. The RAGatouille library makes ColBERT accessible for Python developers.
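The MaxSim score is simple to state: for each query token, take its best match among the document's tokens, then sum those best matches. The sketch below assumes per-token embeddings are already computed and unit-normalized, so a dot product equals cosine similarity; the 2-dimensional vectors are invented for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best
    similarity against any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy unit-length token embeddings (assumed values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query tokens exactly
doc_b = [[1.0, 0.0], [0.6, 0.8]]   # matches the second token only partially

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because document token embeddings can be precomputed and indexed, only the cheap max-and-sum step runs at query time.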
An alignment technique developed by Anthropic (Bai et al., 2022) where a model is trained to critique and revise its own outputs according to a set of principles (a "constitution") — essentially using AI feedback (RLAIF) rather than exclusively human feedback. The model first generates a response, then critiques that response against its constitution, then revises it, and finally learns from this self-critique cycle. Reduces dependence on large volumes of human preference labeling. Claude models are trained using Constitutional AI as a primary alignment technique.
Everything the model can "see" at the time of generating a response — including the system prompt (instructions from the developer), the conversation history, any retrieved documents, file uploads, and the current user message. Context is not "memory" in the human sense; it is the contents of the model's active working space for that interaction. When a conversation ends, the context is gone (unless the application explicitly stores and re-injects it). Context is the single most important engineering variable in applied AI systems. What you put in context, how you structure it, and how much you include dramatically shapes output quality — often more than model selection. See also: Context Window, Context Rot, Context Weights.
The measurable degradation in LLM output quality that occurs as input context grows longer — even when the total context is still well within the model's maximum context window. Coined and formalized by Chroma (Hong, Troynikov & Huber, July 2025), who tested 18 frontier models including GPT-4.1, Claude, and Gemini and found every single one performed worse as input length increased. The degradation is not gradual — models can hold near-perfect accuracy then drop precipitously once a length threshold is crossed, and that threshold varies unpredictably by model and task. Root cause: Attention is a finite resource. As more tokens compete for the model's attention, relevant information gets buried in "low-attention zones," especially in the middle of long contexts (the "lost in the middle" phenomenon, Liu et al., 2023). Practical implication: More context is not always better. Clean, curated, densely relevant context beats long, noisy context every time. For agentic systems, context rot is described as the primary failure mode — not model capability. See also: Context Window, Chunking, RAG.
The attention scores the model assigns to different tokens in the context window when generating a response. Not all tokens are weighted equally — through the attention mechanism, the model dynamically allocates more "focus" to tokens it deems relevant for the current prediction. These weights are not user-configurable; they emerge from training. Context weights explain why position matters in a prompt: tokens at the beginning and end of a context window tend to receive higher attention weights than tokens in the middle (the "primacy/recency effect" — the same cognitive bias humans exhibit). Understanding this helps practitioners place the most critical information at the start or end of prompts, rather than burying it in the middle.
The maximum amount of text (measured in tokens) that a model can process in a single inference call: its working memory limit. Inputs plus outputs must fit within this limit. 2020: GPT-3 had a 2,048-token context window (~1,500 words). 2024: Gemini 1.5 Pro introduced a 1 million-token context window. 2026: Claude Opus 4.6 supports 1 million tokens; GPT-4.1 supports 1 million tokens; Gemma 4 supports 128K–256K tokens. Larger context windows enable processing entire books, large codebases, or hours of conversation history. However, context rot research shows that larger windows do not guarantee better performance: models degrade with noise even within their limits. Think of the context window as the whiteboard the model can see; anything not on the whiteboard does not exist to the model in that moment.
A subfield of machine learning using neural networks with many layers (hence "deep"). Each layer learns progressively more abstract representations of the input data. Deep learning is the architectural foundation for virtually all modern AI — CNNs for images, transformers for language, diffusion models for generation. The "deep" distinguishes multi-layer networks from older shallow models (e.g., SVMs). Enabled at scale by GPU computing and large datasets.
A class of generative model that learns by training on the process of gradually adding noise to data (images, audio, video) until the data becomes pure noise, then learning to reverse that process — denoise — to reconstruct the original. At inference time, you start with random noise and the model iteratively denoises it, guided by a text prompt or other conditioning, until a coherent output emerges. Examples: Stable Diffusion, DALL-E 3, Midjourney, Sora (video), Flux. Strengths: Exceptional photorealistic image quality; strong compositional understanding; consistent fine detail; industry-dominant for images and video as of 2026. Weaknesses: Inherently iterative — many denoising steps are required (typically 20–50 passes), making it slow compared to single-pass generation; harder to benefit from inference-time scaling techniques like beam search (which work well for discrete, autoregressive models); text generation is not a natural fit. Competitive landscape: MIT/NVIDIA's HART hybrid system (2025) combines autoregressive high-level layout with diffusion refinement, generating images ~9× faster than pure diffusion at comparable quality. The path forward for media generation likely involves such hybrids. See HART paper.
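The "gradually adding noise" half of the process has a simple closed form in the widely used DDPM formulation, which is assumed here: a sample after t steps is sqrt(a_t) times the clean data plus sqrt(1 - a_t) times Gaussian noise, where a_t is the running product of (1 - beta) over the first t steps. The sketch below shows only this forward process; the linear beta schedule and its endpoints are assumptions taken from that formulation, and the learned denoising network is omitted.

```python
import math
import random

def linear_betas(num_steps, start=1e-4, end=0.02):
    """Per-step noise variances rising linearly (assumed DDPM-style schedule)."""
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]

def alpha_bar(betas, t):
    """Running product of (1 - beta) over the first t steps.

    Equals 1.0 at t = 0 (no noise) and approaches 0 as t grows.
    """
    product = 1.0
    for beta in betas[:t]:
        product *= 1.0 - beta
    return product

def add_noise(x0, t, betas, rng):
    """Jump directly to noising step t without simulating every step."""
    ab = alpha_bar(betas, t)
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for x in x0]

betas = linear_betas(1000)
rng = random.Random(0)
clean = [1.0, -1.0, 0.5]
slightly_noisy = add_noise(clean, 10, betas, rng)
almost_pure_noise = add_noise(clean, 1000, betas, rng)
```

Training teaches a network to predict the noise that was added at a given step; generation then runs that prediction in reverse, starting from pure noise.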
A numerical representation (a vector, i.e., a list of numbers) of a piece of text, image, or other data that captures its semantic meaning in a high-dimensional space. "The cat sat on the mat" and "A feline rested on the rug" would have very similar (close) embeddings because they mean nearly the same thing. Embeddings are how AI systems do semantic search: rather than matching keywords, they match meaning. They are the foundation of RAG systems: documents are embedded and stored in a vector database; at query time, the query is embedded and the most semantically similar documents are retrieved. Embedding models (e.g., OpenAI text-embedding-ada-002, Voyage AI's voyage series, BAAI/bge) are separate from generation models. OpenAI Embeddings Guide.
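"Close" embeddings are usually compared with cosine similarity, which measures the angle between two vectors. The sketch below uses invented 3-dimensional vectors purely for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (not produced by any real model).
cat_on_mat = [0.9, 0.1, 0.0]
feline_on_rug = [0.8, 0.2, 0.1]
quarterly_invoice = [0.0, 0.1, 0.9]

similar = cosine_similarity(cat_on_mat, feline_on_rug)
unrelated = cosine_similarity(cat_on_mat, quarterly_invoice)
```

A vector database performs this comparison (or an approximation of it) between the query embedding and every stored document embedding, returning the highest-scoring entries.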
The process of continuing to train a pre-trained foundation model on a smaller, domain-specific dataset to specialize its behavior. A base LLM trained on the internet can be fine-tuned on legal documents to become a legal assistant, or on medical literature to improve clinical accuracy. Fine-tuning modifies the model's weights, unlike RAG, which keeps the model static and adds information at inference time. Full fine-tuning updates all parameters, which is expensive. LoRA (Low-Rank Adaptation) is a widely used parameter-efficient fine-tuning method: it freezes the original weights and trains small low-rank matrices alongside them, so only a small fraction of the parameters (often around 1% or less) is updated, sharply reducing memory and compute requirements. Gemma 4's Apache 2.0 license explicitly allows commercial fine-tuning and redistribution of fine-tuned derivatives, a significant enterprise advantage.
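The low-rank idea behind LoRA can be shown with plain lists. In this sketch, following the formulation in the LoRA paper, the frozen weight matrix W is left untouched and the output becomes W x + (alpha / r) * B (A x), where A and B are the small trainable matrices of rank r. All matrix values and the layer dimensions in the parameter count are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Frozen base projection plus a scaled low-rank update.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r (both trainable).
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def count_params(d_out, d_in, r):
    """Return (full fine-tuning params, LoRA trainable params) for one layer."""
    return d_out * d_in, r * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]            # rank r = 1
B_zero = [[0.0], [0.0]]     # B starts at zero, so training begins from the base model
B_trained = [[1.0], [0.0]]  # hypothetical values after some training
x = [1.0, 1.0]

before = lora_forward(W, A, B_zero, x, alpha=2.0)
after = lora_forward(W, A, B_trained, x, alpha=2.0)
full, lora = count_params(4096, 4096, 8)
```

For a hypothetical 4096 x 4096 layer at rank 8, the trainable count drops from about 16.8 million to 65,536, roughly 0.4% of the original.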
The most capable AI models available at any given time: those operating at or beyond the boundary of what was previously possible. In 2026, frontier models include GPT-5 (OpenAI), Claude Opus 4.6 (Anthropic), Gemini Ultra (Google), and Grok 3 (xAI). The EU AI Act uses training compute as a proxy: general-purpose AI (GPAI) models trained with more than 10²⁵ FLOPs are presumed to pose systemic risk, which triggers the Act's most stringent GPAI obligations. Frontier models are typically accessible only through paid APIs; training them can cost hundreds of millions of dollars. They are distinct from open-weight models like Llama or Gemma, which may approach (but rarely match) frontier capability at a fraction of the cost. The frontier advances roughly every 6–12 months as new models are released. What was frontier yesterday is commodity today, a key dynamic driving the open-source ecosystem.
AI models that produce new content — text, images, audio, video, code, 3D models — rather than merely classifying or predicting from existing data. The term encompasses LLMs (text/code generation), diffusion models (image/video/audio generation), and multimodal models that handle multiple types simultaneously. Generative AI is the primary driver of the 2022–2026 AI boom. The global generative AI market was ~$37B in 2024 and is projected to reach $220B by 2030 (ABI Research, 2025). Key distinction: discriminative AI (e.g., spam filters, face recognition) classifies inputs into categories. Generative AI produces novel outputs.
When an AI model generates a plausible-sounding but factually incorrect output — fabricating citations, inventing statistics, misquoting sources, or stating false facts confidently. Not "lying" in any intentional sense; the model has no internal truth-checking mechanism — it generates what statistically follows from the context, and incorrect continuations can have very high probability. Examples: A legal AI cites a case that doesn't exist; a medical AI confidently states an incorrect drug dosage; a research assistant fabricates a journal citation. Mitigation: RAG (grounding responses in retrieved documents), tool use (web search integration), reasoning chains (extended thinking), and calibration training (teaching models to express uncertainty). Hallucination rates vary dramatically by domain, model generation, and task type. They have declined substantially with each model generation but remain non-zero across all frontier models.
The dominant open-source AI model distribution and collaboration platform, often called "the GitHub of AI." Hugging Face passed one million hosted model repositories in 2024, alongside hundreds of thousands of datasets, and provides tools (the Transformers library, the Datasets library, Inference API) that have become the de facto standard for open-source AI development. Companies release open models here (Meta's Llama, Google's Gemma, Mistral, DeepSeek, Microsoft's Phi). Researchers share datasets and fine-tunes. The platform enables the open-source ecosystem to function as a coherent community rather than fragmented individual releases. huggingface.co.
A retrieval technique where, instead of directly embedding the user's query for vector search, an LLM first generates a hypothetical answer to the query, and that hypothetical answer is then embedded for retrieval. The intuition: the hypothetical answer looks more like the documents that contain the real answer than the original question does, leading to better retrieval. For example, a legal query like "What are the elements of negligence?" produces a poor embedding for searching case law (too abstract), but a generated paragraph describing negligence law provides a richer semantic target. Introduced by Gao et al., 2022. Particularly effective for narrow or highly technical domains where query reformulation is difficult.
The process of using a trained AI model to generate outputs — as distinct from training, which is the process of building the model. Every time you send a prompt to ChatGPT or Claude, you are running inference. Inference is less computationally intensive than training but becomes significant at scale: OpenAI processes approximately 1 billion queries per day. Inference cost is typically measured in dollars per million tokens (input and output priced separately). Inference costs have fallen dramatically as optimization techniques (quantization, speculative decoding, KV cache sharing) improve efficiency. Inference-time scaling is the emerging paradigm of allocating more compute at inference time (letting the model "think longer") rather than always training larger models.
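Per-token pricing makes the cost of a request a short calculation. The prices in this sketch are hypothetical placeholders, not any provider's actual rates.

```python
def inference_cost_usd(input_tokens, output_tokens,
                       input_price_per_million, output_price_per_million):
    """Cost of one request when input and output tokens are priced separately."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost_per_request = inference_cost_usd(2_000, 500, 3.0, 15.0)
cost_per_million_requests = cost_per_request * 1_000_000
```

At these assumed rates a single request costs a little over one cent, but a million such requests cost about $13,500, which is why inference becomes significant at scale.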
A neural network trained on massive text corpora (and increasingly multimodal data) to predict and generate language. "Large" refers to the number of parameters — mathematical weights — typically ranging from 7 billion to over a trillion. LLMs are the core technology behind ChatGPT, Claude, Gemini, Llama, and most consumer AI products. They are autoregressive: they generate output one token at a time. Training an LLM involves: (1) pre-training on general data (unsupervised), (2) instruction tuning (supervised fine-tuning on task examples), and (3) alignment (RLHF or Constitutional AI to make behavior safe and helpful). LLMs are not databases — they do not "look up" facts; they generate statistically probable continuations, which is why they can hallucinate. Anthropic Research | OpenAI Research.
An open protocol introduced by Anthropic (November 2024) that standardizes how AI models communicate with external tools, data sources, and services. Before MCP, every AI application required custom API integrations for each tool it used (a calendar tool, a database, a code executor, etc.) — creating enormous duplication. MCP provides a universal "plug" standard: any tool that implements MCP can be used by any MCP-compatible AI agent. Analogous to USB standardizing physical device connections. As of 2026, MCP is rapidly becoming an industry standard, adopted by OpenAI, Google, and major enterprise software vendors. MCP enables the multi-agent, multi-tool agentic workflows described throughout this report. modelcontextprotocol.io.
A neural network architecture where the model is divided into many "expert" sub-networks, but only a small subset of them are activated for any given input. A routing layer decides which experts to use based on the input. Why it matters: MoE allows a model to have a very large total parameter count while only using a fraction of those parameters per token, dramatically reducing computation cost relative to a dense model of the same total size. GPT-4 (reportedly; OpenAI has not confirmed its architecture), Mixtral 8x7B, Gemma 4's 26B variant, and Meta's Llama 4 Scout all use MoE architectures. A monolithic (dense) model, the alternative, uses all parameters for every token. Trade-off: MoE models can be harder to fine-tune and require specialized infrastructure, but offer superior inference efficiency at scale. A "26B A4B" MoE model has 26B total parameters but only activates ~4B (A4B = "active 4 billion") per token.
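The routing step can be sketched as top-k selection over router scores. In this illustration the experts are trivial stand-in functions and the router weights are made-up numbers; renormalizing the gate weights with a softmax over only the selected experts is one common choice, assumed here rather than taken from any specific model.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, top_k=2):
    """Run only the top_k highest-scoring experts and blend their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    selected = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in selected])
    output = [0.0] * len(x)
    for gate, index in zip(gates, selected):
        expert_out = experts[index](x)  # unselected experts are never called
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, selected

# Four stand-in experts (a real expert is a feed-forward sub-network).
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [0.0 for _ in v],
]
router_weights = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]  # assumed values

output, selected = moe_forward([1.0, 0.0], router_weights, experts)
```

Here only experts 0 and 2 run for this input; the other two contribute no computation, which is where the per-token savings come from.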
A neural network that uses all its parameters for every inference call, as opposed to a Mixture of Experts model, which only activates a subset. "Dense" model is the more common technical synonym. Most early LLMs (GPT-3, original Claude, Llama 1/2) are monolithic. Monolithic models are simpler to train and fine-tune, but become computationally expensive at large parameter counts since every parameter is engaged for every token. Gemma 4's 31B model is a dense (monolithic) model; its 26B variant is MoE. The practical implication: a 31B dense model requires all 31B parameters to be loaded in GPU memory and used for every token; a 26B MoE model must also keep all 26B parameters in memory, but with only ~4B active per token it needs far less computation per token at inference.
AI models capable of processing and/or generating across multiple data types simultaneously — text, images, audio, video, documents, and code. A multimodal model can read a photograph and answer a question about it; listen to audio and transcribe and analyze it; look at a chart and describe the trends. Examples (2026): GPT-4o (text + vision + audio), Claude Opus 4.6 (text + vision + documents), Gemma 4 (text + vision + audio natively), Gemini Ultra (text + vision + audio + video). Multimodality is now the expectation for frontier models; single-modality text-only models are increasingly relegated to cost-optimized use cases. The integration is architectural: truly multimodal models process all modalities in a shared representation space, rather than bolting a vision encoder onto a text model post-hoc.
A computational architecture loosely inspired by biological neurons. Layers of mathematical nodes (neurons) transform input data through a series of learned transformations, with each layer extracting increasingly abstract representations. A neural network "learns" by adjusting the strength (weight) of connections between neurons based on feedback from training data. Shallow networks have 1–2 hidden layers. Deep networks (deep learning) have many layers — modern LLMs have dozens to hundreds of transformer layers. The mathematical operations are primarily matrix multiplications and non-linear activation functions, which is why GPUs (optimized for parallel matrix operations) dominate AI compute.
AI models whose underlying parameters (weights) are publicly released, allowing anyone to download, run, modify, fine-tune, and deploy them. "Open-weight" is technically more precise: the weights are open, but training data and full training code may not be. "Open-source" implies full transparency of the entire development stack. Key open-weight models (2026): Meta Llama 3.x (custom community license with some use restrictions), Google Gemma 4 (Apache 2.0), Mistral (Apache 2.0), DeepSeek R1 (MIT license), Microsoft Phi-4, Alibaba Qwen 3.5. Apache 2.0 and MIT are among the most permissive licenses in this class. Apache 2.0 (used by Gemma 4 and Mistral) is considered one of the cleanest licenses for commercial deployment: it permits use, modification, and redistribution subject only to attribution and notice requirements, includes an explicit patent grant, rests on well-established open-source legal precedent, and has no custom clauses requiring legal review. Open-weight models enable on-premises deployment (data never leaves your infrastructure), custom fine-tuning for proprietary data, and elimination of per-token API costs at scale. See also: Google Gemma official page.
The numerical values (weights and biases) that define a neural network's learned behavior — the product of training. A model with "70 billion parameters" has 70 billion individual numerical values that collectively encode everything the model has learned. Parameters are the model itself; they are what's stored, shared, and loaded into GPU memory. Larger parameter counts generally correlate with greater capability, but the relationship is non-linear and depends heavily on training quality, data diversity, and architecture efficiency. MoE architectures decouple total parameters (what's stored) from active parameters (what's computed per token), making parameter count an incomplete proxy for both capability and cost. OpenAI Scaling Laws.
A prompt is the input provided to an AI model: the text (or image, audio, etc.) that initiates a generation. Prompts include: the user's question or instruction, system prompts (developer instructions that shape model behavior), conversation history, and injected context (retrieved documents, tool outputs). Prompt engineering is the practice of crafting inputs that reliably elicit better outputs. Key techniques: (1) clear role and task specification; (2) few-shot examples (showing the model what good output looks like); (3) explicit instruction to reason step by step (CoT); (4) XML tags or structured formatting to separate instructions from content; (5) specifying output format explicitly. Effective prompt engineering can increase model performance on a given task substantially (by 20–50% in some cases) without any model change, though the size of the gain varies by task and model. See Anthropic Prompt Engineering Guide.
The process of reducing the numerical precision of a model's parameters (weights) from high-precision floating-point (e.g., 32-bit or 16-bit floats) to lower-precision representations (e.g., 8-bit integers, 4-bit integers). Why it matters: A 70B-parameter model stored in full 16-bit precision requires ~140GB of GPU memory — beyond consumer hardware. The same model quantized to 4-bit requires only ~35GB, making it runnable on a workstation. Quantization trades a small amount of output quality for dramatic reductions in memory and compute cost. Common formats: GGUF (for llama.cpp local inference), GPTQ, AWQ, NVFP4. Gemma 4 is released with an NVFP4 quantized checkpoint by NVIDIA, enabling high-throughput inference on modern GPUs. Quantization is what makes open-source AI runnable on consumer hardware without cloud APIs.
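The memory arithmetic above, and the basic idea of mapping floats to small integers, can be sketched as follows. The quantizer shown is a simple symmetric 8-bit scheme with one scale per weight list, chosen for clarity; it is an assumption for illustration and is not how GGUF, GPTQ, AWQ, or NVFP4 work internally.

```python
def model_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

memory_16bit = model_memory_gb(70e9, 16)
memory_4bit = model_memory_gb(70e9, 4)

weights = [0.5, -1.27, 0.01, 1.0]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is at most half the scale, which is the "small amount of output quality" traded for the memory reduction.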
A technique where a complex user query is broken down into multiple simpler sub-queries before retrieval. For example, "How has Tesla's gross margin changed since they began selling the Cybertruck, and how does that compare to legacy automakers?" might be decomposed into: (1) Tesla gross margin 2022–2026, (2) Cybertruck launch date and initial sales volume, (3) gross margins for Ford, GM, Stellantis 2022–2026. Each sub-query is retrieved independently, and the results are combined before the LLM synthesizes a final answer. Query decomposition dramatically improves performance on multi-faceted questions that no single retrieved chunk can answer completely. Also known as "sub-question generation" in some frameworks. It is related to, but distinct from, "step-back prompting," which first asks a more general question rather than splitting the query into parts.
A technique that combines information retrieval with language model generation. Rather than relying solely on the model's trained knowledge (which has a cutoff date and can hallucinate specific facts), RAG first retrieves relevant documents from a database and injects them into the model's context before generating a response. Workflow: (1) User asks a question → (2) Question is embedded and used to query a vector database → (3) Most semantically similar document chunks are retrieved → (4) Retrieved chunks + original question are sent to the LLM → (5) LLM generates a response grounded in the retrieved information. RAG is how enterprises deploy LLMs on private, proprietary data without expensive fine-tuning. It also provides citations, enables real-time knowledge (by updating the database without retraining the model), and reduces hallucination on factual queries. See also: Chunking, Embedding, HyDE, ColBERT, Chain of Note.
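Steps (1) through (4) of the workflow above can be sketched end to end with toy components. Everything here is a stand-in: the word-count "embedding" replaces a real embedding model, the in-memory list replaces a vector database, the three document chunks are invented, and step (5), the LLM call, is left out because it depends on the provider.

```python
import math
import re
from collections import Counter

CHUNKS = [  # invented knowledge-base chunks
    "Refund policy: customers may request a refund within 30 days of purchase.",
    "Shipping: orders ship within 2 business days.",
    "Support hours are 9am to 5pm on weekdays.",
]

def embed(text):
    """Toy embedding: word counts (a real system uses an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a)  # missing terms count as 0
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Steps 2-3: embed the question and return the k most similar chunks."""
    query_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Step 4: combine retrieved chunks and the question into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

question = "How many days do customers have to request a refund?"
top_chunks = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top_chunks)
```

Updating the knowledge base means editing `CHUNKS`, not retraining anything, which is the property that makes RAG attractive for fast-changing data.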
In RAG pipelines, the process of taking an initial set of retrieved document chunks (from vector search) and re-ordering them by relevance to the user's query using a more powerful (but slower) cross-encoder model, before passing the top results to the LLM. Vector search retrieves quickly but imperfectly; re-ranking refines the selection. A typical pipeline: retrieve top-100 candidates with fast vector search → re-rank with cross-encoder → pass top-5 to LLM. Re-ranking is a high-leverage optimization that substantially improves RAG answer quality in production systems. Common re-ranking models: Cohere Rerank, BGE-Reranker, cross-encoders via SBERT.
A class of LLM trained to generate explicit intermediate reasoning steps (a "reasoning chain" or "chain of thought") before producing a final answer, rather than answering directly. This "thinking before speaking" dramatically improves accuracy on complex logical, mathematical, coding, and multi-step planning tasks. Examples: OpenAI o1/o3/o4-mini, Claude Opus 4.6 in "extended thinking" mode, DeepSeek R1. Reasoning models typically use more compute at inference time (more tokens generated per query), making them slower and more expensive than standard models — but dramatically more accurate on hard problems. The trade-off: use a fast standard LLM for simple tasks; use a reasoning model for complex ones. Inference-time scaling is the principle that allocating more computation at inference time (rather than training larger models) yields quality improvements — reasoning models are the primary implementation of this principle. OpenAI o3 | Claude Extended Thinking.
The process of finding and fetching relevant information from a knowledge store in response to a query. In AI systems, retrieval is the mechanism that grounds model responses in specific documents, databases, or real-time data sources. Lexical retrieval (keyword matching, BM25): finds documents containing the same words as the query. Dense/semantic retrieval (embedding-based): finds documents with semantically similar meaning, even if exact words differ. Hybrid retrieval: combines both for broader coverage and higher precision. The retrieval quality is the single largest determinant of RAG system performance — garbage retrieved → garbage generated, regardless of how powerful the LLM is.
The training technique that transforms a base pretrained language model into an assistant that is helpful, honest, and safe. Process: (1) humans rank pairs of model outputs from best to worst; (2) a "reward model" is trained on these preferences; (3) the LLM is then trained via reinforcement learning to maximize the reward model's score. RLHF is what turned GPT-3 (a raw completion model that would write anything) into ChatGPT (an assistant that follows instructions and declines harmful requests). Limitations: reward models can be "hacked" by the LLM finding outputs that score high but don't reflect genuine quality; human raters have inconsistencies; RLHF doesn't transfer well to domains where humans can't evaluate outputs (e.g., highly technical code). Constitutional AI (Anthropic's approach) uses AI feedback (RLAIF) in place of or alongside human feedback to scale alignment. See OpenAI RLHF research.
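Step (2), training the reward model on human preferences, typically minimizes a pairwise loss: the reward for the preferred output should exceed the reward for the rejected one. The sketch below uses the Bradley-Terry style loss, -log(sigmoid(r_chosen - r_rejected)), which is a common choice and an assumption here, since the text does not name a specific loss. The reward values are made-up scalars standing in for reward model outputs.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Written as log(1 + exp(-margin)), which is mathematically identical.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# Hypothetical reward model scores for one human-ranked pair.
agrees_with_human = preference_loss(2.0, 0.0)     # preferred output scored higher
undecided = preference_loss(0.0, 0.0)             # no preference expressed
disagrees_with_human = preference_loss(0.0, 2.0)  # rejected output scored higher
```

The loss shrinks as the reward model agrees more strongly with the human ranking and grows when it disagrees, so gradient descent pushes the scores toward human preferences.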
Supervised learning: The model is trained on labeled examples — input/output pairs where the "correct" answer is provided. The model learns to predict the label from the input. Examples: spam classifier (email → spam/not-spam), image classifier (photo → dog/cat), speech recognition (audio → transcript). Requires human labeling effort. Used extensively in fine-tuning LLMs with instruction/response pairs, and in training reward models for RLHF.
Unsupervised learning: The model is trained on raw, unlabeled data and must discover patterns, structure, or representations on its own, without being told what to find. LLM pre-training is technically a form of self-supervised learning (a subtype of unsupervised learning): the model predicts the next token, using the text itself as the label. Diffusion model training is also self-supervised: noise is added to the training data by a known process, and the model learns to reverse it.
Self-supervised learning is the dominant paradigm for modern foundation models — it scales to internet-scale data without human labeling of each example. Supervised fine-tuning (SFT) is then applied on top to specialize behavior using curated, labeled examples. Practical implication: Most "training data" concerns in AI relate to the unsupervised pre-training phase (web scraping, copyright), while alignment quality concerns relate to the supervised/RLHF fine-tuning phase.
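To show how next-token prediction derives labels from the data itself, here is a minimal sketch. Whitespace splitting stands in for a real tokenizer, and `next_token_pairs` and `context_size` are hypothetical names introduced only for illustration.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, label) training pairs from an unlabeled token sequence."""
    pairs = []
    for i in range(1, len(tokens)):
        # The label is simply the token that follows the context in the text.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


# Whitespace splitting is an assumption; real systems use a learned tokenizer.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, label in pairs:
    print(context, "->", label)
```

No human wrote any of the labels: every target is read directly off the raw text, which is why this approach scales to very large corpora.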
A hypothetical AI that surpasses human cognitive ability in every intellectually relevant domain — not merely matching human performance (AGI) but dramatically exceeding it, including the ability to improve its own intelligence recursively. First systematically analyzed by Nick Bostrom in Superintelligence (2014). METR's February 2026 research models this arriving on "fast" AI timelines as early as mid-2031 under conservative compute growth assumptions. The strategic concern: a superintelligent system would by definition be better at self-improvement than humans are at constraining it, creating an alignment challenge that must be solved before the system reaches that capability level — not after. Sam Altman (OpenAI) has described superintelligence as potentially arriving within a few years. No superintelligent system exists as of April 2026. This term is included because it shapes frontier AI strategy and investment decisions regardless of when or whether it arrives.
A parameter (typically 0–2) that controls the randomness/creativity of a model's output. At temperature 0, the model always picks the single most probable next token: consistent and, in principle, deterministic (hosted APIs can still show small run-to-run variation), but potentially repetitive and literal. At high temperature (e.g., 1.5), the model samples more broadly from the probability distribution: more creative, varied, and sometimes surprising, but also more prone to errors and inconsistency. Practical guidance: Use low temperature (0–0.3) for factual Q&A, code generation, structured data extraction, and tasks where correctness matters above all. Use medium temperature (0.5–0.8) for general conversation. Use higher temperature (1.0+) for creative writing, brainstorming, or poetry. Temperature does not affect what the model "knows", only how it samples from what it knows.
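The effect of temperature can be sketched in a few lines. The example assumes the standard formulation in which logits are divided by the temperature before a softmax, with temperature 0 treated as greedy selection; the logits and function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; handle 0 as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Pick a token index: greedy at temperature 0, random sampling otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [1.0, 3.0, 2.0]  # hypothetical scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.5)
print(low, high)
```

At 0.2 nearly all probability mass sits on the top-scoring token, while at 1.5 the alternatives receive a meaningful share, which is the source of the added variety and the added risk of errors.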
A token is the fundamental unit of text that LLMs process, roughly corresponding to ¾ of a word in English (a common approximation: 100 tokens ≈ 75 words; 1,000 tokens ≈ 750 words). Tokens are not always full words: "unbelievable" might be 3 tokens (un / believ / able). Common words are typically 1 token; rare words may be multiple tokens. Tokenization is the process of splitting raw text into tokens before the model processes it. This is done by a tokenizer whose vocabulary is learned from a text corpus ahead of model training (e.g., with byte-pair encoding, BPE). Why tokens matter: API pricing is per token (both input and output); context windows are measured in tokens; the model sees tokens, not characters or words. In April 2026, frontier LLM pricing ranges from $0.10 to $5 per million input tokens. OpenAI Tokenizer Tool lets you visualize how text is tokenized.
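For budgeting purposes, the rule of thumb and the per-million pricing combine into simple arithmetic. The sketch below assumes one token is about ¾ of an English word; the function names are hypothetical, and real token counts depend on the specific tokenizer.

```python
def estimate_tokens(word_count, words_per_token=0.75):
    """Estimate token count from an English word count (rule of thumb only)."""
    return round(word_count / words_per_token)


def estimate_cost_usd(token_count, usd_per_million_tokens):
    """Cost of processing token_count tokens at a per-million-token price."""
    return token_count / 1_000_000 * usd_per_million_tokens


# A 750-word prompt priced at the two ends of the range quoted above.
prompt_tokens = estimate_tokens(750)
cheapest = estimate_cost_usd(prompt_tokens, 0.10)
priciest = estimate_cost_usd(prompt_tokens, 5.00)
print(prompt_tokens, cheapest, priciest)
```

Output tokens are usually billed separately, so a full estimate would add a second term for the expected response length at the output price.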
The neural network architecture that underlies virtually all modern LLMs and many other AI systems. Introduced in the landmark 2017 paper "Attention Is All You Need" by researchers at Google. The key innovation: replacing recurrent processing (which handled text one token at a time, limiting parallelism and making long-range dependencies hard to learn) with attention mechanisms that process all tokens in parallel, letting each token attend to any other token regardless of distance. Transformers enabled scaling to massive datasets and model sizes in ways that were not feasible with prior architectures. GPT, Claude, Gemini, Llama, Gemma, Mistral, and DeepSeek are all transformer-based models. Non-transformer architectures (Mamba, RWKV) are being researched as potentially more efficient alternatives for very long contexts, but as of 2026, transformers remain dominant.
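The attention mechanism at the core of the architecture can be sketched in plain Python. This is a single-head, unmasked version of scaled dot-product attention with no learned projection matrices, written for readability rather than speed; the input vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, whatever its position.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Hypothetical toy inputs: identical keys give uniform attention weights.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(result)
```

Each query's computation is independent of the others, so in practice the loop is replaced by batched matrix multiplications, which is what makes the parallel processing described above possible.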
See: Supervised vs. Unsupervised Learning above. Unsupervised learning in brief: training on unlabeled data to find structure, patterns, or representations. The pre-training phase of modern LLMs (predicting the next token on internet text) is technically self-supervised learning — a variant of unsupervised learning where the labels are derived from the data itself. K-means clustering and PCA are classical unsupervised algorithms. In modern AI, self-supervised pre-training has replaced both classical supervised and unsupervised approaches for most foundation model development.
Vectorization is the process of converting data (text, images, audio) into dense numerical vectors (embeddings) that represent their semantic content. Vector databases (e.g., Pinecone, Weaviate, Chroma, pgvector, Qdrant) are optimized storage and retrieval systems for these high-dimensional vectors. They support "approximate nearest neighbor" (ANN) search — finding the vectors (and thus the documents) most semantically similar to a query vector, at scale. In a RAG pipeline: documents are vectorized (embedded) → stored in a vector database → at query time, the query is embedded → top-K most similar vectors are retrieved → those documents are passed to the LLM. Vector databases are a critical piece of modern AI infrastructure, enabling semantic search over millions of documents in milliseconds. Chroma Docs | Pinecone Learning.
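The query-time half of that pipeline can be illustrated with a brute-force search. The sketch below does exact cosine-similarity ranking over an in-memory dictionary, whereas a real vector database would use an approximate nearest neighbor index; the document IDs and three-dimensional embeddings are hypothetical stand-ins for real embedding output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in store.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
store = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.0],
    "returns_faq": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
results = top_k(query, store, k=2)
print(results)
```

Exact search like this compares the query against every stored vector, which is fine for small collections; ANN indexes trade a little recall for much lower latency at scale.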
The practice of anchoring AI outputs to verifiable sources rather than relying solely on the model's parametric (trained) knowledge. Techniques include RAG (injecting retrieved documents), tool use (web search integration), and citation requirements. A "grounded" response is one where every key claim can be traced to a specific source. Grounding is the primary engineering countermeasure to hallucination. Without grounding, the model answers from memory (which can be wrong, outdated, or hallucinated). With grounding, the model answers from retrieved evidence (which can be verified). Grounding is particularly critical in legal, medical, financial, and scientific applications where factual accuracy is non-negotiable.
An AI system that goes beyond single-turn question answering to autonomously plan and execute multi-step tasks — browsing the web, writing and running code, sending emails, managing calendars, querying databases — without requiring human instruction at each step. The user provides a goal; the agent determines the steps and executes them. Single-agent systems use one LLM with access to tools. Multi-agent systems orchestrate multiple specialized AI models — one for research, one for writing, one for code review — coordinated by a supervisor model. The Anthropic 2026 Agentic Coding Report (cited throughout the main report) documents agents completing entire software features in hours. Key limitations (2026): "Fully delegated" tasks still represent only 0–20% of developer workflows; human oversight remains essential for high-stakes work. Context rot is the primary failure mode in long agentic task chains. See also: MCP (the standard enabling agent-tool communication).