Context engineering is the systematic approach to structuring information provided to large language models to improve output quality, reduce costs, and enable reliable performance. While prompt engineering focuses on how you ask the model to do something, context engineering determines what information the model has access to when responding.
Why Context Engineering Matters
Foundation models like GPT-4 and Claude have context windows of 128K-200K tokens (roughly 100,000-150,000 words). You can fit entire codebases, documentation sets, or conversation histories in a single request. But filling the context window doesn't guarantee good results.
Models perform better with structured, relevant context. Random or excessive information degrades output quality through:
Noise dilution: Important details get lost in irrelevant information. A legal AI searching 500 pages of contracts performs worse than one searching 15 relevant pages, because the model weights all provided information, relevant or not.
Attention limits: Transformer models distribute attention across the context window. More tokens mean less attention per token. Critical information in the middle of large contexts gets lower weight than information at the beginning or end.
Cost explosion: Every token in the context costs money. A support bot that sends 10,000 tokens of documentation per query when 800 tokens would suffice wastes 92% of its inference budget.
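The 92% figure is just the ratio of necessary to sent tokens; a two-line sketch makes the bookkeeping explicit:

```python
def wasted_fraction(tokens_sent: int, tokens_needed: int) -> float:
    """Fraction of inference spend going to unnecessary context tokens."""
    return 1 - tokens_needed / tokens_sent

# Sending 10,000 tokens when 800 would suffice wastes 92% of the budget.
waste = wasted_fraction(10_000, 800)
```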
Context engineering solves these problems through retrieval strategies, compression techniques, and information architecture.
The Three Layers of Context Engineering
Layer 1: Retrieval (what to include)
Semantic search: Use embedding models to find contextually similar information. Instead of keyword matching, semantic search understands meaning. A query about "refund policy" retrieves relevant sections even if they use terms like "money-back guarantee" or "return process."
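Under the hood, "contextually similar" is usually cosine similarity between embedding vectors. A minimal sketch with made-up 3-dimensional vectors (real embedding models output hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- the numbers are illustrative, not from a real model.
query = [0.9, 0.1, 0.3]     # "refund policy"
doc_a = [0.85, 0.15, 0.35]  # "money-back guarantee" -- close in meaning
doc_b = [0.1, 0.9, 0.2]     # "shipping times" -- unrelated
```

Despite sharing no keywords with the query, `doc_a` scores higher than `doc_b`, which is exactly why semantic search retrieves the "money-back guarantee" section.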
Hybrid search: Combine semantic similarity with keyword matching and metadata filters. Product documentation might filter by version number, then rank by semantic relevance, then keyword boost exact technical terms.
Contextual retrieval: Consider conversation history, user context, and task type when selecting information. A question from an enterprise customer retrieves different documentation than the same question from a free user.
Layer 2: Structure (how to organize)
Hierarchical formatting: Present information in order of importance. Start with direct answers, then supporting details, then edge cases. Models weight context at the start (and end) of the window more heavily than the middle.
Chunking strategy: Break large documents into coherent sections. Chunk size matters: too small (100 tokens) loses context, too large (2,000 tokens) includes irrelevant information. Optimal chunks are 400-800 tokens with semantic boundaries (paragraphs, sections).
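A minimal greedy chunker along these lines, packing whole paragraphs up to a token budget (the 4-characters-per-token estimate is a rough assumption; a real pipeline would use the model's tokenizer):

```python
def chunk_paragraphs(text: str, max_tokens: int = 600) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most ~max_tokens.

    Splitting on paragraph boundaries keeps chunks semantically coherent:
    a chunk never cuts mid-thought.
    """
    est = lambda s: len(s) // 4  # crude token estimate
    chunks, current, current_tokens = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and current_tokens + est(para) > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += est(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```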
Metadata enrichment: Tag context with source, confidence, date, or category. This helps models assess relevance and cite sources accurately.
Layer 3: Compression (reducing token count)
Summary generation: Pre-process large documents into summaries. Include full text only when the model needs granular details.
Template-based extraction: For structured data, extract key fields into templates rather than including full documents. Contract analysis might extract parties, dates, values, and obligations rather than sending 50-page PDFs.
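A sketch of the extraction target, with hypothetical field names; an LLM or parser would populate it from the source document:

```python
from dataclasses import dataclass, asdict

@dataclass
class ContractSummary:
    """Key fields extracted from a contract, sent instead of the full PDF."""
    parties: list
    effective_date: str
    total_value_usd: float
    obligations: list

# Illustrative values -- in practice these come from an extraction step.
summary = ContractSummary(
    parties=["Acme Corp", "Beta LLC"],
    effective_date="2024-03-01",
    total_value_usd=250_000.0,
    obligations=["Deliver software by Q3", "Provide 12 months of support"],
)
# A few hundred tokens of structured fields replace a 50-page document.
context = asdict(summary)
```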
Prompt caching: Anthropic's Claude offers prompt caching, where repeated context (system instructions, knowledge bases) is cached and subsequent reads of it cost roughly 90% less. Structure context so static portions are cacheable and only dynamic portions (user queries, recent history) change per request.
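A sketch of the request shape this implies, modeled on Anthropic's documented `cache_control` blocks (verify the exact field names against the current API reference): static knowledge first and marked cacheable, the dynamic query last.

```python
STATIC_KNOWLEDGE = "System instructions and knowledge base text..."  # unchanged per request

def build_request(user_query: str) -> dict:
    """Static, cacheable context first; only the user query varies per request."""
    return {
        "system": [
            {
                "type": "text",
                "text": STATIC_KNOWLEDGE,
                # Marks everything up to this point as cacheable across requests.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_query}],
    }
```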
Context Window Management
Foundation models have maximum context lengths. Managing this budget is critical:
Token allocation strategy:
- System instructions: 200-500 tokens (static, cacheable)
- Retrieved knowledge: 2,000-5,000 tokens (dynamic, optimized through retrieval)
- Conversation history: 1,000-3,000 tokens (compressed or summarized for long conversations)
- User query: 100-500 tokens
- Output budget: 500-2,000 tokens
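The allocation above is easy to sanity-check in code; even the worst case (11,000 tokens) leaves ample headroom in a 128K window:

```python
# (min, max) token ranges from the allocation strategy above.
BUDGET = {
    "system_instructions": (200, 500),
    "retrieved_knowledge": (2_000, 5_000),
    "conversation_history": (1_000, 3_000),
    "user_query": (100, 500),
    "output": (500, 2_000),
}

def fits(context_window: int) -> bool:
    """Does the worst-case allocation fit inside the model's window?"""
    worst_case = sum(hi for _, hi in BUDGET.values())
    return worst_case <= context_window

worst = sum(hi for _, hi in BUDGET.values())  # 11,000 tokens at the top end
```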
Handling context overflow: When context exceeds limits, use progressive summarization. Compress older conversation turns into summaries while keeping recent turns verbatim. Legal AI might summarize cases from 6+ months ago while preserving full text of recent precedents.
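A minimal sketch of progressive summarization, with a stand-in `summarize` function where a real system would call a summarization model:

```python
def compress_history(turns: list, keep_recent: int = 4,
                     summarize=lambda text: f"[summary of {len(text.split())} words]") -> list:
    """Keep the last `keep_recent` turns verbatim; fold older turns into one summary.

    The default `summarize` is a placeholder -- swap in a model call in practice.
    """
    if len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(" ".join(older))] + recent
```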
Context freshness: Recent information is usually more relevant. Time-decay weighting ranks newer documents higher in retrieval unless explicitly searching historical information.
Retrieval-Augmented Generation (RAG)
RAG is the most common context engineering pattern: retrieve relevant information from a knowledge base and inject it into the prompt. The architecture:
- User submits a query
- Query is embedded into a vector (mathematical representation of meaning)
- Vector database returns semantically similar documents
- Retrieved documents are structured and inserted into the prompt
- Model generates a response using both the query and retrieved context
- Response includes citations to source documents
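The six steps above can be sketched as one function, with `embed`, `vector_db`, and `generate` as stand-ins for real components (an embedding model, a vector database client, and an LLM call):

```python
def rag_answer(query: str, embed, vector_db, generate, top_k: int = 3) -> str:
    """Minimal RAG pipeline: embed, retrieve, structure, generate."""
    query_vec = embed(query)                   # step 2: embed the query
    docs = vector_db.search(query_vec, top_k)  # step 3: semantic retrieval
    context = "\n\n".join(                     # step 4: structure + insert context
        f"[{d['source']}] {d['text']}" for d in docs
    )
    prompt = (
        "Answer using only the context below. Cite sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)                    # steps 5-6: generate with citations
```

The source tags (`[faq]`, `[manual]`, ...) in step 4 are what lets the model cite where each claim came from in step 6.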
RAG vs. fine-tuning trade-offs:
- RAG: Works for knowledge that changes frequently, transparent sourcing, easier to debug, scales to large knowledge bases
- Fine-tuning: Better for style adaptation, faster inference (no retrieval overhead), works when knowledge fits in model weights
Most production systems use both: fine-tune for domain language and reasoning patterns, use RAG for specific factual knowledge.
Common Context Engineering Patterns
Few-shot learning: Include 3-5 examples of input-output pairs in the context. This teaches the model desired behavior without fine-tuning. Customer support classification might show 5 examples of tickets correctly routed to billing, technical, or sales teams.
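A sketch of assembling such a prompt from hypothetical labeled tickets:

```python
EXAMPLES = [  # illustrative input-output pairs
    ("I was charged twice this month", "billing"),
    ("The app crashes on startup", "technical"),
    ("Can I get a quote for 50 seats?", "sales"),
]

def few_shot_prompt(ticket: str) -> str:
    """Prepend labeled examples so the model infers the routing scheme."""
    shots = "\n".join(f"Ticket: {t}\nTeam: {label}" for t, label in EXAMPLES)
    return f"{shots}\nTicket: {ticket}\nTeam:"
```

Ending the prompt with the bare `Team:` label nudges the model to complete it with one of the team names demonstrated above.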
Chain-of-thought scaffolding: Provide reasoning steps in the context. Instead of asking "Analyze this contract," show an example analysis that breaks the task into steps: identify parties, extract obligations, highlight risks, assess compliance.
Constraint injection: Embed rules directly in context. "Only use information from the provided documents. If unsure, say 'I don't know.' Cite page numbers for all claims." This reduces hallucination risk.
Dynamic context assembly: Adjust context based on query complexity. Simple questions get minimal context. Complex multi-part questions trigger retrieval from multiple sources with hierarchical organization.
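A minimal sketch using a crude word-count heuristic as the complexity signal (production systems would typically use a classifier):

```python
def assemble_context(query: str, retrieve) -> str:
    """Scale retrieval depth with query complexity.

    `retrieve(query, top_k)` is a stand-in for a real retrieval call.
    """
    words = len(query.split())
    if words <= 8:                 # short, simple question: minimal context
        top_k = 2
    elif query.count("?") > 1:     # multi-part question: retrieve broadly
        top_k = 8
    else:
        top_k = 4
    return "\n\n".join(retrieve(query, top_k))
```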
Measuring Context Engineering Effectiveness
Track these metrics to optimize context strategies:
Quality metrics:
- Hallucination rate (claims not supported by provided context)
- Citation accuracy (correct source attribution)
- Answer completeness (addresses all parts of multi-part questions)
Cost metrics:
- Average tokens per query (lower is better, if quality holds)
- Cache hit rate (for prompt caching systems)
- Retrieval latency (time to fetch and structure context)
Relevance metrics:
- Retrieval precision (% of retrieved documents actually used in response)
- Context utilization (which parts of context the model attends to)
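Retrieval precision is straightforward to compute once you log which retrieved documents the response actually cited:

```python
def retrieval_precision(retrieved_ids: list, used_ids: set) -> float:
    """Fraction of retrieved documents the response actually drew on."""
    if not retrieved_ids:
        return 0.0
    used = sum(1 for doc_id in retrieved_ids if doc_id in used_ids)
    return used / len(retrieved_ids)

# 5 documents retrieved, citations show only 2 were used -> precision 0.4
precision = retrieval_precision(["d1", "d2", "d3", "d4", "d5"], {"d2", "d5"})
```

A persistently low precision suggests `top_k` is too high or the retrieval query needs refinement.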
Compare these against baselines: naive full-document retrieval, keyword search, or zero-shot prompting without context.
Advanced Techniques
Conditional context expansion: Start with minimal context. If the model's response lacks detail, automatically retrieve more information and regenerate. This saves tokens on simple queries while handling complex ones.
Multi-hop retrieval: For complex questions requiring multiple sources, retrieve iteratively. First retrieval answers part of the question, then use that partial answer to inform second retrieval. Legal research might first find relevant statutes, then use those to find case precedents.
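A sketch of the iterative loop, where `retrieve` (a stand-in for a real retrieval step) returns both documents and a follow-up query derived from them:

```python
def multi_hop_retrieve(question: str, retrieve, hops: int = 2) -> list:
    """Iterative retrieval: each hop's results reformulate the next query."""
    query, collected = question, []
    for _ in range(hops):
        docs, follow_up = retrieve(query)
        collected.extend(docs)
        if not follow_up:  # nothing left to chase
            break
        query = follow_up
    return collected
```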
Context pruning: Remove low-value sentences from retrieved documents using secondary models. A summarization model identifies which sentences contribute to answering the query and discards the rest.
Negative context: Explicitly tell the model what not to include. "Do not use information from archived versions. Do not mention deprecated features." This prevents contamination from outdated knowledge.
Tools and Implementation
Vector databases: Pinecone, Weaviate, Qdrant, Chroma for semantic search and retrieval
Embedding models: OpenAI text-embedding-3-small, Cohere embed-v3, or open-source Sentence Transformers
Chunking libraries: LangChain, LlamaIndex for document processing and context assembly
Evaluation frameworks: RAGAS, TruLens for measuring retrieval quality and context effectiveness
Most AI product teams invest more in context engineering than prompt engineering. A well-engineered context system enables mediocre prompts to produce excellent results. Poor context makes even sophisticated prompts fail.