Definition
The context window is the maximum number of tokens that a large language model can process in a single interaction. This limit encompasses everything the model sees: the system prompt, conversation history, any retrieved documents or context, the user query, and the generated response. A token is roughly three-quarters of a word in English, so a 128,000-token context window can handle approximately 96,000 words of combined input and output.
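That back-of-the-envelope arithmetic can be expressed as a small helper. This is only a sketch: the 0.75 words-per-token ratio is a rough heuristic for English prose, and actual counts depend on the specific model's tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # rough heuristic for English prose, not an exact ratio

def estimated_words(context_window_tokens: int) -> int:
    """Approximate how many English words fit in a given token budget."""
    return int(context_window_tokens * WORDS_PER_TOKEN)

def estimated_tokens(word_count: int) -> int:
    """Approximate how many tokens a passage of English prose will use."""
    return int(word_count / WORDS_PER_TOKEN)

print(estimated_words(128_000))  # -> 96000 words of combined input and output
print(estimated_tokens(10_000))  # -> 13333 tokens for a 10,000-word document
```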
Context windows vary significantly across models. Earlier models had windows of 4,000 to 8,000 tokens, while modern models offer 128,000 to over 1,000,000 tokens. However, larger context windows come with trade-offs: they increase latency, cost more per query, and research shows that model performance can degrade for information placed in the middle of very long contexts (the "lost in the middle" phenomenon).
Why It Matters for Product Managers
Context window size is one of the most practically important constraints PMs face when designing AI features. It determines whether a chatbot can remember an entire conversation, whether a RAG system can include enough document context for accurate answers, and whether a summarization feature can process an entire document in one pass. Understanding these limits helps PMs design features that work reliably rather than failing unpredictably when inputs exceed the window.
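As an illustration of that kind of pre-flight discipline, the sketch below checks whether an input fits before a request is ever sent. The limits and the word-count heuristic are assumptions for the example, not values from any particular API.

```python
CONTEXT_WINDOW = 128_000      # assumed model limit for this sketch
RESERVED_FOR_OUTPUT = 4_000   # tokens held back for the generated response

def estimate_tokens(text: str) -> int:
    """Rough word-count heuristic; a real system would use the model's tokenizer."""
    return int(len(text.split()) / 0.75)

def fits_in_window(prompt_text: str) -> bool:
    """Pre-flight check: does the prompt leave room for the response?"""
    return estimate_tokens(prompt_text) <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

# Branch on the check up front (chunk, summarize, or trim the input) instead of
# discovering the limit as an API error in production.
oversized_input = "word " * 200_000
if not fits_in_window(oversized_input):
    print("Input exceeds the context budget; chunk or summarize before sending.")
```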
From a cost perspective, context window usage directly drives per-query expense, since most LLM APIs charge per token of input and output. PMs must balance the desire for more context (which generally improves answer quality) against the cost of processing that context at scale. This trade-off shapes decisions about chunking strategies, conversation pruning, and which information to include or exclude from each request.
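A minimal sketch of one such strategy, conversation pruning, is shown below. It assumes messages are simple (role, text) pairs and uses the same word-count heuristic in place of a real tokenizer; a production system would count tokens with the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude word-count heuristic standing in for a real tokenizer."""
    return int(len(text.split()) / 0.75)

def prune_history(system_prompt: str, messages: list[tuple[str, str]],
                  budget_tokens: int) -> list[tuple[str, str]]:
    """Keep the system prompt plus the most recent turns that fit the budget."""
    remaining = budget_tokens - estimate_tokens(system_prompt)
    kept = []
    for role, text in reversed(messages):   # walk newest to oldest
        cost = estimate_tokens(text)
        if cost > remaining:
            break                            # everything older is dropped
        kept.append((role, text))
        remaining -= cost
    return [("system", system_prompt)] + list(reversed(kept))
```

Dropping the oldest turns first is only one pruning policy; summarizing older turns or keeping pinned messages are common alternatives, and the right choice depends on the product.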
How It Works in Practice
Common Pitfalls
Related Concepts
Context window size is a fundamental constraint of every Large Language Model (LLM), directly shaping Prompt Engineering decisions about what to include in each request. Retrieval-Augmented Generation (RAG) architectures must fit retrieved documents within these token limits to ground model outputs effectively.
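As a rough illustration of that constraint, the sketch below fills a fixed token budget with retrieved chunks. It assumes the chunks arrive already ranked by relevance and again uses a word-count estimate in place of a real tokenizer; the function names are placeholders, not part of any library.

```python
def estimate_tokens(text: str) -> int:
    """Crude word-count heuristic standing in for a real tokenizer."""
    return int(len(text.split()) / 0.75)

def select_chunks(ranked_chunks: list[str], prompt_tokens: int,
                  context_window: int, output_reserve: int) -> list[str]:
    """Add retrieved chunks, most relevant first, until the token budget is spent."""
    budget = context_window - prompt_tokens - output_reserve
    selected = []
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            continue          # this chunk no longer fits; try smaller ones
        selected.append(chunk)
        budget -= cost
    return selected
```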