Context

Context is the model's working memory — a fixed-size window that holds everything the model can "see" at once. Think of it as a desk with limited surface area: you can spread out papers, but eventually you run out of space and need to put some away. What's on the desk right now is all the model has to work with.

How It Works

Every large language model has a context window measured in tokens — typically ranging from 4K to 128K or more. Everything the model needs goes inside this window: the system prompt that sets its behavior, the full conversation history, any documents or code you include, and room for the model's own response.
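To make the accounting concrete, here is a minimal sketch of a token budget check. It assumes a crude estimate of about four characters per token, which is only a stand-in for a real tokenizer, and the names `estimate_tokens` and `ContextBudget` are hypothetical, introduced just for illustration.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4) if text else 0


@dataclass
class ContextBudget:
    window_size: int         # total tokens the model can see at once
    reserved_for_reply: int  # room kept free for the model's own response

    def fits(self, system_prompt: str, history: list[str], documents: list[str]) -> bool:
        """Return True if everything plus the reserved reply space fits in the window."""
        used = estimate_tokens(system_prompt)
        used += sum(estimate_tokens(message) for message in history)
        used += sum(estimate_tokens(document) for document in documents)
        return used + self.reserved_for_reply <= self.window_size
```

The point of the sketch is that all four items share a single budget: a longer system prompt or a bigger document leaves less room for history and for the reply.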

As the diagram shows, messages fill the window from top to bottom. When the window reaches capacity, something has to give: most chat applications truncate the oldest content, silently discarding it to make room for new input. The model has no memory outside this window. If a message was removed, the model doesn't know it ever existed. There's no hidden storage, no long-term recall.
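Dropping the oldest content first can be sketched as a simple sliding window. This is one possible policy, not a description of how any particular product works. It assumes the same rough four-characters-per-token estimate, and it pins the system prompt so that only conversation messages are dropped; `trim_history` and its parameters are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def trim_history(
    system_prompt: str,
    messages: list[str],
    window_size: int,
    reserved_for_reply: int,
) -> list[str]:
    """Keep the newest messages that fit; older ones are dropped."""
    budget = window_size - reserved_for_reply - estimate_tokens(system_prompt)
    kept: list[str] = []
    # Walk from newest to oldest and stop at the first message that no longer fits.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept
```

Anything that does not make it into the returned list is simply absent from the next request, which is why the model cannot know it ever existed.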

This is why "context-aware" is a meaningful term: it means the model is using everything currently in its window to shape the response. A model with more context can sustain longer conversations, process larger documents, and maintain coherence across extended interactions. But context isn't free — every token in the window consumes compute, which means larger windows cost more per request.

Why It Matters

Understanding context limits is essential for building effective AI workflows. Long conversations will eventually lose their earliest messages, meaning the model may "forget" instructions you gave at the start. This is why chatbots sometimes seem to lose track of what you told them twenty messages ago — those messages were silently dropped from the window.

RAG systems (covered in Grounding) need to fit retrieved documents inside the window alongside the question and the system prompt. If your documents are too large, you need to chunk them, summarize them, or select only the most relevant passages.
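A minimal sketch of the chunk-and-select approach might look like this. It assumes fixed-size word chunks and a naive word-overlap score as a stand-in for real retrieval (production systems usually rank with embeddings or a search index), plus the same rough token estimate. All function names here are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def chunk_text(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def overlap_score(question: str, passage: str) -> int:
    """Count distinct question words that also appear in the passage."""
    return len(set(question.lower().split()) & set(passage.lower().split()))


def select_passages(
    question: str,
    documents: list[str],
    token_budget: int,
    max_words: int = 50,
) -> list[str]:
    """Greedily pick the highest-scoring chunks that fit in token_budget."""
    chunks = [chunk for doc in documents for chunk in chunk_text(doc, max_words)]
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    selected: list[str] = []
    remaining = token_budget
    for chunk in ranked:
        cost = estimate_tokens(chunk)
        if cost <= remaining:
            selected.append(chunk)
            remaining -= cost
    return selected
```

Here `token_budget` is whatever is left of the window after the system prompt, the question, and the space reserved for the response.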

From a cost perspective, context is a major driver of API cost, because providers typically bill per token sent and received. Sending 100K tokens of context for a simple question is wasteful. Smart context management, meaning keeping only what's relevant and trimming what isn't, is a core skill for building AI-powered applications that are both effective and cost-efficient.
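The arithmetic is easy to sketch. The prices below are made-up placeholders, not any provider's real rates, and `request_cost` is a hypothetical helper that assumes per-token billing with separate input and output prices.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request under simple per-token billing."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


# Hypothetical prices, for illustration only.
INPUT_PRICE = 3.0    # currency units per million input tokens (assumed)
OUTPUT_PRICE = 15.0  # currency units per million output tokens (assumed)

# Same short answer (500 tokens), very different amounts of context.
bloated = request_cost(100_000, 500, INPUT_PRICE, OUTPUT_PRICE)
trimmed = request_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
print(f"bloated: {bloated:.4f}, trimmed: {trimmed:.4f}, ratio: {bloated / trimmed:.1f}x")
```

Under these assumed prices the bloated request costs roughly twenty times more than the trimmed one for the same answer, and that gap repeats on every call.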
