Over the last two years of the AI boom, developers have focused on crafting the ideal prose prompt. LLMs were treated as unreliable oracles to be flattered or coerced into a correct answer. Early LLM software engineering relied on complicated English templates, incantations such as “take a deep breath,” and fragile parsing loops. This paradigm was conceived in an age of small context windows and costly inference, and it is now outdated. With context windows growing to millions of tokens and API prices dropping sharply, the challenge is no longer how to talk to models but how to programmatically control the data they consume. The era of literary prompt writing is ending, and it is being replaced by the systematic discipline of context engineering.
The Hidden Physics of Attention: “Lost in the Middle”
To see why traditional prompt engineering falls short, we have to look at how attention behaves in today’s transformer architectures. Early models had context windows as small as four thousand tokens, so developers had to make do with brevity. Today, models such as Claude 3.5 Sonnet and Gemini 1.5 Pro have context windows of hundreds of thousands to millions of tokens. But the ability to consume millions of tokens does not mean the model reasons uniformly across them. In their 2023 paper “Lost in the Middle: How Language Models Use Long Contexts,” Nelson F. Liu and his colleagues showed that language models are biased towards the start and end of their input. Accuracy on key information embedded in the middle of a long context drops sharply, in some cases below what the model achieves with no supporting documents at all.
The Structural Limits of Self-Attention
This “lost in the middle” phenomenon is tied to how the self-attention mechanism in transformers works. Self-attention computes an attention score between each token and every other token in the sequence, then normalizes those scores with a softmax so that the weights sum to one. As the sequence grows, the weight available to any single token shrinks, and the model has a harder time separating the important facts from the extraneous ones. Just throwing a whole codebase, document repository, or chat history into a context window is an invitation to failure. The model tends to prioritize the system prompt at the top and the final query at the bottom while overlooking critical documentation in the middle. Thus, building a reliable AI feature is now largely a database and code orchestration problem.
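The dilution effect can be illustrated with a toy calculation. The sketch below is a deliberately simplified, single-head picture: one relevant token with a fixed raw score competes against filler tokens that all score zero. The function names and score values are assumptions chosen for illustration, and the sketch does not model the positional bias that Liu and his colleagues measured in real models.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_on_key_fact(context_length, relevant_score=4.0, filler_score=0.0):
    """Attention weight on one relevant token among (context_length - 1) fillers.

    The scores are hypothetical; real models produce different scores per
    head and per layer.
    """
    scores = [relevant_score] + [filler_score] * (context_length - 1)
    return softmax(scores)[0]


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        weight = weight_on_key_fact(n)
        print(f"{n:>7} tokens -> weight on the key fact: {weight:.6f}")
```

Even though the relevant token keeps the same raw score, its share of the attention budget falls as more filler is added, which is one reason padding the window with loosely related text tends to hurt.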
Defining Context Engineering: The New AI Paradigm
According to Andrej Karpathy, the former Director of AI at Tesla, context engineering is the delicate art and science of filling the context window with just the right information for the next step. It is a science because the data has to be selected, formatted, and updated under tight control. It is an art because doing that well takes a good feel for the model’s behaviour and the limits of its attention. In practice, this means developers need to build dynamic data pipelines instead of manually adjusting adjectives. These pipelines collect, filter, deduplicate and compress data, then assemble it into a coherent context packet. The goal is no longer to make the prompt look pretty to humans, but to make the input structured and dense with signal for the model.
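A pipeline of this kind might be sketched as follows. Everything here is hypothetical: the `Snippet` type, the relevance scores (assumed to come from some upstream retrieval step), and the thresholds. The "compress" stage is reduced to a crude character budget, standing in for a real compression step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    source: str   # where the text came from (file name, URL, table, ...)
    text: str
    score: float  # assumed relevance to the current task, higher is better


def build_context_packet(snippets, min_score=0.5, max_chars=2000):
    """Filter, deduplicate, budget, and assemble snippets into one string."""
    # Filter: drop anything below the relevance threshold.
    relevant = [s for s in snippets if s.score >= min_score]

    # Deduplicate: compare case- and whitespace-normalized text,
    # keeping the highest-scoring copy of each duplicate.
    seen = set()
    unique = []
    for s in sorted(relevant, key=lambda s: s.score, reverse=True):
        key = " ".join(s.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(s)

    # Budget (stand-in for compression): skip blocks that would
    # push the packet past max_chars.
    parts = []
    used = 0
    for s in unique:
        block = f"[{s.source}]\n{s.text.strip()}"
        if used + len(block) > max_chars:
            continue
        parts.append(block)
        used += len(block)

    # Assemble: one labelled block per snippet.
    return "\n\n".join(parts)


if __name__ == "__main__":
    docs = [
        Snippet("faq.md", "Refunds take 5 days.", 0.9),
        Snippet("old_faq.md", "refunds  take 5 days.", 0.8),
        Snippet("team.md", "The office dog is called Rex.", 0.1),
    ]
    print(build_context_packet(docs))
```

Labelling each block with its source is a design choice, not a requirement; it makes the packet easier to debug and lets the model cite where a fact came from.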
Prompt Caching: The Economics of the KV Cache
Prompt caching is one of the most significant innovations in context engineering, reshaping the economics of long-context applications. Providers like Anthropic, OpenAI, and Google let the KV cache computed for a prompt prefix be reused across subsequent API calls. Traditionally, the model has to reprocess every token on every API request, which is expensive and slow. With prompt caching, the key and value tensors computed for a static prefix are kept in memory. The next time a request arrives with the same prefix, the model skips that computation, which can cut the cost of those tokens by up to ninety percent and noticeably reduce response time. For instance, Anthropic prices cached reads for Claude 3.5 Sonnet at a ninety percent discount, about one tenth of the normal input rate, although writing a prefix into the cache costs somewhat more than processing it normally.
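The arithmetic is easy to work through. The sketch below assumes a base input price of 3.00 USD per million tokens, a cache write priced at 1.25 times the base rate, and a cache read priced at 0.10 times the base rate. These figures are illustrative assumptions and should be checked against the provider's current pricing page. It also assumes every request after the first hits the cache, which will not hold if the cache entry expires between calls.

```python
def input_cost_usd(
    prefix_tokens,
    dynamic_tokens,
    requests,
    base_price_per_mtok=3.00,     # assumed base input price, USD per million tokens
    cache_write_multiplier=1.25,  # assumed premium for writing the prefix to cache
    cache_read_multiplier=0.10,   # ninety percent discount on cached reads
    use_cache=True,
):
    """Estimate input-token cost for repeated requests sharing a static prefix."""
    if requests <= 0:
        return 0.0
    per_token = base_price_per_mtok / 1_000_000
    if not use_cache:
        return requests * (prefix_tokens + dynamic_tokens) * per_token

    first_write = prefix_tokens * per_token * cache_write_multiplier
    later_reads = (requests - 1) * prefix_tokens * per_token * cache_read_multiplier
    dynamic = requests * dynamic_tokens * per_token  # never cached
    return first_write + later_reads + dynamic


if __name__ == "__main__":
    uncached = input_cost_usd(100_000, 500, 100, use_cache=False)
    cached = input_cost_usd(100_000, 500, 100)
    print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
    print(f"saving: {100 * (1 - cached / uncached):.1f}%")
```

Under these assumptions, 100 requests that share a 100,000-token prefix and add 500 fresh tokens each cost about 30.15 USD in input tokens without caching and about 3.50 USD with it. The saving approaches but never quite reaches ninety percent, because the first write and the dynamic tokens are billed at or above the normal rate.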
Architectural Rules for Stable Prefix Caching
These are huge economic and performance gains, but developers need to build their context strings in a KV-cache-aware manner. A high cache hit ratio depends on prefixes staying stable and byte-for-byte identical. Because caching is prefix-based, changing any character in the prompt invalidates everything after the change. So it is a mistake to put the user’s latest query at the start of the prompt, or to embed the dynamic conversation history somewhere in the middle of the reference text. The context engineer needs to separate the static (unchanging) elements of the prompt from the dynamic (changing) ones. The static parts, such as system instructions, database schemas, and API definitions, must be placed at the very beginning of the context. Dynamic information, such as the user’s most recent message, should be appended at the end so that the static block remains cached.
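The static-first, dynamic-last rule can be sketched as a small prompt assembler. The function names and prompt layout are hypothetical, and the character-level prefix comparison is only a proxy: real providers match on tokens and apply their own rules about minimum prefix length and cache boundaries.

```python
def assemble_prompt(system_instructions, schema, history, user_message):
    """Place static content first and dynamic content last."""
    static_prefix = f"{system_instructions}\n\n{schema}\n\n"
    dynamic_suffix = "\n".join(list(history) + [f"User: {user_message}"])
    return static_prefix + dynamic_suffix


def shared_prefix_length(a, b):
    """Number of leading characters two prompts have in common."""
    count = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        count += 1
    return count


if __name__ == "__main__":
    system = "You are a SQL assistant."
    schema = "TABLE orders(id INT, total NUMERIC)"
    static_len = len(f"{system}\n\n{schema}\n\n")

    good_1 = assemble_prompt(system, schema, [], "Total revenue?")
    good_2 = assemble_prompt(system, schema, [], "Largest order?")

    # Anti-pattern: a per-request value placed before the static block.
    bad_1 = "Request time: 10:00\n" + good_1
    bad_2 = "Request time: 10:01\n" + good_2

    print("static block length:", static_len)
    print("stable layout shares:", shared_prefix_length(good_1, good_2))
    print("timestamp-first shares:", shared_prefix_length(bad_1, bad_2))
```

With the stable layout, two requests share at least the full static block. With a timestamp in front, the shared prefix ends inside the timestamp itself, so none of the static block could be reused.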
Beyond Content Dumps: Dynamic Retrieval and Intelligent RAG
The goal is to move beyond content dumps towards dynamic retrieval and intelligent RAG. In addition to prefix stability, context engineering needs dynamic retrieval so that the token budget is not wasted. Piling a whole multi-megabyte codebase into the context window is an expensive and lazy approach that leads to severe attention degradation. Instead, developers need to invest in dynamic retrieval systems such as sophisticated Retrieval-Augmented Generation (RAG). A context pipeline that combines vector databases, embedding models and keyword search can extract only the pieces of text that are semantically relevant to the user’s query. Retrieval can be further refined with reranking models, such as Cohere Rerank, which make a second, focused pass that scores and sorts the retrieved chunks. The developer can then order the chunks by relevance and position the highest-value documents at the start or end of the retrieved block, directly combating the “lost in the middle” effect.
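One simple way to apply that ordering is to alternate ranked chunks between the front and the back of the list, so the weakest material ends up in the middle. The sketch below assumes the relevance scores already exist, for example from a reranker. It makes no API calls, and the function name is hypothetical.

```python
def order_for_long_context(chunks_with_scores):
    """Place the highest-scoring chunks at the edges, the lowest in the middle.

    chunks_with_scores: iterable of (chunk_text, relevance_score) pairs,
    where the scores are assumed to come from an upstream reranking step.
    """
    ranked = sorted(chunks_with_scores, key=lambda pair: pair[1], reverse=True)
    front, back = [], []
    for index, (chunk, _score) in enumerate(ranked):
        # Ranks 1, 3, 5, ... fill the front; ranks 2, 4, 6, ... fill the back.
        if index % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk lands at the very end.
    return front + back[::-1]


if __name__ == "__main__":
    retrieved = [("C", 0.7), ("A", 0.9), ("E", 0.5), ("B", 0.8), ("D", 0.6)]
    print(order_for_long_context(retrieved))
```

For five chunks ranked A to E, the result is A, C, E, D, B: the best chunk opens the context, the second best closes it, and the least relevant one sits in the middle where attention is weakest.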
Token-Level Context Compression and Smart Routing
Developers will also need token-level context compression to further reduce token overhead and increase reliability. Tools such as LLMLingua, an open-source library from Microsoft Research, use a small, fast language model to analyse prompt strings and prune low-entropy words, redundant phrases and boilerplate syntax. By removing the tokens that contribute least to the meaning, these compression engines can shrink prompts by half or more, often with little loss in accuracy. This reduces API costs and sends a compact, high-signal input to the more capable reasoning model. Combine this with smart routing, where a small, fast model first decides whether a larger, more expensive model is needed, and an AI feature can move from prototype to an enterprise system that is ready to scale.
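The routing idea might be sketched like this. In a real system the triage step would itself be a small, fast model. Here a keyword and length heuristic stands in for it, and both "models" are plain callables, so the names, markers, and thresholds are all hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]


def needs_large_model(query: str, max_simple_words: int = 20) -> bool:
    """Placeholder triage: a stand-in for a small classifier model."""
    hard_markers = ("prove", "refactor", "step by step", "compare")
    lowered = query.lower()
    too_long = len(query.split()) > max_simple_words
    return too_long or any(marker in lowered for marker in hard_markers)


def route(
    query: str,
    small_model: Model,
    large_model: Model,
    triage: Callable[[str], bool] = needs_large_model,
) -> str:
    """Send the query to the large model only when triage says it is needed."""
    chosen = large_model if triage(query) else small_model
    return chosen(query)


if __name__ == "__main__":
    # Stub models that only report which tier handled the query.
    def small(query: str) -> str:
        return f"[small] {query}"

    def large(query: str) -> str:
        return f"[large] {query}"

    print(route("What is the capital of France?", small, large))
    print(route("Refactor this module to use async I/O", small, large))
```

Passing the triage function as a parameter keeps the router testable and makes it straightforward to swap the heuristic for a real classifier later.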
Conclusion: The Era of Code Orchestration
The transition from prompt engineering to context engineering is the natural evolution of the AI engineering discipline. Developers who still depend on long, prose-like instructions will be plagued by high latency, unpredictable costs, and frequent hallucinations. By contrast, software engineers who embrace code orchestration, prompt caching, dynamic retrieval pipelines, and systematic context pruning will build AI features that are reliable, fast, and cost-effective. Say goodbye to wordy paragraphs and context dumps. The future of AI development isn’t in English prose; it’s in code that delivers the minimum set of high-signal tokens to the underlying models.