A new study on AI agent productivity reveals that smarter context management — not bigger models — is the real unlock for reliable, cost-efficient automation.
The AI Agent Productivity Problem Nobody Talks About
Here's the dirty secret of deploying AI agents in real workflows: they drown in their own memory. Every tool call, every API response, every back-and-forth gets stuffed into the context window — and eventually the agent loses the plot, repeats itself, or just fails.
Researchers at Microsoft tested this exact problem using GPT-5 agents automating expense itemisation in Dynamics 365. With full conversation history retained, the agent completed 71% of tasks but burned through 1.48 million tokens and took nearly 15 hours per benchmark run. That's expensive and slow.
The Context Engineering Fix That Actually Works
The winning approach combined two simple techniques: prune the context to the last 5 tool interactions, then add a compact rolling summary of what happened before. The result? 91.6% task completion, token usage slashed to 553k (a 63% reduction), and runtime cut to under 6 hours.
Think of it like a surgeon's briefing — you don't hand them the patient's entire life history mid-operation. You give them a crisp summary plus the last few critical readings. The agent stays sharp because it's not wading through stale noise.
The no-context baseline, for reference, managed a humbling 8% completion rate. Context design isn't a nice-to-have — it's the difference between a useful agent and an expensive toy.
What This Means for AI Agent Productivity Learners
If you're building or evaluating AI agents for any workflow — customer support, data processing, code generation — context engineering is now a core skill, not an advanced topic. The gap between a 71% and 91% success rate isn't a better model; it's better architecture around the model you already have.
Understanding how to structure prompts, manage tool call history, and design summarisation pipelines is exactly what separates hobbyist AI use from production-grade deployment. Our Build Your First RAG Pipeline course covers the retrieval and memory patterns that underpin this kind of context-aware design, and Hermes Agent Essentials goes deeper on building agents that stay reliable over long task horizons.
The practical takeaway you can apply today: if your AI agent is underperforming, don't reach for a bigger model first. Audit what's in its context window. You might just be handing it too much history and not enough signal.