75% Lower AI Costs, Zero Code Changes

Every AI tool your team uses talks to the LLM alone. I built the open-source layer that governs them all-85% cache hits, on a benchmark you can run in 60 seconds.

Jun 14, 2026

I’ve spent twenty-five years shipping software that has to work on Monday morning, not just in a demo. Lately I keep hitting the same failure and it isn’t a model problem. It’s a plumbing problem.

The facts that made it undeniable: Uber burned through its entire 2026 AI budget in four months. Amazon shut down an internal leaderboard after engineers started running pointless agent loops to game their scores. A single recursive loop, left alone, can quietly run up a $47,000 API bill before anyone notices. And only a sliver of companies can actually see what their AI tools are spending.

It’s tempting to call this a discipline problem. It isn’t. It’s architecture.

Look at how a normal engineering org actually uses AI today. Claude Code has its own connection to the model. Copilot has its own. The ChatGPT tab open in someone’s browser has its own. Your production agents have their own. Four tools, four private pipes to the same handful of APIs and not one of them shares a cache, a budget, or an audit log.

So the same architecture question gets asked nine times in an afternoon and you pay for all nine. A loop spirals in one service and no other service can see it. And when someone in compliance asks what your AI tools sent to a third party last quarter, the honest answer is “we don’t know.”

The fix isn’t a better model or a sternly worded Slack message. It’s a shared layer that every tool routes through. So I built one.

It's called AgentMesh, and it's open source. It runs as a proxy that speaks the same wire format as Anthropic and OpenAI, so you point your tools at it with two environment variables and change nothing else:

export ANTHROPIC_BASE_URL=http://localhost:8080
export OPENAI_BASE_URL=http://localhost:8080/v1

From that moment, every call from Claude Code, from Copilot, from a curl in a script, from your own agents flows through one pipeline. It checks a shared cache. It enforces a per-team budget before the call is made, not after the damage. It routes to the cheapest model that can actually do the job. And it writes every call to a tamper-evident audit log. The agent code never knows it’s there.

The part I’m proudest of is the cache, because it does something most caches don’t.

Most “LLM caches” only catch byte-for-byte repeats which almost never happen, because people rephrase. They paste “you are a senior architect” in front of the question. They switch between “optimise” and “optimize.” They wrap things in markdown. AgentMesh strips all of that noise away first, then compares what’s left by meaning. So “You are a senior architect. Review this microservices design” and “Analyse this distributed system” land on the same cached answer. That one decision - normalize, then compare by meaning is the difference between a cache that almost never hits and one that hits most of the time.

How much does it matter? I didn’t want to publish a number you can’t check, so the benchmark runs with no API keys at all:

pip install agentmesh-proxy sentence-transformers
python examples/benchmark.py

Twenty requests, five topics, four phrasings each. Eighty-five percent of them never reached the model. Seventy-five percent lower cost. The only misses are the first call per topic exactly what you’d expect from a cold cache. Run it yourself; the entire point is that you don’t have to take my word for it.

Let me be honest about what it isn’t yet. The cache lives in memory in a single process today perfect for a local proxy, not yet right for a fleet of replicas, so a shared Redis backend is the next thing I’m building. There’s no native VS Code panel. There’s no SAML identity propagation. I’d rather ship a small, verifiable core than a wide surface of half-finished features and I’m saying so out loud because that’s the kind of project I’d actually trust.

Here’s the bet underneath all of it. Every primitive we need to make AI affordable and accountable already exists caching, routing, budgets, signed logs. What’s been missing is a single layer that combines them and sits in front of every tool at once, instead of every team reinventing a slice of it in isolation. That layer should be open, not another vendor you have to trust. So AgentMesh is Apache 2.0, and it always will be.

If you’re running AI tools across a team and your bill is growing faster than your usage, here’s the whole ask:

Star and clone it: github.com/anilatambharii/agentmesh
Run the benchmark — sixty seconds, no keys.
Tell me where it breaks. Issues and PRs welcome, especially the Redis backend.

And if you build production AI for a living, subscribe. This is Field Notes: Production AI short, honest write-ups about making this stuff survive contact with the real world.

AgentMesh is open source under Apache 2.0 · GitHub · PyPI: agentmesh-proxy · also on Docker Hub and Hugging Face.

From Anil Prasad

I build AI that survives contact with the real world, and I write Field Notes: Production AI right here. If this was useful, two small things go a long way: subscribe so you catch the next one, and star the repo so the next engineer finds it too.

Try AgentMesh → github.com/anilatambharii/agentmesh · Apache 2.0 · also on PyPI (agentmesh-proxy), Docker Hub, and Hugging Face.

Find me → anilsprasad.com · X @anilsprasad · LinkedIn

Anil Prasad

Discussion about this post

Ready for more?