Top AI Product



Your AI Agent Is Burning Tokens on Noise — Context Gateway Wants to Fix That

AI agents are expensive. Not because LLMs charge too much per token, but because agents waste most of what they send. A typical Claude Code session working through a large codebase can rack up hundreds of thousands of tokens in tool outputs alone — file reads, grep results, compiler errors, test logs — and the vast majority of that text is irrelevant to what the model actually needs to reason about next.

This is the core thesis behind Context Gateway, an open-source proxy from Compresr (YC W26) that sits between your AI agent and the LLM API. Written in Go, it intercepts requests, compresses tool outputs and conversation history using small language models, and forwards a leaner payload to the LLM. The claimed result: 76% cost reduction and 30% lower latency.
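In outline, the proxy pattern is straightforward. The sketch below is hypothetical and not Compresr's actual code; `compress_text` is a stub standing in for the SLM classifier, and the request shape is the familiar OpenAI-style messages array:

```python
# Hypothetical sketch of the proxy pattern (not Compresr's code):
# intercept an OpenAI-style chat request, shrink tool-role messages,
# and forward the leaner payload upstream.

def compress_text(text: str, ratio: float = 0.5) -> str:
    """Stub for the SLM classifier: keep roughly `ratio` of the tokens."""
    tokens = text.split()
    return " ".join(tokens[: max(1, int(len(tokens) * ratio))])

def shrink_request(request: dict) -> dict:
    """Return a copy of the request with tool outputs compressed;
    user and assistant turns pass through untouched."""
    slim = dict(request)
    slim["messages"] = [
        {**m, "content": compress_text(m["content"])} if m["role"] == "tool" else m
        for m in request["messages"]
    ]
    return slim
```

The interesting part is everything this sketch leaves out: deciding *which* tokens to keep, which is where the trained models come in.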

The project hit Hacker News as a Show HN in mid-March 2026, pulling in 97 points and sparking a surprisingly technical debate about whether context compression should be a standalone product or a built-in feature of LLM providers.

How Compression Works Without Losing the Signal

Context Gateway doesn’t summarize. That distinction matters. Summarization rewrites content, which means it can hallucinate variable names, alter file paths, or rephrase error messages. Instead, Compresr trained small language models (SLMs) that function as token-level classifiers — they decide what’s relevant and what’s noise without generating new text. The original structure, including code snippets and error traces, stays intact.

Three compression models ship with the proxy:

  • espresso_v1 — query-agnostic token-level compression for system prompts and static documentation
  • latte_v1 — query-specific compression for RAG pipelines, with up to 200x reduction on targeted workloads
  • coldbrew_v1 — chunk-level filtering for coarse retrieval

The default compression ratio is set at 0.5, meaning a 50% token reduction per call. The 200x number that shows up in their marketing applies to extreme RAG scenarios with latte_v1 — daily usage for most developers lands closer to that 50% figure.
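The "filter, don't rewrite" property can be illustrated with a toy token-level filter. The keyword-overlap scoring below is a crude stand-in for the trained SLM classifier, but the invariant is the same one that matters: every surviving token comes verbatim from the input, in its original order, so nothing can be hallucinated.

```python
def filter_tokens(text: str, query: str, ratio: float = 0.5) -> str:
    """Toy token-level filter: score each token by overlap with the query,
    keep the top fraction, and preserve original order and spelling.
    A real SLM classifier replaces this scoring; nothing is rewritten."""
    query_terms = set(query.lower().split())
    tokens = text.split()
    scored = [(i, 1.0 if t.lower().strip(".,:()") in query_terms else 0.0)
              for i, t in enumerate(tokens)]
    keep_n = max(1, int(len(tokens) * ratio))
    # Take the highest-scoring tokens, then restore document order.
    kept = sorted(sorted(scored, key=lambda p: -p[1])[:keep_n])
    return " ".join(tokens[i] for i, _ in kept)
```

A usage example: filtering `"error in file foo.py line 42 plus unrelated chatter about weather"` against the query `"error foo.py line"` at ratio 0.5 keeps the error location and drops the chatter, with no token in the output that wasn't in the input.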

What makes the architecture interesting is the safety net: Context Gateway stores all original tool outputs locally. If the LLM realizes mid-conversation that it’s missing information, it can call an expand() function to retrieve the uncompressed version on demand. It’s a bet that most compressed content won’t need to be re-expanded — and when it does, the cost of fetching it is still lower than sending everything uncompressed every time.
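The store-and-expand mechanism might look roughly like this; it is an assumed design for illustration, not Compresr's implementation, and the handle scheme is invented:

```python
import hashlib

class ContextStore:
    """Toy expand-on-demand store (assumed design, not Compresr's code):
    keep full tool outputs locally, hand the model a compressed view plus
    a handle it can use to fetch the original if it finds a gap."""

    def __init__(self):
        self._originals = {}

    def compress(self, text: str, ratio: float = 0.5):
        handle = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._originals[handle] = text          # original never leaves disk/memory
        tokens = text.split()
        short = " ".join(tokens[: max(1, int(len(tokens) * ratio))])
        return short, handle

    def expand(self, handle: str) -> str:
        """Roughly what the model's expand() tool call would hit."""
        return self._originals[handle]
```

Because the original is stored losslessly, an `expand()` round trip always recovers the exact bytes, which is what makes the approach safer than summarization for code and error traces.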

The 85% Threshold: Background Compaction That Doesn’t Block Your Session

One of the sharpest pain points for heavy Claude Code users is the /compact command. When your context window fills up, you run /compact, wait roughly three minutes for the model to summarize everything, and then continue. During that time, you’re blocked.

Context Gateway takes a different approach. When the context window reaches 85% capacity, the proxy automatically triggers background compression without pausing the conversation. The agent keeps working while the proxy quietly shrinks the history behind the scenes.
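The trigger amounts to a threshold check that hands compaction off to a background worker so the foreground request path never blocks. This is an illustrative sketch with invented names and parameters, not the proxy's Go internals:

```python
import threading

def maybe_compact(history, compact_fn, limit=200_000, threshold=0.85):
    """Fire compaction in a background thread once context usage crosses
    the threshold (85% in Context Gateway's case), without blocking the
    caller. Token counting here is a naive whitespace split."""
    used = sum(len(m.split()) for m in history)
    if used < limit * threshold:
        return None                      # plenty of headroom; do nothing
    worker = threading.Thread(target=compact_fn, args=(history,), daemon=True)
    worker.start()                       # caller continues immediately
    return worker
```

The design point is that the expensive step (compression) runs concurrently with the session, which is the difference from a blocking `/compact`.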

This is a meaningful UX improvement for long coding sessions. Anyone who’s spent 45 minutes deep in a refactoring session with Claude Code — building up context about the codebase, testing approaches, debugging failures — knows the frustration of hitting the context wall. The session either loses its thread or grinds to a halt. Background compaction addresses this by treating context management as infrastructure rather than a user action.

The EPFL Team Behind Compresr

Compresr isn’t a weekend hackathon project. All four founders come from EPFL (Swiss Federal Institute of Technology Lausanne), and their backgrounds are unusually well-aligned with what they’re building:

  • Ivan Zakazov (CEO) — PhD research at EPFL focused specifically on LLM context compression. Previously at Microsoft and Philips Research. Published at EMNLP and NeurIPS on this exact topic.
  • Oussama Gabouj (CTO) — Research at EPFL’s DLab and AXA, specializing in efficient ML systems and prompt compression.
  • Kamel Charaf (COO) — Data Science Master’s from EPFL, former Bell Labs.
  • Berke Argin (CAIO) — CS from EPFL, previously at UBS.

They’re in Y Combinator’s Winter 2026 batch, partnered with Jared Friedman. The GitHub repo shows 412 stars, 34 forks, and 12 releases in the five weeks since it launched on February 10, 2026. Not viral numbers, but steady traction for a developer tool in a niche category.

How It Stacks Up Against Alternatives

Context compression is a small but growing space. Here’s where Context Gateway sits relative to other approaches:

Claude Code’s native /compact — Built-in, zero setup, but blocks your session for minutes. It also rewrites context through summarization, which can distort code-heavy conversations. Context Gateway runs in the background and preserves original tokens.

Microsoft LLMLingua — A research library using perplexity-based pruning that achieves up to 20x compression. Strong academic results, but it’s a library, not an agent-ready proxy. You’d need to build the integration yourself.

Google ADK’s compaction_interval — If you’re building on Google’s Agent Development Kit, you get native compaction built in. No installation needed, but you’re locked into Google’s ecosystem.

Headroom — The closest direct competitor. Also works as a proxy, also offers lossless compression with an expand-on-demand mechanism, and also runs locally. Headroom offers more integration options (Python library, ASGI middleware, LiteLLM callback) while Context Gateway is Go-native and more opinionated about the proxy model.

Morph Compact — Takes a verbatim compaction approach, reducing context by 50-70% while keeping every surviving sentence word-for-word. Avoids the hallucination problem of summarization-based compression.

The real competition, though, might be the LLM providers themselves. As one Hacker News commenter noted: “If this idea is good, Anthropic et al. will roll it into their own product.” The counterargument from the community was equally sharp: “Anthropic would cut their API revenue in half by rolling out compression.” Whether that economic tension protects Compresr’s market position long-term is an open question.

What the Community Is Actually Saying

The Hacker News discussion surfaced some pointed concerns worth considering:

Security risk with untrusted content. If Context Gateway compresses untrusted external content alongside trusted system instructions, you’re mixing adversarial input with your prompt before injection detection happens. For agents that process user-submitted data, this is a real attack surface.

Prompt cache invalidation. Anthropic and OpenAI both offer prompt caching — if your context prefix stays stable, you get significant cost savings. But compression changes the prefix on every call, potentially invalidating cache benefits. Depending on your usage pattern, the caching savings you lose might offset the compression savings you gain.
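This tradeoff is easy to put numbers on. The sketch below uses illustrative figures (cache reads at roughly 10% of base input price, broadly in line with Anthropic's published discount; $3/MTok is an assumed base rate), and shows the math can flip against compression when a stable, mostly-cached prefix gets halved but invalidated:

```python
def call_cost(tokens, price_per_mtok, cached_frac=0.0, cache_discount=0.9):
    """Cost of one call's input. Cached tokens are billed at a discount
    (assumed 90% off, i.e. cache reads at ~10% of base input price)."""
    cached = tokens * cached_frac
    fresh = tokens - cached
    return (fresh + cached * (1 - cache_discount)) * price_per_mtok / 1e6

# 100k-token context at an assumed $3/MTok input price:
baseline   = call_cost(100_000, 3.0, cached_frac=0.8)  # stable prefix, 80% cached -> $0.084
compressed = call_cost(50_000, 3.0)                    # halved tokens, cache busted -> $0.15
no_cache   = call_cost(100_000, 3.0)                   # neither optimization -> $0.30
```

With these numbers, compression beats doing nothing but loses to a well-cached prefix; the crossover depends entirely on how stable your context actually is between calls.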

Fixed ratios lack nuance. A 0.5 compression ratio applied uniformly to a 200-line error traceback and a 3-line function signature treats very different content identically. Adaptive compression based on content type would be more effective, though harder to implement.
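An adaptive scheme could be as simple as a content-type heuristic. The thresholds and rules below are invented for illustration; a real implementation would classify content far more carefully:

```python
def adaptive_ratio(content: str) -> float:
    """Toy content-aware compression ratio (hypothetical heuristic):
    squeeze long tracebacks hard, leave short signatures alone."""
    lines = content.count("\n") + 1
    if "Traceback" in content and lines > 50:
        return 0.2   # keep only 20% of a sprawling traceback
    if lines <= 5:
        return 1.0   # a 3-line function signature isn't worth touching
    return 0.5       # the default ratio for everything else
```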

The “vibe-codeable” critique. One commenter argued that if dozens of developers could build this in hours, it shouldn’t be a standalone product. The Compresr team’s counter is that their compression models — trained on NLP research from their PhD work — are the moat, not the proxy infrastructure.

Who Should Actually Use This

Context Gateway makes the most sense for:

  • Heavy Claude Code users running long sessions on large codebases where token costs are a real budget concern
  • Teams running agents in production who need monitoring dashboards, configurable spend caps, and Slack notifications for cost control
  • RAG pipelines processing large documents where aggressive compression on retrieval results can dramatically cut costs

It’s probably not worth the setup overhead if you’re a casual user running short coding sessions, or if you’re already benefiting heavily from prompt caching. The proxy adds a dependency to your stack, and at version 0.5.2, it’s still early-stage software.

For teams already tracking AI spend with tools like Toolspend, Context Gateway addresses the problem from the opposite direction — instead of monitoring what you’re spending, it reduces what you spend in the first place.

Frequently Asked Questions

Is Context Gateway free?
Yes. The proxy itself is open-source on GitHub. Compresr also offers a hosted compression API for teams that don’t want to run the proxy locally, though pricing for the hosted version hasn’t been publicly detailed.

Does Context Gateway work with all LLM providers?
It supports any LLM accessible via OpenAI-compatible API endpoints. That covers Claude (via Anthropic’s API), GPT models, and most open-source model hosting platforms. Setup uses an interactive wizard that takes a few minutes.
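For SDK-based agents, routing traffic through a local proxy typically amounts to a base-URL swap. The address below is a placeholder for illustration, not Context Gateway's documented default; `OPENAI_BASE_URL` is the environment variable the official OpenAI SDKs honor:

```python
import os

# Route any OPENAI_BASE_URL-honoring client through a local proxy.
# The port and path here are placeholders, not documented defaults.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"  # assumed proxy address
# The real provider key is passed through; the proxy forwards it upstream.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
```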

How does it compare to just using a cheaper model?
Dropping to a cheaper model reduces cost but also reduces capability. Context Gateway lets you keep using a frontier model while sending it less noise. The two approaches aren’t mutually exclusive — you could compress context and use a cheaper model for even larger savings.

Can compression cause the LLM to miss important information?
Yes, this is the fundamental tradeoff. The expand() function mitigates this by letting the model request uncompressed content when it detects gaps. But this relies on the model recognizing that something is missing, which isn’t guaranteed in complex multi-step agent chains. The Compresr team’s bet is that their SLM classifiers are accurate enough that information loss is rare in practice.

What agents does it support?
Claude Code, Cursor, OpenClaw, Codex, and any custom agent that communicates via OpenAI-compatible API endpoints. The proxy is agent-agnostic by design — it intercepts at the API layer, so the agent doesn’t need to know it exists.

