Top AI Product

We track trending AI tools across Product Hunt, Hacker News, GitHub, and more — then write honest, opinionated takes on the ones that actually matter. No press releases, no sponsored content. Just real picks, published daily. Subscribe to stay ahead without drowning in hype.


Hindsight (by Vectorize) Hits 91% on LongMemEval — The Case for Giving AI Agents Human-Like Memory

RAG was supposed to be the answer to AI’s memory problem. Feed your agent a vector database full of documents, let it retrieve relevant chunks at query time, and you’ve got context-aware responses. Except when you don’t. RAG falls apart when agents need to operate across multiple sessions, track how facts change over time, or distinguish between what they’ve observed and what they believe. That’s the gap Hindsight, an open-source project from Boulder-based startup Vectorize, is designed to fill — and its 91.4% accuracy on the LongMemEval benchmark suggests the approach is working.

Hindsight has been climbing GitHub’s trending charts in March 2026, sitting at #14 on Trendshift.io with 3.8k stars. VentureBeat ran a feature calling it “20/20 vision for AI agents stuck on failing RAG.” And the project has been shipping integrations at a pace that’s hard to ignore: Ollama, Pydantic AI, OpenClaw, and MCP server support all dropped in a single week.

What Hindsight Actually Does Differently

Most memory systems for AI agents treat memory as a retrieval problem — store embeddings, run similarity search, dump context into a prompt. Hindsight takes a different path, one modeled on how human long-term memory works. Instead of hoarding raw text, it extracts structured facts, resolves entities, builds a knowledge graph, and organizes everything into distinct memory types.

The system is built around three core operations:

  • Retain: When an agent interacts with the world, Hindsight automatically captures and structures the information without the agent needing to decide what’s worth remembering.
  • Recall: When the agent needs context, Hindsight runs four parallel retrieval strategies — semantic vector similarity, BM25 keyword matching, graph traversal through shared entities, and temporal filtering. Results are merged using Reciprocal Rank Fusion and passed through a neural cross-encoder reranker.
  • Reflect: Periodically, the system analyzes accumulated memories to generate higher-level insights — mental models that auto-update as new information comes in.
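Reciprocal Rank Fusion, the merge step named in the Recall operation, is a simple and well-documented formula: a document scores 1/(k + rank) in every ranked list that contains it, and the scores are summed. Here is a minimal sketch of how four parallel strategies might be fused this way — the strategy outputs and the k=60 default are illustrative assumptions, not Hindsight's actual code:

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Merge several ranked result lists with Reciprocal Rank Fusion.

    Each input list is ordered best-first; a document's fused score is
    the sum of 1 / (k + rank) over every list it appears in.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of the four parallel retrieval strategies:
semantic = ["m3", "m1", "m7"]   # vector similarity
bm25     = ["m1", "m3", "m9"]   # keyword matching
graph    = ["m7", "m1"]         # entity traversal
temporal = ["m9", "m3"]         # time filtering

fused = rrf_merge([semantic, bm25, graph, temporal])
# "m1" and "m3" end up on top: each appears in three of the four lists
```

The appeal of RRF is that it needs no score calibration across strategies — only ranks — which is why it is a common choice for hybrid retrieval pipelines like the one described here. A reranking cross-encoder would then re-score the fused list.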

Under the hood, two technical innovations drive the performance. TEMPR (Temporal Entity Memory Priming Retrieval) handles the recall side, enabling context-aware retrieval based on time and entity relationships. CARA (Coherent Adaptive Reasoning Agents) handles reflection, with configurable disposition parameters like skepticism, literalism, and empathy that keep reasoning consistent across sessions.
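The disposition parameters named above (skepticism, literalism, empathy) suggest a configuration surface something like the following. This is a hypothetical sketch of what such a config could look like — the value range and validation are assumptions, not Vectorize's API:

```python
from dataclasses import dataclass

@dataclass
class Disposition:
    """Hypothetical container for CARA-style disposition parameters.

    Values are assumed to range over [0.0, 1.0] and to bias how the
    reflection step interprets accumulated memories.
    """
    skepticism: float = 0.5  # how much conflicting evidence is demanded
    literalism: float = 0.5  # how literally statements are taken
    empathy: float = 0.5     # how much weight user intent and sentiment get

    def __post_init__(self):
        for name in ("skepticism", "literalism", "empathy"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")

# A cautious fact-checking agent might be configured like this:
cautious_agent = Disposition(skepticism=0.9, literalism=0.7, empathy=0.3)
```

Holding these parameters fixed across sessions is what would keep an agent's reasoning style consistent, per the article's framing.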

Memory itself is organized into four categories: world knowledge (facts about the environment), experiences (the agent’s own interactions and outcomes), opinions, and observations. This separation matters because it lets agents reason about what they know versus what they’ve seen versus what they think — a distinction that flat vector stores can’t make.
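The four category names come straight from the article; the data model below is only an illustration of why the separation is useful, not Hindsight's actual schema. Tagging each memory with its kind lets an agent query "what do I believe?" separately from "what have I seen?":

```python
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    WORLD = "world"            # facts about the environment
    EXPERIENCE = "experience"  # the agent's own interactions and outcomes
    OPINION = "opinion"        # what the agent believes
    OBSERVATION = "observation"  # what the agent has directly seen

@dataclass
class Memory:
    text: str
    kind: MemoryType

def recall_by_kind(memories, kind):
    """Filter a memory store by category, so the agent can reason
    separately over what it knows, has seen, and believes."""
    return [m for m in memories if m.kind is kind]

store = [
    Memory("Paris is the capital of France", MemoryType.WORLD),
    Memory("User preferred the shorter summary", MemoryType.EXPERIENCE),
    Memory("This user likely values brevity", MemoryType.OPINION),
]

beliefs = recall_by_kind(store, MemoryType.OPINION)
```

A flat vector store collapses all three of those entries into undifferentiated chunks, which is precisely the distinction the article says flat stores can't make.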

The Benchmark Numbers — and What They Mean

The LongMemEval benchmark is the standard test for evaluating how well memory systems handle long-running conversational scenarios. Before Hindsight, no system had cracked 90%. Here’s how the landscape looked:

System                  LongMemEval Accuracy
Hindsight               91.4%
Memobase                75.78%
Mem0                    Below 75%
Zep                     Below 75%
Full-context baseline   ~47%

That 91.4% was achieved using Google’s Gemini 3 Pro Preview. Even with smaller open-source models, Hindsight scored 83.18% (OSS-20B) and 85.67% (OSS-120B) — still well ahead of every competitor.

The numbers were independently verified by Virginia Tech’s Sanghani Center for AI and The Washington Post’s applied ML team. Naren Ramakrishnan, Virginia Tech’s AI director, noted that “AI agents are notorious for being inconsistent and brittle” and that TEMPR’s approach of allowing agents to recall successful experiences helps address that fundamental fragility.

The roughly 44-point improvement over the full-context baseline is particularly telling. It means that even if you could stuff an entire conversation history into the prompt (which, practically, you can't), Hindsight's structured approach would still surface better information.

How It Stacks Up Against Mem0, Zep, and Letta

The AI agent memory space has gotten crowded. Mem0, Zep, Letta (formerly MemGPT), and Cognee are all fighting for the same developer mindshare. Each takes a distinct approach:

Mem0 offers the fastest route to production with a managed SaaS that combines graph, vector, and key-value stores. It’s pragmatic and well-packaged, but it’s fundamentally still treating memory as a retrieval problem rather than a reasoning substrate.

Zep stores memory as a temporal knowledge graph, tracking how facts change over time. It’s the most enterprise-focused option, combining graph-based memory with vector search for relationship modeling. Strong on temporal reasoning, but tightly coupled to its own infrastructure.

Letta takes the open-source purist approach, exposing editable memory blocks and a stateful runtime. Agents can explicitly inspect and modify their own memory. It’s the most transparent option but requires more developer effort to get running in production.

Hindsight positions itself as the accuracy leader with the broadest integration surface. The MIT license, Docker-based deployment, and MCP protocol support mean it slots into existing agent architectures without requiring a full rewrite. The key differentiator is the structured memory model — separating world facts from experiences from beliefs — combined with the hybrid retrieval pipeline that goes beyond pure vector similarity.

Where Hindsight stands out most clearly is on benchmarks and architectural flexibility. Where it may lag behind Mem0 or Zep is in managed infrastructure maturity — Hindsight Cloud exists but is still in early access.

The March 2026 Integration Blitz

What pushed Hindsight onto GitHub’s trending page wasn’t just the benchmark results (those were published in December 2025). It was a concentrated burst of integration work in early March 2026 that made the system dramatically easier to adopt:

March 4: MCP server release. Any AI agent that speaks the Model Context Protocol can now use Hindsight as its memory backend with a JSON config change. This alone opened the door to Claude, Cursor, Windsurf, and dozens of other MCP-compatible tools.

March 6: OpenClaw integration. The Hindsight plugin for OpenClaw replaces its built-in memory layer, automatically capturing every conversation turn without the agent needing to decide what to remember.

March 9: Pydantic AI support. Five lines of code to add persistent memory to any Pydantic AI agent.

March 10: Ollama integration. Run the entire stack locally — Hindsight, PostgreSQL, and your LLM — with no API keys, no cloud costs, and no data leaving your machine. Setup takes about ten minutes.

This matters because the biggest barrier to adopting a memory system isn’t usually the technology — it’s the integration tax. By shipping native support for the tools developers are already using, Vectorize collapsed the gap between “interesting paper” and “something I can use this afternoon.”

Deployment and Pricing

Hindsight runs as a single Docker container with an embedded PostgreSQL instance. For more control, Docker Compose supports external PostgreSQL. There’s also a Python embedded mode that requires no server at all — useful for prototyping and testing.

Client SDKs are available for Python, Node.js/TypeScript, and REST. The CLI provides quick access for experimentation.

On the LLM side, Hindsight works with OpenAI, Anthropic, Google Gemini, Groq, Ollama, LMStudio, and Minimax. The Ollama integration is particularly notable for privacy-conscious deployments: everything runs locally with zero external API calls.

Hindsight Cloud offers a managed version — sign up, get an API key, connect over HTTPS. Pricing follows a usage-based model with a free tier to start. The cloud product is still in early access, which means the self-hosted path is currently the more battle-tested option.

Vectorize itself is a seed-stage startup backed by $3.6M from True Ventures. CEO Chris Latimer, formerly a Google Cloud solution architecture lead, frames the vision simply: “We wanted to build an agent memory system that works like human memory. As humans, we don’t remember everything; we extract what matters.”

FAQ

Is Hindsight (by Vectorize) free to use?
Yes. Hindsight is MIT-licensed and fully open source. You can self-host it at no cost using Docker. Hindsight Cloud offers a managed option with a free tier and usage-based pricing for scaling.

How does Hindsight compare to using RAG for agent memory?
RAG treats memory as document retrieval — store chunks, search by similarity, inject into prompts. Hindsight goes further by extracting structured facts, building entity graphs, tracking temporal changes, and separating different types of knowledge. On LongMemEval, Hindsight scored 91.4% versus roughly 47% for a full-context baseline approach.
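The failure mode is easy to reproduce in miniature: similarity-only retrieval returns every version of a fact that resembles the query, while temporally aware recall keeps only the latest one. A toy illustration, using word overlap as a stand-in for vector similarity — none of this is Hindsight's code:

```python
from datetime import date

facts = [
    {"text": "user lives in Berlin", "observed": date(2025, 3, 1)},
    {"text": "user lives in Lisbon", "observed": date(2025, 11, 20)},
]

def overlap(a, b):
    # Crude stand-in for embedding similarity: shared-word count.
    return len(set(a.split()) & set(b.split()))

def flat_recall(query, store):
    # Similarity-only retrieval: both versions of the fact look relevant.
    return [f["text"] for f in store if overlap(query, f["text"]) > 0]

def temporal_recall(query, store):
    # Keep only the most recently observed matching fact.
    hits = [f for f in store if overlap(query, f["text"]) > 0]
    return max(hits, key=lambda f: f["observed"])["text"] if hits else None

flat_recall("where does the user live", facts)      # both facts, contradictory
temporal_recall("where does the user live", facts)  # "user lives in Lisbon"
```

A real system would use embeddings and richer time semantics, but the shape of the problem — stale facts surviving similarity search — is the same.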

What AI frameworks and tools does Hindsight integrate with?
Hindsight supports the MCP protocol (compatible with Claude, Cursor, and other MCP tools), Pydantic AI, OpenClaw, and works with LLM providers including OpenAI, Anthropic, Gemini, Groq, and Ollama. SDKs are available for Python and Node.js.

Can I run Hindsight completely locally without any cloud services?
Yes. With the Ollama integration released in March 2026, you can run Hindsight with a local LLM, local PostgreSQL, and no external API calls. No data leaves your machine.

Who built Hindsight and is it actively maintained?
Hindsight is built by Vectorize AI, a Boulder, Colorado startup founded in 2024 with $3.6M in seed funding from True Ventures. The GitHub repository has 639 commits, 3.8k stars, and 270 forks as of March 2026, with active development and community channels on Slack and GitHub Discussions.

