GrepSeek (the paper is formally titled “Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction”) takes a sharp position on RAG: tear out the embedding model, the vector index, and the top-k retrieval, and let the agent search the corpus the way a coding agent navigates a codebase, with nothing but grep, find, bash, and shell pipelines.
## Why “just grep”
The paper’s argument against the standard pipeline is mechanical. Agentic search needs exact lexical matches, sparse-clue conjunctions, and multi-step hypothesis refinement — operations a fixed similarity interface fundamentally can’t express. You can’t ask an embedding model for “documents mentioning both X and Y but not Z, then narrow by date.” You can in bash. Treating the corpus as a filesystem the agent can grep through restores the operators retrieval was missing.
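To make the “both X and Y but not Z, then narrow by date” query concrete, here is a minimal sketch of that filter written as plain Python over a directory of text files. It is a stand-in for the kind of grep pipeline an agent might compose, not the paper's implementation: the function name `grep_filter`, the `*.txt` layout, and the assumption that documents carry ISO-style dates (`YYYY-MM-DD`) in their text are all hypothetical choices made for illustration.

```python
import re
from pathlib import Path


def grep_filter(root, must, must_not=(), year=None):
    """Return names of .txt files under `root` that contain every term in
    `must`, none of the terms in `must_not`, and (optionally) at least one
    ISO-style date from `year`. Matching is case-insensitive substring search.

    Hypothetical helper: the corpus layout and date format are assumptions.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()

        # Conjunction: every required term must appear (like chained `grep -l`).
        if not all(term.lower() in text for term in must):
            continue

        # Negation: drop files that mention any excluded term (like `grep -L`).
        if any(term.lower() in text for term in must_not):
            continue

        # Refinement step: keep only files with a date in the requested year.
        if year is not None and not re.search(rf"\b{year}-\d{{2}}-\d{{2}}\b", text):
            continue

        hits.append(path.name)
    return hits
```

A call such as `grep_filter("corpus/", ["alice", "bob"], must_not=["carol"], year=2024)` expresses the whole query in one step. Each clause is an exact, composable predicate, which is the property the paper argues a single similarity score cannot provide.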
## Numbers and code
Direct Corpus Interaction (DCI) outperforms top baselines across 13 benchmarks, with an open implementation at DCI-Agent/DCI-Agent-Lite. The work comes from UMass Amherst, led by Alireza Salemi.
## Why it matters
Vector RAG quickly became the default, but it is the same kind of retrieval interface that coding agents stopped using long ago, because grep is faster, cheaper, and more expressive than embedding-space lookup for the patterns programmers actually care about. Importing that lesson into agentic search is the kind of move that obsoletes a layer of the stack instead of optimising it. Pair it with a model that can write good shell pipelines and the whole “vector store” assumption suddenly looks optional.
