ArXiv TLDR

vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

2604.15484

Jayson Steffens

cs.IR

TLDR

vstash is a local-first hybrid retrieval system for LLM agents that uses adaptive fusion and self-supervised embedding refinement to boost search performance.

Key contributions

  • Introduces vstash, a local-first hybrid retrieval system combining vector search and full-text matching in SQLite.
  • Proposes self-supervised embedding refinement using hybrid retrieval disagreement, improving NDCG@10 by up to 19.5%.
  • Develops adaptive RRF with per-query IDF weighting, boosting NDCG@10 by up to 21.4% over fixed weights.
  • Provides a production-grade substrate with integrity checking, schema versioning, and ranking diagnostics, validated on 50,425 relevance-judged queries with 20.9 ms median search latency at 50K chunks.
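The adaptive RRF contribution above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`rrf_fuse`, `adaptive_weights`) and the IDF-to-weight squashing heuristic are assumptions; the paper only states that per-query IDF informs how the vector and full-text lists are weighted before fusion.

```python
def rrf_fuse(vec_ranking, fts_ranking, w_vec=0.5, w_fts=0.5, k=60):
    """Weighted Reciprocal Rank Fusion of two ranked lists of doc ids.

    Each list contributes w / (k + rank) per document; k=60 is the
    conventional RRF smoothing constant.
    """
    scores = {}
    for rank, doc in enumerate(vec_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_vec / (k + rank)
    for rank, doc in enumerate(fts_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_fts / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def adaptive_weights(query_terms, idf, lo=0.3, hi=0.7):
    """Hypothetical per-query weighting: rare (high-IDF) query terms
    shift weight toward the keyword (FTS) side, common terms toward
    the vector side. The /10 squashing and [lo, hi] clamp are made up
    for illustration.
    """
    if not query_terms:
        return 0.5, 0.5
    mean_idf = sum(idf.get(t, 0.0) for t in query_terms) / len(query_terms)
    w_fts = min(hi, max(lo, mean_idf / 10.0))
    return 1.0 - w_fts, w_fts
```

A query containing rare jargon would thus lean on exact keyword matches, while a paraphrase-style query leans on embedding similarity — one plausible way to realize "per-query IDF weighting."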

Why it matters

This paper introduces vstash, an efficient local-first retrieval system that significantly improves search quality for LLM agents. Its self-supervised embedding refinement and adaptive fusion deliver double-digit NDCG@10 gains from a 33M-parameter model, making strong hybrid retrieval practical on local hardware without a server or human relevance labels.

Original Abstract

We present **vstash**, a local-first document memory system that combines vector similarity search with full-text keyword matching via Reciprocal Rank Fusion (RRF) and adaptive per-query IDF weighting. All data resides in a single SQLite file using sqlite-vec for approximate nearest neighbor search and FTS5 for keyword matching. We make four primary contributions. **(1)** Self-supervised embedding refinement via hybrid retrieval disagreement: across 753 BEIR queries on SciFact, NFCorpus, and FiQA, 74.5% produce top-10 disagreement between vector-heavy (vec=0.95, fts=0.05) and FTS-heavy (vec=0.05, fts=0.95) search (per-dataset rates 63.4% / 73.4% / 86.7%, Section 5.2), providing a free training signal without human labels. Fine-tuning BGE-small (33M params) with MultipleNegativesRankingLoss on 76K disagreement triples improves NDCG@10 on all 5 BEIR datasets (up to +19.5% on NFCorpus vs. BGE-small base RRF, Table 6). On 3 of 5 datasets, under different preprocessing, the tuned 33M-parameter pipeline matches or exceeds published ColBERTv2 results (110M params) and an untrained BGE-base (110M); on FiQA and ArguAna it underperforms ColBERTv2 (Section 5.5). **(2)** Adaptive RRF with per-query IDF weighting improves NDCG@10 on all 5 BEIR datasets versus fixed weights (up to +21.4% on ArguAna), achieving 0.7263 on SciFact with BGE-small. **(3)** A negative result on post-RRF scoring: frequency+decay, history-augmented recall, and cross-encoder reranking all failed to improve NDCG. **(4)** A production-grade substrate with integrity checking, schema versioning, ranking diagnostics, and a distance-based relevance signal validated on 50,425 relevance-judged queries across the 5 BEIR datasets. Search latency remains 20.9 ms median at 50K chunks with stable NDCG. The fine-tuned model is published as `Stffens/bge-small-rrf-v2` on HuggingFace. All code, data, and experiments are open-source.
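The abstract's "free training signal" comes from queries where vector-heavy and FTS-heavy search disagree on the top 10. A small sketch of how such disagreement might be mined into training triples follows; note this is a guess at the mechanics, since the abstract does not specify how triples are constructed. The helper name `mine_disagreement` and the agreed-positives / disputed-negatives heuristic are hypothetical, standing in for whatever procedure produced the paper's 76K triples.

```python
def mine_disagreement(query, vec_top10, fts_top10):
    """Emit (query, positive, negative) triples from top-10 disagreement.

    Assumption (not from the paper): documents retrieved by BOTH systems
    are treated as pseudo-positives; documents found by only one system
    are treated as hard negatives. Queries with identical top-10 sets
    yield no signal.
    """
    vec_set, fts_set = set(vec_top10), set(fts_top10)
    if vec_set == fts_set:
        return []  # full agreement: no disagreement signal for this query
    agreed = vec_set & fts_set       # pseudo-positives
    disputed = vec_set ^ fts_set     # hard negatives (one-sided hits)
    return [(query, pos, neg) for pos in agreed for neg in disputed]
```

Triples of this shape are the standard input format for MultipleNegativesRankingLoss in sentence-transformers, which matches the fine-tuning setup the abstract names for BGE-small.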
