Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings
TLDR
Text embeddings fail to capture fine-grained research agendas: roughly 80% of top-10 retrievals in scientific RAG land off-agenda.
Key contributions
- Text embeddings (Gemini, Qwen3-8B, Qwen3-0.6B, SPECTER2) fail to match the query's research agenda (L2) in roughly 80% of top-10 retrievals.
- Built an augmented citation graph of 3.58M papers, partitioned via Leiden CPM into sub-fields (L1) and research agendas (L2).
- The failure is universal across eight scientific domains and all four state-of-the-art embedding models.
- A deliberately simple citation-count rerank beats the best cosine retriever (Gemini, 50.6% top-1 L2) by about 9 points for agenda-specific retrieval; a sketch of one possible reading follows this list.
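The paper calls the rerank "deliberately simple" but the digest does not spell out its scoring rule. One plausible reading, sketched below, reorders a lexically retrieved candidate pool by how many citation-graph links each candidate has into that pool; `candidates` and `cites` are hypothetical names, not the authors' implementation.

```python
def citation_count_rerank(candidates, cites):
    """Reorder lexically retrieved papers by citation connectivity.

    candidates: ranked paper ids from BM25 or Boolean retrieval
    cites:      dict mapping paper_id -> set of paper_ids it cites
    Both inputs are assumptions; the paper's exact rule may differ.
    """
    pool = set(candidates)

    def score(pid):
        out_links = len(cites.get(pid, set()) & pool)             # pool papers it cites
        in_links = sum(pid in cites.get(q, set()) for q in pool)  # pool papers citing it
        return out_links + in_links

    return sorted(candidates, key=score, reverse=True)
```

Because `sorted` is stable, candidates with equal citation scores keep their lexical order, so the rerank falls back to the BM25/Boolean ranking on ties.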
Why it matters
This paper reveals a critical limitation of current text embeddings: they fail to capture nuanced research agendas, which undermines the reliability of scientific RAG and vector search systems. It highlights the need for methods that go beyond cosine similarity to capture conceptual relatedness.
Original Abstract
Vector search and retrieval-augmented generation (RAG) rest on the assumption that cosine similarity between text embeddings reflects conceptual relatedness. We measure where this assumption breaks. We build an augmented citation graph over 3.58M scientific papers and partition it via Leiden CPM at two granularities: sub-field (L1) and research-agenda (L2, hierarchical inside each L1). Four state-of-the-art embeddings (Gemini, Qwen3-8B, Qwen3-0.6B, SPECTER2) clear the L1 bar reasonably (45-52% top-10 same-rate) but stop working at L2: only 15-21% of top-10 neighbors share the query's research agenda. In absolute terms, 8 of every 10 retrieved papers are off-agenda. The failure is universal across eight scientific domains and all four models; SPECTER2, despite its citation-based contrastive training, is the weakest. As a diagnostic probe, we test whether the same augmented graph also functions as a retrieval signal: a deliberately simple citation-count rerank reaches 57.7% top-1 L2 on top of LLM-expanded Boolean retrieval and 59.6% on top of plain BM25, on 80 curated agenda queries -- about 9 points above the best cosine retriever (Gemini, 50.6%) and 20 points above BM25 alone (39.3%). The probe isolates a slice of the agenda-matching signal the graph carries but the embeddings miss, connecting recent theoretical limits on single-vector retrieval to a concrete failure mode of scientific RAG.
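To make the headline metric concrete: under a natural reading, the top-k same-rate embeds every paper, takes each paper's k nearest cosine neighbors, and reports the fraction of neighbors that share the query paper's community label (L1 or L2). Below is a minimal sketch under that reading, suitable only for a small in-memory sample (the full 3.58M-paper corpus would need an approximate-nearest-neighbor index); `embeddings` and `labels` are assumed inputs, not the authors' code.

```python
import numpy as np

def topk_same_rate(embeddings, labels, k=10):
    """Fraction of top-k cosine neighbors sharing the query's community.

    embeddings: (n, d) float array, one row per paper
    labels:     length-n array of community ids (L1 or L2 partition)
    """
    labels = np.asarray(labels)
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T                            # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)           # a paper is not its own neighbor
    topk = np.argsort(-sims, axis=1)[:, :k]   # k nearest neighbors per paper
    same = labels[topk] == labels[:, None]    # does each neighbor share the label?
    return float(same.mean())
```

On this metric the abstract's numbers read directly: the same embeddings that score 45-52% at L1 drop to 15-21% at L2, i.e. about 8 of every 10 nearest neighbors sit outside the query's research agenda.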