Beyond Retrieval: A Multitask Benchmark and Model for Code Search
Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, et al.
TLDR
This paper introduces CoREB, a multitask benchmark and a fine-tuned reranker that cover the full code search pipeline, addressing the data contamination, label noise, and degenerate binary relevance of prior evaluations.
Key contributions
- Introduces CoREB, a new multitask benchmark for code retrieval and reranking that addresses data contamination and label noise.
- Evaluates 11 embedding models and 5 rerankers across text-to-code, code-to-text, and code-to-code tasks.
- Reveals that code-specialized embeddings dominate code-to-code retrieval (~2× over general encoders), yet short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10.
- Presents CoREB-Reranker, a fine-tuned model that achieves consistent gains across all three tasks (a minimal retrieve-then-rerank sketch follows this list).
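For readers unfamiliar with the two-stage setup the benchmark evaluates, the sketch below wires a first-stage bi-encoder retriever to a second-stage cross-encoder reranker using the sentence-transformers library. The model names are generic off-the-shelf checkpoints and the three-snippet corpus is illustrative; neither is the CoREB data or reranker.

```python
# Minimal two-stage code search: dense retrieval, then cross-encoder reranking.
# Checkpoints below are common public models, NOT the CoREB-Reranker.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # stage 1: bi-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stage 2: reranker

corpus = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def binary_search(arr, x): ...",
]
query = "function to sum two numbers"  # a text-to-code query

# Stage 1: embed corpus and query, retrieve top-k candidates by cosine similarity.
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)
query_emb = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

# Stage 2: rescore the retrieved (query, code) pairs with the cross-encoder.
pairs = [(query, corpus[h["corpus_id"]]) for h in hits]
scores = reranker.predict(pairs)
for (_, snippet), score in sorted(zip(pairs, scores), key=lambda t: -t[1]):
    print(f"{score:.3f}  {snippet}")
```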
Why it matters
This paper addresses critical gaps in code search evaluation by providing a robust, multitask benchmark and a high-performing reranker. It highlights the challenges of real-world developer queries and the limitations of existing models. The released data and model will drive future research in practical code search systems.
Original Abstract
Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
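All retrieval numbers in the abstract are nDCG@10 computed over graded (non-binary) relevance labels. For reference, DCG@k sums (2^rel − 1) / log2(rank + 1) over the top k results, and nDCG@k divides that by the DCG of the ideal ordering. A plain-Python illustration of the standard formula (not the benchmark's evaluation harness):

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over graded relevance labels, in ranked order."""
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    """DCG normalized by the DCG of the ideal (descending) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded labels (e.g. 0-3) of the top results, in the order the system ranked them:
print(round(ndcg_at_k([3, 0, 2, 1, 0]), 2))  # 0.95
```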