NumColBERT: Non-Intrusive Numeracy Injection for Late-Interaction Retrieval Models
Haruki Fujimaki, Makoto P. Kato
TLDR
NumColBERT improves dense retrieval for numerically conditioned queries by enhancing ColBERT at inference time, without modifying its core late-interaction architecture.
Key contributions
- NumColBERT enhances dense retrieval for numerical queries without modifying the core late-interaction model.
- Introduces a Numerical Gating Mechanism to amplify critical numerical constraints in queries.
- Uses a Numerical Contrastive Learning objective to shape embeddings for numerical magnitudes and units.
- Achieves state-of-the-art performance while preserving existing ColBERT optimizations and deployment ease.
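To make the gating idea concrete, here is a minimal sketch of standard ColBERT MaxSim scoring with per-query-token gate weights folded in. The function name, the toy embeddings, and the specific gate values are illustrative assumptions, not the paper's implementation; the paper only states that the gate amplifies tokens carrying critical numerical constraints.

```python
import numpy as np

def gated_maxsim_score(query_embs, doc_embs, gate_weights):
    """Hypothetical gated MaxSim: ColBERT's usual scoring, with each
    query token's best match reweighted by a gate value.

    query_embs : (Q, d) query token embeddings
    doc_embs   : (D, d) document token embeddings
    gate_weights : (Q,) per-query-token gate (1.0 = standard MaxSim)
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                        # (Q, D) token-level cosine similarities
    best = sim.max(axis=1)               # MaxSim: best doc token per query token
    return float((gate_weights * best).sum())

# Toy example: query token 0 is textual ("revenue"), token 1 is numeric ("billion").
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.8, 0.6]])   # strong textual match, weak numeric
doc_b = np.array([[0.6, 0.8], [0.0, 1.0]])   # weak textual match, strong numeric
uniform = np.ones(2)                          # standard MaxSim
gated = np.array([1.0, 2.0])                  # gate amplifies the numeric token
```

With uniform gates the two documents tie (both score 1.6); amplifying the numeric token's gate makes the document satisfying the numerical constraint win, which is the effect the gating mechanism is designed to produce within unchanged MaxSim scoring.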
Why it matters
This paper tackles the challenge of dense retrieval models struggling with numerical conditions in queries, a common issue in finance and e-commerce. NumColBERT provides a practical, non-intrusive solution that significantly boosts accuracy for numerical search while preserving existing model infrastructure. This offers a maintainable and deployable approach for real-world applications.
Original Abstract
This study addresses the challenge of improving dense retrieval performance for queries containing numerical conditions, such as "companies with more than one billion dollars in R&D expenditure." Although recent research has shown that standard models struggle with numeric information in domains such as finance, e-commerce, and medicine, existing solutions typically decompose queries into textual and numerical components and score them separately. These approaches modify late-interaction retrieval models such as ColBERT and introduce challenges in deployment, latency, and maintainability. To overcome these limitations, we propose NumColBERT, an inference-time non-intrusive method that enhances numerically conditioned retrieval while preserving the original late-interaction mechanism. Because NumColBERT retains the standard ColBERT indexing and MaxSim scoring pipeline, existing optimizations and ecosystem components can be reused directly, facilitating practical deployment. NumColBERT introduces a Numerical Gating Mechanism and a Numerical Contrastive Learning objective to enable numerical conditions to contribute more effectively within standard ColBERT scoring. The gating mechanism amplifies tokens carrying critical numerical constraints while suppressing context-neutral numerical mentions, and the contrastive objective shapes the embedding space to reflect numerical magnitudes, units, and conditions. Experimental results show that NumColBERT substantially outperforms standard fine-tuning baselines and achieves accuracy comparable to or better than prior approaches relying on separate textual and numerical scoring. These findings demonstrate the feasibility of numerically conditioned retrieval with a non-intrusive inference pipeline and present a maintainable solution for real-world deployment.
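The abstract's Numerical Contrastive Learning objective shapes embeddings so that tokens with matching magnitudes and units sit close together. The paper does not spell out the loss; a common choice for this kind of objective is InfoNCE, sketched below under that assumption, with positives sharing the anchor's magnitude/unit and negatives differing. The function name and temperature value are illustrative.

```python
import numpy as np

def numeric_info_nce(anchor, positive, negatives, temperature=0.1):
    """Illustrative InfoNCE loss: pull the positive (same magnitude/unit)
    toward the anchor, push negatives (different magnitude/unit) away.

    anchor, positive : (d,) token embeddings
    negatives        : (N, d) negative token embeddings
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    pos_logit = (a @ p) / temperature            # similarity to the positive
    neg_logits = (n @ a) / temperature           # similarities to negatives, (N,)
    logits = np.concatenate([[pos_logit], neg_logits])
    # cross-entropy with the positive at index 0
    return float(np.log(np.exp(logits).sum()) - pos_logit)

# Toy check: a well-aligned positive yields a much lower loss than a misaligned one.
anchor = np.array([1.0, 0.0])
same_unit = np.array([1.0, 0.0])      # e.g. another "billion"-scale token
diff_unit = np.array([0.0, 1.0])      # e.g. a "million"-scale token
negs = np.array([[0.0, 1.0]])
low = numeric_info_nce(anchor, same_unit, negs)
high = numeric_info_nce(anchor, diff_unit, negs)
```

Minimizing this loss over batches of numerically labeled token pairs would draw same-magnitude, same-unit embeddings together, which is the geometric effect the abstract attributes to the contrastive objective.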