Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider
Tina. J. Jat, T. Ghosh, Karthik Suresh
TLDR
This paper presents a locally-deployed, cost-effective RAG system using LLaMA and arXiv data for Q&A on Electron-Ion Collider scientific literature.
Key contributions
- Developed a local RAG system for Q&A on Electron-Ion Collider (EIC) scientific literature.
- Utilizes an in-house arXiv database and an open-source LLaMA model for answer generation.
- Offers a cost-effective, data-private alternative to cloud-based RAG that is suited to resource-constrained settings.
- Specifically addresses domain-specific queries in experimental nuclear physics.
Why it matters
This paper introduces a locally deployed RAG system for the Electron-Ion Collider that offers a cost-effective, data-private option for domain-specific scientific Q&A. It lets researchers apply language models to sensitive pre-publication data without depending on external cloud services, which improves accessibility and security in nuclear physics research.
Original Abstract
To harness the power of Language Models in answering domain-specific, specialized technical questions, Retrieval Augmented Generation (RAG) is widely used. In this work, we have developed a Q&A application inspired by Retrieval Augmented Generation (RAG), which comprises an in-house database indexed on the arXiv articles related to the Electron-Ion Collider (EIC) experiment, one of the largest international scientific collaborations, and incorporates an open-source LLaMA model for answer generation. This is an extension of its preceding application, which was built on a proprietary model and a cloud-hosted external knowledge base for the EIC experiment. This locally deployed RAG system offers a cost-effective, resource-constrained alternative for building a RAG-assisted Q&A application that answers domain-specific queries in the field of experimental nuclear physics. This set-up facilitates data privacy and avoids sending any pre-publication scientific data and information to the public domain. Future improvements will expand the knowledge base to encompass heterogeneous EIC-related publications and reports and upgrade the application pipeline orchestration to the LangGraph framework.
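The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the passages in `DOCS`, the helper names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), and the prompt wording are all hypothetical. A simple term-frequency cosine similarity stands in for whatever embedding model and vector index the in-house database uses, which the abstract does not specify.

```python
import math
import re
from collections import Counter

# Toy stand-in for the in-house knowledge base. These passages are invented
# for illustration and are NOT taken from the EIC arXiv corpus.
DOCS = [
    {"id": "doc-1", "text": "The detector uses a silicon tracker for charged particle tracking."},
    {"id": "doc-2", "text": "The calorimeter measures the energy of scattered electrons."},
    {"id": "doc-3", "text": "Weekly meeting notes about travel logistics."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=2):
    """Return up to k passages with non-zero similarity, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d["text"]))), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_prompt(query, passages):
    """Assemble the retrieved passages and the question into one prompt."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the context below and cite passage ids.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    question = "How is the energy of scattered electrons measured?"
    hits = retrieve(question, DOCS)
    # In a local deployment, this prompt would be passed to the locally hosted
    # LLaMA model; the generation call is omitted because the serving stack
    # is not described in the abstract.
    print(build_prompt(question, hits))
```

Because retrieval and prompt assembly both run on the local machine, no query text or passage content leaves it, which is the data-privacy property the abstract emphasizes.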