PaperQA2 achieves SOTA performance on RAG-QA Arena's science benchmark

Announcements
By Michael Skarlinski
Published March 5, 2025


We're thrilled to share that FutureHouse's literature research agent (PaperQA2) scores at the top of RAG-QA Arena's science benchmark, more than 10 points higher than any other tool.

PaperQA2 performance on RAG-QA Arena (science benchmark only; excludes RAG-QA's five other categories)

Last year we released PaperQA2, and since then we've been hard at work adding features: rate limits to support users with low inference quotas, a wider variety of supported LLMs, better agentic control via complete and reset tools, richer metadata from OpenAlex, and the ability to query clinical trials. We've been using it heavily at FutureHouse, and with over 7,000 stars and almost 700 forks on GitHub, we're seeing adoption among the research community.

We previously evaluated PaperQA2's performance using LitQA2 and expert evaluations of scientific article summaries, but we hadn't used benchmarks that compare directly against other RAG pipelines. Inspired by Contextual.ai's great summary of different RAG benchmarks, we decided to run PaperQA2 on RAG-QA Arena. The benchmark measures model-determined preference between human-written answers and LLM retrieval system outputs over a large corpus of small documents. The documents span several categories, but for PaperQA2's intended audience the science questions are most relevant: that portion of the benchmark contains 1,404 questions based on 1.7M documents. We used these documents and questions to evaluate PaperQA2.

Our evaluation shows that PaperQA2 achieves state-of-the-art performance on the science portion of RAG-QA Arena, scoring 12.4% higher than the closest competing system. You can replicate these results by following our tutorial on using PaperQA2 with RAG-QA Arena here. The exact methodology from Contextual's prior work wasn't clear (in terms of hyperparameters like retrieval @ k, etc.), so we've done our best to share our own methodology directly.
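For orientation, a single PaperQA2 query from Python looks roughly like the following. This is a minimal sketch based on the package's documented `ask` helper and `Settings` object, not the RAG-QA Arena harness itself; the example question is made up, and option names may vary between releases, so defer to the tutorial for the exact evaluation setup.

```python
# Minimal sketch of querying PaperQA2 over a local folder of papers.
# Assumes the `paperqa` package's `ask` helper and `Settings` object;
# the question and option names here are illustrative and may differ
# across releases. See the tutorial for the full RAG-QA Arena setup.
from paperqa import Settings, ask

answer_response = ask(
    "How are base editors used to correct point mutations?",  # example question
    settings=Settings(
        temperature=0.5,            # illustrative setting
        paper_directory="my_papers",  # folder of local PDFs/texts to index
    ),
)
# The response object contains the formatted answer plus its citations.
print(answer_response)
```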

So why does PaperQA2 do so well? Its RAG approach is designed specifically for high-cost, accurate queries over large corpora of scientific documents. In a nutshell, PaperQA2 has two differentiating components:

  1. Agentic workflows with query expansion: candidate documents are found using LLM tool calls to full-text search engines or search APIs. After a candidate answer is evaluated, the agent LLM can iteratively search for different documents or restructure the answer (a toy sketch of this loop follows the list).
  2. Re-ranking and contextual summarization (RCS): starting from the candidate document chunks identified by the search tools, PaperQA2 uses a two-phase relevance determination. As in most RAG tools, an initial embedding-based semantic ranking is performed between document chunks and an LLM-determined query text. The top chunks are then both re-ranked and transformed into contextual summaries via a mapped LLM query across each chunk, with each prompt carrying document metadata such as citation counts and estimates of journal quality. While expensive, the RCS step allows relevance to be determined using the full power of a frontier model. The contextual summaries (if deemed relevant) are used in the final answer inference rather than the raw document chunks (a toy sketch of RCS also follows the list).
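To make the first component concrete, here is a toy, self-contained sketch of an agentic search loop with query expansion. Everything in it (the two-document corpus, the "LLM" helpers, the search tool) is a stand-in invented for illustration; it shows the control flow, not PaperQA2's implementation.

```python
"""Toy sketch of an agentic retrieval loop with query expansion.
Not PaperQA2's code: the 'LLM' and search tools are stand-ins so the
loop is runnable end to end."""

TOY_CORPUS = {
    "paper-1": "CRISPR base editing corrects point mutations without double-strand breaks.",
    "paper-2": "Prime editing extends base editing to small insertions and deletions.",
}

def search_tool(query: str) -> list[str]:
    # Stand-in for a tool call to a full-text search engine or search API.
    return [doc_id for doc_id, text in TOY_CORPUS.items() if query.lower() in text.lower()]

def agent_llm_expand(question: str, tried: set[str]) -> list[str]:
    # Stand-in for the agent LLM proposing new search queries (query expansion).
    candidates = [w.strip("?.").lower() for w in question.split() if len(w) > 5]
    return list(dict.fromkeys(q for q in candidates if q not in tried))

def agent_llm_is_complete(answer: str) -> bool:
    # Stand-in for the agent deciding to call its 'complete' tool vs. keep searching.
    return bool(answer)

def generate_answer(question: str, doc_ids: list[str]) -> str:
    # Stand-in for answer generation over the retrieved evidence.
    if not doc_ids:
        return ""
    return f"Answer to '{question}' citing {', '.join(sorted(set(doc_ids)))}."

def agent_loop(question: str, max_rounds: int = 3) -> str:
    tried: set[str] = set()
    evidence: list[str] = []
    answer = ""
    for _ in range(max_rounds):
        for query in agent_llm_expand(question, tried):
            tried.add(query)
            evidence.extend(search_tool(query))
        answer = generate_answer(question, evidence)
        if agent_llm_is_complete(answer):
            break  # agent is satisfied; otherwise it searches again or restructures
    return answer

print(agent_loop("How does prime editing differ from base editing?"))
```

The second component, RCS, can be sketched in the same spirit. The embedding, scoring, and summarization functions below are toy stand-ins (bag-of-words "embeddings", a keyword-overlap "LLM"); what matters is the shape of the pipeline: rank chunks by embedding similarity, then map an LLM over the top chunks to produce scored contextual summaries that feed the final answer instead of the raw chunks.

```python
"""Toy sketch of two-phase re-ranking and contextual summarization (RCS).
Not PaperQA2's code: the embedding and 'LLM' calls below are stand-ins."""

from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def llm_contextual_summary(chunk: dict, query: str) -> tuple[str, int]:
    # Stand-in for the mapped LLM call: returns a summary of the chunk in the
    # context of the query plus a relevance score (0-10). In PaperQA2 this prompt
    # also carries metadata such as citation counts and journal quality estimates.
    overlap = len(set(query.lower().split()) & set(chunk["text"].lower().split()))
    score = min(10, overlap * 3)
    summary = f"[{chunk['id']}, {chunk['citations']} citations] {chunk['text'][:80]}"
    return summary, score

def rcs(chunks: list[dict], query: str, top_k: int = 3, cutoff: int = 5) -> list[str]:
    # Phase 1: embedding-based semantic ranking of chunks against the query text.
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c["text"]), query_vec), reverse=True)
    # Phase 2: map an 'LLM' over the top chunks to re-rank and summarize them.
    scored = [llm_contextual_summary(c, query) for c in ranked[:top_k]]
    # Only summaries deemed relevant feed the final answer, not the raw chunks.
    return [s for s, score in sorted(scored, key=lambda pair: -pair[1]) if score >= cutoff]

chunks = [
    {"id": "smith2021", "citations": 412, "text": "Base editors convert C to T without double-strand breaks."},
    {"id": "lee2020", "citations": 97, "text": "A survey of transformer architectures for protein design."},
]
print(rcs(chunks, "How do base editors avoid double-strand breaks?"))
```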
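In a real deployment the stand-ins above would be replaced by an actual embedding model, a frontier LLM, and full-text search over millions of papers; the expensive part is the per-chunk LLM map in phase 2, which is the trade-off PaperQA2 makes to determine relevance with the full power of the model.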

For more in-depth information, check out our engineering blog post or the original PaperQA2 paper.

Check out the open source PaperQA2 here.