Engineering Blog: Journey to superhuman performance on scientific tasks

Research
by 
Michael Skarlinski
Sam Cox
James Braza
Andrew White
Sam Rodriques

We’ve been on a journey to build the best RAG system as judged by accuracy. Forget cost. Forget latency. Here we show the experiments that led to the design of PaperQA2, FutureHouse’s scientific RAG system, which exceeds scientists’ performance on tasks like answering challenging scientific questions, writing review articles, and detecting contradictions in the literature. PaperQA2's high-accuracy design goal leads to an implementation that differs from other commercial RAG systems.

What we found to be important for RAG accuracy:

  • An agentic approach, allowing for iterative query expansion.
  • LLM re-ranking and contextual summarization (RCS), exchanging more compute for higher accuracy.
  • Document citation traversal, for higher retrieval recall beyond keyword search.

What we found to be unimportant for RAG accuracy:

  • (when using RCS) Embedding model choice, hybrid keyword embeddings, chunk size.
  • Structured parsings, though they provided better token efficiency.

We measured the impact of our PaperQA2 design decisions using metrics based on LitQA2, a set of 200 expert-crafted multiple-choice questions. Correct answers require comprehension of intermediate results within the full text of recent scientific papers. The questions are designed to be specific enough that they can only be answered with a single source. Each question has an “Insufficient Information” option, which gives systems an opportunity to be unsure of their answers. Since the answers are multiple choice, we can automatically calculate metrics like precision (the fraction of correct answers over all questions answered, i.e. non-unsure results), accuracy (the fraction of correct answers over all questions asked), and recall (the fraction of answers which correctly reference the matching LitQA2 source paper). We use these metrics to evaluate how changes to our system affect performance.
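
As a concrete reference, these metrics can be computed mechanically from graded results. The sketch below uses illustrative field names rather than the actual LitQA2 schema.

## Sketch: computing LitQA2-style metrics from graded results

from dataclasses import dataclass

@dataclass
class GradedQuestion:
    # Illustrative fields, not the actual LitQA2 schema.
    answered: bool      # False if the system chose "Insufficient Information"
    correct: bool       # True if the selected option matches the ground truth
    cited_source: bool  # True if the answer cites the question's source paper

def litqa2_metrics(results: list[GradedQuestion]) -> dict[str, float]:
    n_total = len(results)
    n_answered = sum(r.answered for r in results)
    n_correct = sum(r.answered and r.correct for r in results)
    n_cited = sum(r.cited_source for r in results)
    return {
        # precision: correct answers over all questions answered (non-unsure)
        "precision": n_correct / n_answered if n_answered else 0.0,
        # accuracy: correct answers over all questions asked
        "accuracy": n_correct / n_total if n_total else 0.0,
        # recall: answers that correctly reference the matching source paper
        "recall": n_cited / n_total if n_total else 0.0,
    }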

Agentic Advantage: PaperQA2, an Agentic RAG system

Retrieval augmented generation (RAG) is a popular solution to reducing hallucination rates and providing contextual information in the prompt during text generation. A standard implementation for RAG begins with a “retrieval” step where a set of parsed documents are ranked in terms of similarity to a user’s query or prompt. A top-ranked subset of the similar documents are then injected into an LLM’s context window which is used to generate a response to the user’s query. RAG systems have flexible implementations and can include features like: semantic lookups using dense or sparse vector similarity via vector databases or traditional text search engines, hierarchical indexing relying on document metadata, re-ranking strategies to improve candidate document selection, or advanced document chunking strategies. Differing needs in RAG applications can warrant a subset of these features being used, and rather than having a user make a choice for each application, an “agent” model can be used to automatically select and utilize features.
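
For orientation, a bare-bones, non-agentic RAG pass looks roughly like the sketch below; embed and complete are placeholders for an embedding model and an answering LLM, not PaperQA2 components.

## Sketch: a minimal, non-agentic RAG pass

import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for an embedding model call; returns one row vector per text."""
    raise NotImplementedError

def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

def vanilla_rag(query: str, chunks: list[str], top_k: int = 5) -> str:
    # Retrieval: rank chunks by cosine similarity to the query embedding.
    chunk_vecs = embed(chunks)
    query_vec = embed([query])[0]
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top_chunks = [chunks[i] for i in np.argsort(sims)[::-1][:top_k]]
    # Generation: inject the top-ranked chunks into the LLM's context window.
    context = "\n\n".join(top_chunks)
    return complete(f"Answer using only this context:\n{context}\n\nQuestion: {query}")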

Agentic RAG systems break components of a RAG system into modular tools, allowing an agent model to make decisions about the optimal way to progress in generating a response. This can include sub-steps like re-writing a user’s initial query to add specificity for retrieval or bolstering a poorly supported answer with more documents after reviewing output. The agent models operate by incrementally generating tool call functions in the context of a guidance prompt, tool descriptions, tool call history, and current state. An agent model’s state is then modified with each tool call, leading to a dynamic and model-driven system.
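
A minimal sketch of such an agent loop is below; the state fields, tool signature, and choose_tool function are illustrative assumptions, not PaperQA2's actual interfaces.

## Sketch: an agentic tool-calling loop

from typing import Callable

# Hypothetical tool signature: a tool reads and mutates the shared state and
# returns a textual observation that is appended to the agent's history.
Tool = Callable[[dict, str], str]

def run_agent(query: str, tools: dict[str, Tool], choose_tool: Callable, max_steps: int = 10) -> str:
    """choose_tool stands in for an LLM call that, given the query, tool
    descriptions, call history, and current state, picks the next tool and its input."""
    state: dict = {"papers": [], "evidence": [], "answer": None}
    history: list[tuple[str, str, str]] = []
    for _ in range(max_steps):
        tool_name, tool_input = choose_tool(query, tools, history, state)
        observation = tools[tool_name](state, tool_input)  # tool call mutates state
        history.append((tool_name, tool_input, observation))
        if state["answer"] is not None:  # e.g. after a "generate answer" tool call
            return state["answer"]
    return "Insufficient information to answer this question."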

PaperQA2 tool schematic. Unlike traditional RAG, the PaperQA2 agent LLM makes decisions about which tools to apply to a query.

PaperQA2 is FutureHouse’s implementation of an agentic RAG system. Because the tools are configurable functions, it can be customized by changing tool prompts at query time. This allows us to perform diverse tasks including summary generation, question answering, and contradiction detection. The system’s agentic design poses the user’s initial query or request to our agent model, which orchestrates function calls between four different tools, shown in the figure above. As a novel tool in PaperQA2, we added a citation traversal tool, which we’ll dive into in a later section. PaperQA2's agentic configuration options can be found on GitHub.

PaperQA2 configurable tool descriptions and options.

We’ve included some descriptive statistics of PaperQA2’s agentic actions averaged across 3 LitQA2 benchmark runs (table below). Here we see that the agent makes more than four tool calls per question on average, indicating that it’s not simply calling the tools in a deterministic order. This is especially apparent given that the citation traversal tool is only used on ~46% of questions. PaperQA2’s ability to self-correct by evaluating the quality of its evidence is reflected in an average of 1.26 searches per question. To capture a more objective measure of performance, we ran an experiment using the three original PaperQA tools (Paper Search, Gather Evidence, Generate Answer), where we compared a hard-coded tool order to an agentic run on LitQA2. We see a significant performance improvement in accuracy and answer recall using an agent-based system, while QA precision is largely unaffected by the agentic workflow.

LitQA2 metrics for PaperQA2 with and without using agentic decision-making.

Part of the performance increase in PaperQA2’s agentic approach can be attributed to better recall through query expansion, i.e. its ability to change the search subject, thereby narrowing or broadening the keywords used in a query. As an example, consider the LitQA2 query:

Q: The cavity above p-hydroxybenzylidene moiety of the chromophore found in mSandy2 is filled by which one of the following rotamers adopted by Leucine found at position 63?
Options: A) tt B) tp C) Insufficient information to answer this question D) pt E) mp

The agent first made a keyword search query, “mSandy2 chromophore p-hydroxybenzylidene Leucine 63 rotamer”, which yielded only one paper and 2 pieces of “evidence” after parsing, chunking, and importance ranking. The agent then decided to do a broader search, “mSandy2 chromophore structure and rotamers”, which yielded 2 additional papers and 4 relevant “evidence” pieces. This query expansion ability is a powerful differentiator for PaperQA2, and while it may be a costly approach, it results in better accuracy.

As we continue to explore the large hyperparameter space available to our search agent, we are optimistic about the ability of agents to select inputs that influence search results in subtle ways, like choosing parsing strategies or varying ranking cutoffs. As a result, our PaperQA2 implementation keeps these features configurable at query time. This flexibility allows us to dive into the effects of hyperparameter choices like parsing strategy, chunk size, tool selection, embedding model, answering model, and ranking cutoffs, as detailed in this report.

LLM Re-ranking and Contextual Summarization (RCS)

PaperQA2’s “Gather Evidence” step is split into two phases. First, the user query is embedded and ranked for similarity against the embedded document chunks currently present in the state (added by the Paper Search tool). Then, the top-k ranked document chunks are used in an embarrassingly parallel, LLM-prompted re-ranking and contextual summarization (RCS) step. Using the user query, citation information, and the chunk content, a prompt is formulated that asks the LLM to output both a relevance evaluation and a summarization of each chunk in the context of the query (see the “Summary/Re-ranking Prompt” on GitHub). The LLM is prompted to score the relevance of the chunk (between 1 and 10) and to provide a summary (<= 300 words). The model’s output score is then used to re-rank the summaries before answering.
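
A minimal sketch of this step is below; the prompt wording, field names, and llm_json helper are illustrative assumptions, not the actual prompt linked above.

## Sketch: parallel re-ranking and contextual summarization (RCS)

import asyncio

RCS_PROMPT = """Excerpt from {citation}:
{chunk}

Question: {query}

Summarize the excerpt as it relates to the question (at most 300 words) and rate
its relevance from 1 to 10. Reply as JSON: {{"summary": "...", "relevance_score": 0}}"""

async def llm_json(prompt: str) -> dict:
    """Placeholder for an async LLM call that returns the parsed JSON reply."""
    raise NotImplementedError

async def gather_evidence_rcs(query: str, ranked_chunks: list[dict], top_k: int = 20) -> list[dict]:
    # Score and summarize the top-k embedding-ranked chunks in parallel.
    candidates = ranked_chunks[:top_k]
    prompts = [
        RCS_PROMPT.format(citation=c["citation"], chunk=c["text"], query=query)
        for c in candidates
    ]
    outputs = await asyncio.gather(*(llm_json(p) for p in prompts))
    summaries = [
        {**c, "summary": o["summary"], "score": o["relevance_score"]}
        for c, o in zip(candidates, outputs)
    ]
    # Re-rank by the LLM's relevance score; the top summaries feed the answer prompt.
    return sorted(summaries, key=lambda s: s["score"], reverse=True)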

This two-phase ranking approach has benefits in a RAG system:

  1. Only the summarized chunks are injected into the final answer prompt, thereby significantly lowering token usage. With a 9,000-character chunk size, each chunk is compressed by a factor of 5.6 on average.
  2. Shortcomings of embedding ranking models or any parsing abnormalities can be corrected by the LLMs during the RCS step. The models can boost poorly ranked chunks or remove strange formatting.
  3. The LLM is given more opportunity to reason by breaking a complex query into subqueries that evaluate a single chunk at a time before answering.

The larger the value of top-k, the more candidate chunks are evaluated by the LLM for relevance. This produces an opportunity to exchange more compute for higher accuracy in PaperQA2.

We can see the RCS impact on LitQA2 performance in the figure below. “No RCS Model” shows a significant drop-off in precision and accuracy relative to our baseline using RCS. Interestingly, we observe a minimum model performance threshold for RCS to be successful; simpler models, like GPT-3.5-Turbo, perform worse than not using RCS at all. We see monotonic increases in accuracy, precision, and recall with top-k depth, i.e. the number of document chunks that are utilized in the RCS step. Thus, the RCS step couples compute and accuracy in PaperQA2: a larger top-k depth uses more compute and increases QA accuracy. For source-specific QA tasks like LitQA2, this effect saturates at top-k depths between 20 and 30, beyond which we no longer see LitQA2 accuracy or precision increases.

LitQA2 performance for different RCS choices in PaperQA2. Top-k refers to the number of document chunks re-ranked by PaperQA2's RCS LLM.

Because the RCS step is the most expensive operation in PaperQA2, we sought methods to reduce the necessary top-k depth, thereby reducing PaperQA2’s cost per query. Having a better initial ranking during the embedding ranking phase is the most direct way to effectively reduce the RCS’s necessary top-k depth. 

To that end, we investigated the effect of different parsings, chunking, and models on PaperQA2’s ranking performance. LitQA2 provides a “key passage” for each question, which the creator has identified as crucial for correctly answering the question being asked. For each question investigated, we used the ranking depth of the document chunk containing the key passage as a metric to evaluate both the embedding ranking and re-ranking success in our experiments. In the charts below, “Key Passage @ X” refers to the percentage of LitQA2 questions in which the correct key passage chunk was ranked at X or below. 
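
For reference, this metric is simple to compute from per-question key passage ranks; in the sketch below, ranks is a hypothetical list of those ranks, not data from our runs.

## Sketch: computing "Key Passage @ X"

def key_passage_at(ranks: list[int], depth: int) -> float:
    """Fraction of questions whose key-passage-containing chunk was ranked at
    `depth` or better. `ranks` holds the 1-indexed rank of that chunk per question,
    taken either after embedding ranking or after RCS re-ranking."""
    return sum(rank <= depth for rank in ranks) / len(ranks)

# e.g. key_passage_at(ranks, 20) corresponds to "Key Passage @ 20"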

We chose a subset of LitQA2 questions in which PaperQA2 successfully found the target papers via an organic agentic search at 3000, 9000, and 15,000 character chunk sizes, totalling 95 questions. We saved the state of those organically-obtained paper chunks for a post-hoc analysis included here. This ensured that we were including realistic distractor and competing paper chunks to measure against our key passage chunks. 

We first varied chunk size while using OpenAI’s text-embedding-3-large model and measured both the average key passage rank and the distribution of ranks, as can be seen in the table below. Both the distribution and the mean ranks imply that middling chunk sizes, ~9,000 characters, provide an optimal balance for embedding rankings. However, at deeper top-k cutoffs (20+), the ranking differences between the 9,000- and 15,000-character chunk sizes seem to disappear.

LitQA2 key passage ranking distribution at different document chunk sizes. Key passage @ X refers to the fraction of LitQA2 questions in which the key passage containing document chunk was ranked at position X or lower.

We then examined the effects of using different parsing algorithms and embedding models, as shown in the following table. Here we examined key passage recall at a fixed chunk size (9,000 characters). We compared VoyageAI’s voyage-large-2-instruct and OpenAI’s text-embedding-3-small/text-embedding-3-large models. Among these, text-embedding-3-large had the best mean ranking, though all of the aforementioned models and parsings converged to >95% recall at a depth of 20. We also appended a keyword similarity embedding to our large OpenAI embeddings, labeled “Hybrid” embeddings below. A simple keyword vector with a dimension of 256 was used, normalized to 1: OpenAI’s tiktoken library encoded each chunk into a count vector, with each token counted in the bucket given by its token ID modulo 256. This method showed a small decrease (improvement) in the mean rank, making it the best of the algorithms tested. However, the hybrid method converges with the others at deeper key passage ranking cutoffs.
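
A sketch of how such a hybrid keyword embedding can be built follows; the specific tiktoken encoding and the unit-norm normalization shown here are illustrative assumptions.

## Sketch: a hybrid keyword + dense embedding

import numpy as np
import tiktoken

# Assumed tiktoken encoding; any tokenizer with stable integer IDs works the same way.
_ENC = tiktoken.get_encoding("cl100k_base")

def keyword_vector(text: str, dim: int = 256) -> np.ndarray:
    """Count-encode a chunk: each token increments the bucket at (token_id % dim)."""
    vec = np.zeros(dim)
    for token_id in _ENC.encode(text):
        vec[token_id % dim] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec  # normalize to 1

def hybrid_embedding(dense_vector: np.ndarray, text: str) -> np.ndarray:
    # Append the sparse keyword vector to the dense embedding (e.g. text-embedding-3-large).
    return np.concatenate([dense_vector, keyword_vector(text)])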

We also examined the effect of parsing on embedding ranking efficacy. We hypothesized that the unconventional whitespace produced by our standard parsing library, PyMuPDF, might cause embedding rankings to do worse in practice than more human-readable parsing libraries like Grobid, a deep-learning-based structured PDF parser for scientific literature. However, the recall differences between the two parsing libraries appear to be negligible. In a trend similar to the chunk size experiments, different embedding models and parsing libraries seemed to have little effect on the final key passage recall at ranking depths of 20+.

LitQA2 key passage ranking distribution with different parsing strategies and embedding models. Key passage @ X refers to the fraction of LitQA2 questions in which the key passage containing document chunk was ranked at position X or lower.

After examining our initial key passage ranking efficacy, we investigated the effect of the re-ranking in the RCS step. We took the recall results from our 9,000-character-chunk hybrid embedding experiment and filtered down to the subset of questions where the key passage was successfully found. The difference in key passage recall before and after the RCS step can be seen in the figure below. We saw that the RCS step was remarkably effective at re-ranking key passages, saturating around a ranking depth of 5.

LitQA2 key passage ranking distribution before and after LLM re-ranking using Claude-Opus. Key passage capture rate shows the percentage of LitQA2 key passage containing chunks which were at a particular Ranking Depth or lower.

Our conclusion was that, for high-performing embedding models, chunk sizes should be in a range of 7-11k characters, and embedding model choice has a relatively small impact on key passage recall at deeper ranking cutoffs (20+). Given the steep recall increase between ranking cutoff depths of 1 and 20, and the importance of ensuring that the key passage makes it to the final context, we encourage use of PaperQA2 with a top-k cutoff > 15, where the RCS step has ample opportunity to correct for any embedding ranking shortfalls. This is a potentially expensive choice, but it aligns with PaperQA2’s design philosophy of exchanging cost for high-quality answers. The RCS step significantly improves key passage ranking, and our data shows that using ~5 sources in the final answer query is sufficient for source-specific QA tasks with PaperQA2.

Citation Traversal: Improving Recall

We observed that, across many of our PaperQA2 experiments, cited recall of the correct LitQA2 paper source correlates strongly with PaperQA2 accuracy (see figure below). As expected, when the appropriate papers are found through retrieval, the questions are more likely to be answered correctly. We hypothesized that using the scientific literature’s inherent citation structure to find new papers would be an effective form of hierarchical indexing and would increase our LitQA2 paper recall. We implemented this technique in a new tool, which we call the citation traversal tool for PaperQA2.

LitQA2 accuracy vs. the percentage of the time that the correct, author-specified DOI was included by PaperQA2 in its answer prompt.

The citation traversal tool traverses one degree of citations, both forward in time (“future citers”) and backward in time (“past references”). This tool enables a fine-grained search around papers containing relevant information. The traversal originates from any paper containing a highly scored contextual summary (RCS scores range from 0 to 10; our minimum threshold was eight, inclusive). The papers corresponding to highly scored summaries are referred to as Dprev in Algorithm 1 (shown below).

To acquire citations, the Semantic Scholar and Crossref APIs are called for past references, and Semantic Scholar is called for future citers. To collect all citations for a given paper, we make one API call per provider per direction, totalling three API calls per paper. Both providers return only partial paper details, meaning that a title or DOI (digital object identifier) is often missing from the response metadata. To merge citations across providers, we perform a best-effort de-duplication using the casefolded title and lowercased DOI. This is shown in the GetCitations procedure in Algorithm 1. Once citations have been acquired, we compute bins of overlap, B.

To filter bins of overlap, a hyperparameter “overlap fraction” α was introduced to compute a threshold overlap θo as a function of the number of source papers (|Dprev|). For example, with an α=⅖ and traversing from six source DOIs, all citations not appearing in at least three source DOIs were discarded. The default overlap fraction we used in data collection was 1/3. See the methods section of our paper for a full distribution of overlaps seen during LitQA2 runs. Our algorithm gathers highly overlapping citations and marks them as new targets for retrieval. 
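
A minimal sketch of this overlap-threshold step is below, assuming the threshold is the ceiling of α times |Dprev| (consistent with the α = ⅖, six-source example above); get_citations stands in for the Semantic Scholar and Crossref lookups.

## Sketch: overlap-filtered citation traversal

import math
from collections import Counter

def get_citations(doi: str) -> set[tuple[str, str]]:
    """Placeholder for the per-paper lookup (Semantic Scholar past references and
    future citers, Crossref past references), returning best-effort
    (casefolded title, lowercased DOI) keys for de-duplication across providers."""
    raise NotImplementedError

def citation_traversal(source_dois: list[str], alpha: float = 1 / 3) -> set[tuple[str, str]]:
    # Count, for each candidate citation, how many source papers it overlaps with.
    overlap: Counter = Counter()
    for doi in source_dois:
        overlap.update(get_citations(doi))
    # Keep candidates appearing in at least ceil(alpha * |D_prev|) source papers;
    # e.g. alpha = 2/5 with six sources gives a threshold of three.
    threshold = math.ceil(alpha * len(source_dois))
    return {paper for paper, count in overlap.items() if count >= threshold}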

PaperQA2's citation traversal algorithm.

Citation traversal increased accuracy, as can be seen in our study with and without the tool in the figure below, though it had no major effect on precision. Citation traversal was also effective at finding a larger fraction of the correct papers, as can be seen in our funnel-based recall flow, which examines LitQA2 paper recall aggregated by each tool stage. Especially near the top of the PaperQA2 funnel, citation traversal increased the number of relevant papers found by the system.

LitQA2 metrics for PaperQA2 with and without the citation traversal tool.
Stepwise LitQA2 DOI recall, i.e. the percentage of LitQA2 questions which found and utilized chunks from the correct source DOI in each phase of the PaperQA2 agentic flow. Search recall refers to obtaining papers via keyword search or citation traversal. Attribution refers to being correctly cited in the final answer.

Parsing: More Structure and Metadata

As previously reported in our WikiCrow work in 2023, PaperQA’s gene article writing performance was limited by the answer model’s ability to retain the correct gene name being referenced in the RCS model’s output. In February of 2024, we generated a new set of gene articles and re-evaluated them internally, characterizing the mistakes we saw from the model. Out of a sample of 40 random articles, we found 9 issues: the most common failure mode (6/9) was conflating gene names between the summary step and the answering step, and the second most common was related to parsing: either misreading tabular data (2/9) or mistaking reference sections for article text (1/9). We decided to build two new features to mitigate these issues: enhanced summary metadata, so the model could pass context from the “Gather Evidence” step to the “Generate Answer” step, and more structured paper parsing using a deep-learning-based PDF parser, Grobid.

To enhance our metadata, we added optional JSON key-value pairs to the structured summary prompt (as seen in the example below). We then re-configured PaperQA2 to keep this metadata as part of the document context so that it is injected into the “Generate Answer” tool as well. Misattribution of genes was substantially reduced this way, as the LLM was made aware of the gene name for each summary chunk. The technique can be applied generally to any metadata that a user would like to capture in the summary step and carry forward as context for the final answer.

## Example summary structured JSON output

{
  "summary": "The KCNE1 protein, a single-span membrane protein..(truncated)",
  "relevance_score": 10,
  "gene_name": "KCNE1"
}
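
As a sketch of the pattern (illustrative only, not PaperQA2's exact context formatting), the extra keys can be folded into each summary's context entry before it reaches the “Generate Answer” prompt:

## Sketch: carrying structured summary metadata into the answer context

import json

def format_context_entry(summary_json: str, citation: str) -> str:
    """Fold one structured summary into a context entry for the answer prompt,
    carrying any extra metadata (e.g. gene_name) forward with the summary."""
    data = json.loads(summary_json)
    extras = {k: v for k, v in data.items() if k not in ("summary", "relevance_score")}
    meta = "; ".join(f"{k}: {v}" for k, v in extras.items())
    prefix = f"{citation} ({meta})" if meta else citation
    return f"{prefix}: {data['summary']}"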

We then built a service to scalably convert PDFs into Grobid parsings, which allowed us to extract each paper section separately and to transform tables into XML for accurate parsing by our downstream LLMs. This extra structure often adds to the verbosity of the parsing relative to PyMuPDF, as can be seen in the table below: Grobid parsings are significantly less efficient per token when split with a simple overlap scheme. These parsings include a large amount of metadata, including per-sentence citation attribution, which bloats the parsing character count by ~52% relative to PyMuPDF.

However, since these parsings are structured, they can be dynamically filtered to a subset of the available metadata. We configured our parsings to produce separate chunks per section rather than sequential overlapping chunks (i.e. one chunk for the abstract, another for the discussion, results, etc.). When a section finished with < 50% of the chunk size remaining, we chose to truncate the chunk rather than continue into the next chunk, as one would with a simple overlap algorithm. This “section” chunking algorithm is more efficient than PyMuPDF parsing, with 44% fewer characters per paper.
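
A sketch of one way to implement this section-aware chunking rule follows; it reflects our reading of the rule above and is not necessarily the exact boundary handling used in PaperQA2.

## Sketch: section-aware chunking

def section_chunks(sections: dict[str, str], chunk_size: int = 9000) -> list[str]:
    """One possible reading of the "section" chunking rule (an assumption, not
    PaperQA2's exact implementation): fill chunks sequentially within each section;
    when a section ends and less than 50% of the chunk size remains, truncate
    (close) the chunk at the section boundary instead of continuing into the
    next section's text."""
    chunks: list[str] = []
    current = ""
    for name, text in sections.items():  # e.g. abstract, results, discussion, ...
        body = f"{name}\n{text}"
        while body:
            space = chunk_size - len(current)
            current, body = current + body[:space], body[space:]
            if len(current) >= chunk_size:
                chunks.append(current)
                current = ""
        if len(current) > chunk_size // 2:  # < 50% of the chunk size remaining
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks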

The Grobid parsings are also more effective at extracting tabular information than PyMuPDF, particularly for complex tables with newlines within entries. A demonstrative example is included below, where, under the “Description Contents” heading, entries contain newlines. In PyMuPDF’s parsing, these newlines are indistinguishable from the default newline separator, which is also used to separate columnar entries. Grobid’s structured XML parsing, by contrast, makes a clean delineation between table cells and rows.

Example table parsed by PyMuPDF and Grobid
## PyMuPDF Table Parsing from doi:10.3390/ijms21155186

Table 1. Tumor promoting effects of CCL20 within the tumor microenvironment.
Cancer Types
Specimens
Factors
Description Contents
References
HCC
Tissue
-
High expression of CCL20 in tumor tissues exacerbates
recurrence rate and survival among patients with HCC.
[36]
HCC
Tissue
BDTT
CCL20 is highly expressed in BDTT and is a poor factor in
HCC prognosis.
[37]
...

## Grobid Table Parsing from doi:10.3390/ijms21155186

{
	"Table": {
		"title": "Tumor promoting effects of CCL20 within the tumor microenvironment.", 
		"content": "<table><tr><td>Cancer Types</td><td>Specimens</td>...
	}
}

After implementing both features, in conjunction with our citation traversal and ranking depth optimizations, we generated a new gene-article evaluation trial with the goal of a direct comparison against existing Wikipedia articles. The evaluations were done exclusively by external contractors, as detailed in our methods section. The evaluators left notes on each statement being reviewed with an issue. Our newly generated articles had 23 “Cited and Incorrect” statements out of 171 valid statements to compare against. We reviewed all 23 issues post-hoc and categorized them. Notably, we saw a major reduction in “gene name conflation” between our February analysis (6 / 40 samples) vs. our external evaluation (2 / 171, p < 0.001). We were unable to find a single example in the 23 issues where a table parsing or reference section caused a hallucination in our data. We believe this demonstrates the effectiveness of structured parsing in information synthesis tasks. 

Interestingly, more detailed parsings do not seem to impact the QA efficacy of our system. When comparing LitQA2 accuracy across parsing strategies, the parsing choice does not appear to affect results, as can be seen below. One reason may be that LitQA2 questions are extracted from the text of articles (or figure captions), not from tables within them. The comparison does show that better parsings are much more token efficient for QA tasks, as Grobid “section” parsings were 44% smaller at no loss in accuracy. We expect Grobid parsing performance to improve on benchmarks that mix table comprehension and article text.

LitQA2 metrics vs. chunk size and parsing algorithm in PaperQA2

Conclusions for Scientific RAG

PaperQA2’s agentic workflow, RCS optimization, citation traversal tool, and better parsing algorithms all enabled its superhuman performance on scientific tasks. We empirically validated that using an agentic, tool-based approach to literature tasks increases accuracy. We showed that the RCS step couples compute and accuracy and provides resilience to embedding and parsing choices. In high-accuracy settings, we demonstrated the need for better recall and showed that citation traversal is an effective additional retrieval method. Finally, we saw that token efficiency can be significantly improved by using structured parsing algorithms. Design choices like allowing agents to perform query expansion, using a deeper RCS top-k ranking, and using structured parsers like Grobid increase cost and query time, but they give PaperQA2 significantly higher accuracy in practice.

Adding more tools and parsers to our agent will open the door to richer data sources (beyond published papers), and continued optimization of our RCS algorithm will help us get to superhuman accuracy in question-answering tasks. PaperQA2's task versatility shows the potential of RAG-based language agents. We’re looking forward to finding new applications that utilize interactions with the scientific literature and the new landscape available to LLM systems. If these projects or conclusions seem interesting to you, please reach out to the FutureHouse team at hello@futurehouse.org.