WikiCrow: Automating Synthesis of Human Scientific Knowledge

Research
December 8th, 2023
Sam Cox, Michael Hammerling, Jakub Lála, Jon Laurent,
Sam Rodriques, Matt Rubashkin, Andrew White


As scientists, we stand on the shoulders of giants. Scientific progress requires curation and synthesis of prior knowledge and experimental results. However, the scientific literature is so expansive that synthesis, the comprehensive combination of ideas and results, is a bottleneck. The ability of large language models to comprehend and summarize natural language will transform science by automating the synthesis of scientific knowledge at scale. Yet current LLMs are limited by hallucinations, lack access to the most up-to-date information, and do not provide reliable references for statements.

Here, we present WikiCrow, an automated system that can synthesize cited Wikipedia-style summaries for technical topics from the scientific literature. WikiCrow is built on top of FutureHouse’s internal LLM agent platform, PaperQA, which, in our testing, achieves state-of-the-art (SOTA) performance on a retrieval-focused version of PubMedQA and other benchmarks, including a new retrieval-first benchmark, LitQA, developed internally to evaluate systems that retrieve full-text PDFs across the entire scientific literature.

As a demonstration of the potential for AI to impact scientific practice, we use WikiCrow to generate draft articles for the 15,616 human protein-coding genes that currently lack Wikipedia articles or have only article stubs. WikiCrow creates each article in about 8 minutes, is much more consistent than human editors at citing its sources, and makes incorrect inferences or statements about 9% of the time, a number that we expect to improve as we mature our systems. WikiCrow will be a foundational tool for the AI Scientists we plan to build in the coming years, and will help us to democratize access to scientific research.

WikiCrow

[Interactive demo: enter a gene name to browse its WikiCrow-generated draft article.]

Background

If you’ve spent time in molecular biology, you have probably encountered the “alphabet soup” problem of genomics. Experiments in genomics uncover lists of genes implicated in a biological process, like MGAT5B and ADGRA3. Researchers turn to tools like Google, UniProt, or Wikipedia to learn more, because the knowledge of 20,000 human genes is far too broad for any single person to hold. However, according to our count, only 3,639 of the 19,255 human protein-coding genes recognized by the HGNC have high-quality (non-stub) summaries on Wikipedia; the other 15,616 lack pages or are incomplete stubs. Often, plenty is known about a gene, but no one has taken the time to write up a summary. This is part of a much broader problem today: scientific knowledge is hard to access, and often locked up in impenetrable technical reports. To find out about genes like MGAT5B and ADGRA3, you’d end up sinking hours into reading the primary literature.

WikiCrow is a first step towards automated synthesis of human scientific knowledge. As a first demonstration, we used WikiCrow to generate drafts of Wikipedia-style articles for all 15,616 of the human protein-coding genes that currently lack articles or have only stubs, using information from full-text articles that we have access to through our academic affiliations. We estimate that this task would have taken an expert human ~60,000 hours in total (the equivalent of about 6.8 years of around-the-clock work). By contrast, WikiCrow wrote all 15,616 articles in a few days (about 8 minutes per article, with 50 instances running in parallel), drawing on 14,819,358 pages from 871,000 scientific papers that it identified as relevant in the literature.

Our articles are still far from perfect. To evaluate WikiCrow, we randomly selected 100 statements from its articles and asked:

  1. Is the statement cited? Is there a nearby citation that is clearly intended to support this statement, and is the citation relevant?
  2. Is the statement correct according to the citation? Does the cited literature contain the information that is presented in the statement being evaluated?

Each statement was thus placed into one of three categories: irrelevant or missing citation; cited and correct; or cited and incorrect. We then repeated the same process for statements from human-written articles.
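
For concreteness, the tallying step amounts to something like the following sketch; the data structure and function names here are our own illustration, not the actual evaluation tooling:

    # Illustrative only: a minimal way to tally the three evaluation outcomes.
    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class EvaluatedStatement:
        text: str
        has_relevant_citation: bool  # question 1: is there a relevant supporting citation?
        supported_by_citation: bool  # question 2: does the cited source actually say this?

    def tally(statements):
        def label(s):
            if not s.has_relevant_citation:
                return "irrelevant or missing citation"
            return "cited and correct" if s.supported_by_citation else "cited and incorrect"
        counts = Counter(label(s) for s in statements)
        return {category: n / len(statements) for category, n in counts.items()}

The headline results of this comparison are as follows: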

As you read WikiCrow articles, you will see incorrect statements about 9% of the time. You may also see repetitive statements or citations that aren’t correct. We expect these errors to become rarer as the underlying models and techniques improve. On the other hand, WikiCrow is much better at providing citations than human authors. Make sure to verify any information you read here before relying on it, and please alert us to any errors you may find. For more technical details, read on.

PaperQA as a Platform for WikiCrow

WikiCrow is built on top of PaperQA, a Retrieval-Augmented Generation (RAG) agent that, in our testing, can answer questions over the scientific literature better than other LLMs and commercial products (see our paper on PaperQA). PaperQA reduces hallucinations, provides context and references showing how an answer was generated, is orders of magnitude faster than humans, and retains accuracy on par with experts.

PaperQA is more than just a search tool; it is an adaptive system that chooses which tools to use based on the question and on its intermediate findings. These tools include (a minimal sketch of the resulting loop follows the list):

  • SEARCH: finding relevant papers in online databases, such as arXiv and PubMed;
  • GATHER_EVIDENCE: parsing and summarizing text from these papers;
  • ANSWER_QUESTION: ranking the relevance of the gathered context and synthesizing information into a final answer.
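
To make this concrete, here is a minimal, hypothetical sketch of how such a tool-using loop can be organized in Python. The names below (run_agent, choose_tool, is_sufficient, and the tool keys) are illustrative assumptions, not the actual PaperQA API:

    # Hypothetical sketch of a tool-using agent loop in the spirit of PaperQA.
    # None of these names are the real PaperQA API; they only illustrate how the
    # agent repeatedly picks a tool until it can produce a satisfactory answer.
    def run_agent(question, llm, tools, max_steps=10):
        papers = []    # candidate papers found so far
        evidence = []  # summarized, question-relevant snippets gathered so far
        for _ in range(max_steps):
            # Ask the LLM which tool to call next, given the current state.
            action, argument = llm.choose_tool(question, papers, evidence, list(tools))
            if action == "SEARCH":
                papers.extend(tools["SEARCH"](argument))  # e.g. revised keywords
            elif action == "GATHER_EVIDENCE":
                evidence.extend(tools["GATHER_EVIDENCE"](question, papers))
            elif action == "ANSWER_QUESTION":
                answer = tools["ANSWER_QUESTION"](question, evidence)
                if llm.is_sufficient(answer):
                    return answer
                # Otherwise keep looping: the agent may search again with new terms.
        return tools["ANSWER_QUESTION"](question, evidence)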

This process is non-linear. For example, if PaperQA sees a paper that uses a different term for a concept, it can go back and search again using the new nomenclature. Compared to a standard RAG pipeline, PaperQA makes four key changes, each of which improved performance in our ablation testing (a sketch of the summarization and search ideas follows the list):

  1. PaperQA breaks the retrieval-augmented generation process down into tools for an AI agent, enabling it to perform multiple searches with different keywords whenever the information at hand isn't enough.
  2. PaperQA employs a map-reduce-inspired approach to summarization: the AI first collects (maps) evidence from a range of sources and then condenses (reduces) this information into an answer. This increases the number of sources that can be considered and lets the LLM form preliminary insights before composing the final answer.
  3. PaperQA uses a hybrid search approach to work across all accessible papers, which number in the hundreds of millions: LLM-assisted keyword search at the corpus level and semantic search at the granular level of pages of text.
  4. PaperQA uses prior-knowledge prompting strategies to draw on the knowledge already embedded in the language model, supplements it with evidence from the scientific literature when needed, and treats the resulting answer as a form of posterior knowledge.
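
As a further illustration of points 2 and 3, here is a hedged sketch of the map-reduce summarization and two-level hybrid search ideas. Every name here (propose_keywords, keyword_search, embed, summarize, synthesize_answer) is a hypothetical stand-in rather than the real implementation:

    # Illustrative sketch (not the real implementation) of two ideas above:
    #  - hybrid search: LLM-proposed keywords at the corpus level, then
    #    semantic (embedding) search over chunks of the retrieved papers;
    #  - map-reduce summarization: each chunk is summarized against the
    #    question (map), then the summaries are combined into one answer (reduce).
    def hybrid_search(question, llm, corpus, embedder, top_papers=20, top_chunks=10):
        keywords = llm.propose_keywords(question)         # corpus-level keyword query
        papers = corpus.keyword_search(keywords, limit=top_papers)
        chunks = [c for p in papers for c in p.split_into_chunks()]
        q_vec = embedder.embed(question)                  # chunk-level semantic search
        scored = sorted(chunks, key=lambda c: embedder.embed(c.text) @ q_vec, reverse=True)
        return scored[:top_chunks]

    def map_reduce_answer(question, llm, chunks):
        # "Map": condense each chunk into question-relevant evidence, independently.
        summaries = [llm.summarize(chunk.text, question) for chunk in chunks]
        # "Reduce": synthesize a single cited answer from the evidence summaries.
        return llm.synthesize_answer(question, summaries)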

Importantly, PaperQA builds upon the unique structure of the scientific literature – its citation graph and its organization into journals and fields. This is only possible thanks to the excellent contributions of the Semantic Scholar team at the Allen Institute for AI, whose API for exploring the citation graph of science is a key feature of PaperQA. We plan to make the full WikiCrow and PaperQA code available on GitHub soon. Until then, the essential aspects of the PaperQA algorithm are available (although you will need access to your own repository of full-text scientific articles), as are the prompts used for WikiCrow.
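
For readers who want to experiment with the citation graph themselves, here is a hedged sketch using the public Semantic Scholar Graph API; the paper ID is a placeholder, and PaperQA's own use of the API is more involved than this:

    # Minimal sketch: list papers that cite a given paper via the Semantic Scholar
    # Graph API (https://api.semanticscholar.org/graph/v1). Illustrative only.
    import requests

    def fetch_citations(paper_id, limit=20):
        url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations"
        resp = requests.get(url, params={"fields": "title,year", "limit": limit})
        resp.raise_for_status()
        return [item["citingPaper"] for item in resp.json().get("data", [])]

    # Example usage with a placeholder identifier:
    # fetch_citations("DOI:<your-paper-doi>")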

Benchmarking PaperQA

In our evaluations, PaperQA outperforms GPT-4, Perplexity, and other LLMs, as well as commercial RAG systems, on several benchmarks. We show excellent performance on two scientific question-answering benchmarks, MedQA-USMLE and PubMedQA Blind; the latter is a modified version of PubMedQA in which the original context passages are removed, so the system must find the papers from which to retrieve the context. Additionally, PaperQA outperforms a range of systems on LitQA, a new benchmark that we developed to validate our performance. LitQA consists of multiple-choice questions that are difficult or impossible to answer accurately without retrieving one or more specific papers, all of which were published after the 2022 training cutoff dates of GPT-4 and Claude 2. Today, LitQA is small, with only 50 questions, because generating and validating these types of questions is extraordinarily time-consuming, but we plan to scale it up in the future. Also note that we performed this testing in October 2023 (except for Gemini Pro, tested in December 2023) and did not try to optimize any of the commercial systems, so it is possible they could be engineered for higher performance, or would perform better if tested today.

WikiCrow Mechanics

For each gene, we carefully prompt the PaperQA agent to collect information from the scientific literature for each essential Wikipedia article section: Structure, Function, Interactions, and Clinical Significance. To develop these prompts, we started with Wikipedia’s existing molecular biology style guide, then made significant changes over several empirical iterations. This highlights the continued importance of prompt engineering and the need for improved alignment strategies.

Afterwards, we use another LLM call to edit these four independently generated sections into a coherent, concise Wikipedia-style article, appending an Overview paragraph to the top while maintaining all citations. The specific prompts used are available. Additionally, we are in conversations with Wikipedia about hosting these articles, and we will continue to make our versions available programmatically; for example, you can use this gsutil command to list all genes available for download: gsutil ls gs://fh-public/wikicrow/
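
To make the generation pipeline concrete, here is a hedged sketch of the per-gene flow described above. The paperqa_agent and editor_llm objects, the ask and edit_into_article calls, and the prompt wording are all illustrative assumptions, not the exact WikiCrow code:

    # Illustrative sketch of the WikiCrow generation flow for a single gene.
    # `paperqa_agent` and `editor_llm` are hypothetical stand-ins for the real components.
    SECTIONS = ["Structure", "Function", "Interactions", "Clinical Significance"]

    def write_gene_article(gene, paperqa_agent, editor_llm):
        # 1) Ask PaperQA for each standard section independently, keeping citations.
        drafts = {
            section: paperqa_agent.ask(
                f"Write the '{section}' section of a Wikipedia-style article "
                f"about the human gene {gene}, citing the primary literature."
            )
            for section in SECTIONS
        }
        # 2) A second LLM pass edits the four sections into one coherent article,
        #    prepends an Overview paragraph, and preserves all citations.
        return editor_llm.edit_into_article(gene, drafts)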

Statements from human-written Wikipedia articles usually failed evaluation because of irrelevant, inappropriate, or absent citation support. We believe this stems from the varying quality of authorship, as well as from the fact that Wikipedia's format does not require every statement to be justified with peer-reviewed articles. Interestingly, statements from WikiCrow-generated articles follow the opposite pattern: the majority of failing statements fail because information was transmitted incorrectly from the cited article. This was typically due to the model's difficulty discerning highly similar gene names (e.g., GSDMD vs. GSDME) or its failure to parse the logic of complex sentences, such as “knockdown of a repressive gene”, a clause containing multiple negatives.

Evaluating the performance of RAG-powered LLMs is a new area of study, and our evaluation strategy has several limitations and challenges, which we highlight here:

  • We do not evaluate absolute statement accuracy: We only evaluate whether statements are cited and whether they are true as cited; we do not evaluate whether statements are objectively accurate. Statements that are accurate but either not cited or incorrectly cited, which are probably more common in human-written Wikipedia articles, are scored as incorrect on either the “properly cited” criterion or the “true as cited” criterion. Trivially correct statements are excluded from evaluation.
  • Evaluation is challenging to blind: WikiCrow-written articles use significantly more references to bolster individual claims, so it is usually easy to tell which articles were written by humans and which were written by WikiCrow in evaluations.
  • Inconsistent citation strategies: Humans use inconsistent citation strategies which require subjective evaluation. For example, we identified several cases of circular references in human-written Wikipedia articles, and we also identified several cases where human articles would cite large database entries like Entrez, rather than primary literature, which were difficult to evaluate. The need to make subjective decisions about whether to exclude such statements raises bias concerns.
  • Sample exclusion: Articles generated both by WikiCrow and by humans often contain trivial statements of fact, which also need to be excluded from evaluation on a subjective basis.

Despite these challenges, we think that our evaluation system is a reasonably accurate reflection of the “ground truth” quality of human-written and WikiCrow-written articles. If you have suggestions about how to improve evaluation, let us know, or consider applying to join our Assessment Team!

Conclusion

We built WikiCrow and PaperQA as foundational tools both for human researchers and for the AI Scientists we are building at FutureHouse. We plan for PaperQA to be one of many tools available to our AI Scientists, aiding in knowledge synthesis, experimental planning, hypothesis generation, and more. Moreover, PaperQA will be part of a closed-loop system, ensuring continuous and informed progression from theory to experimentation.

In addition, we believe that the WikiCrow approach will eventually enable synthesis and curation of all human scientific knowledge, in collaboration with human editors. Some directions we expect to explore include the use of dedicated models that are fine-tuned on Wikipedia edits, and improved alignment strategies to reduce the amount of prompt engineering that is needed for generation of comprehensive and coherent articles for a given topic. In the long run, we even envision a “Super-pedia,” where articles are generated about any topic in real time, on-demand, with the most up-to-date information. If you’re excited to work on this, get in touch.

Interested in using PaperQA or WikiCrow? Fill out the form here.