We are releasing code to train agents that can do challenging multi-step tasks in biology research. We’ve used this code to train agents built on small open-source models to accuracies exceeding those of frontier language models and human PhD researchers, at dramatically lower cost. This marks a major milestone in FutureHouse’s journey to automate scientific discovery with AI.
- Read the preprint: arxiv.org/abs/2412.21154
- See the Aviary environment code: github.com/Future-House/aviary
- See the agent code: github.com/Future-House/ldp
Aviary
The mission of FutureHouse is to automate the process of scientific discovery. We want to accelerate the pace of new discoveries and solve the next greatest challenges in biology: curing diseases, elucidating the genomes of all life, and understanding the human brain. Our hypothesis is that natural language, via large language models, is the glue that connects literature, software, and scientists, and that it will be the key to automating the scientific process.
We’re excited to share a major milestone in our progress toward building agents that use natural language. We previously released LAB-Bench: a set of benchmarks built on real scientific tasks. LAB-Bench is not trivia or textbook knowledge; these are real tasks an AI scientist must solve to do biology research. We measured professional biology researchers on these tasks. Only recently has o1 exceeded human performance on one of the eight tasks, and only with consensus sampling.
We have now built software layers for language models, called environments, that give them access to the same tools as human researchers. Using these environments, and a variety of learning methods, we were able to get open-source language models, at modest compute budgets, to exceed human-level performance on two more of the LAB-Bench tasks: doing scientific literature research and reasoning about DNA constructs.
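To make the idea concrete, here is a minimal sketch of what such an environment could look like for a single literature-research task. This is not the actual Aviary API; the class, tool, and method names below are illustrative assumptions. The key point is that the tools and the task grading live inside the environment, and the agent only ever sees observations and a list of available tools.

```python
# Hypothetical sketch of a single-task environment (not the Aviary API).
# The environment owns the tools and the task; the agent only sees
# observations and the tool schema.

from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """An action emitted by the agent: which tool to call and with what arguments."""
    name: str
    kwargs: dict = field(default_factory=dict)


class LitSearchEnv:
    """Hypothetical environment: answer a question from the scientific literature."""

    def __init__(self, question: str, answer: str):
        self.question = question
        self.answer = answer
        # Tools live in the environment, not in the agent.
        self.tools = {
            "search_papers": self.search_papers,
            "submit_answer": self.submit_answer,
        }

    def reset(self) -> str:
        """Return the first observation shown to the agent."""
        return f"Question: {self.question}. Available tools: {list(self.tools)}"

    def step(self, action: ToolCall) -> tuple[str, float, bool]:
        """Execute the agent's tool call and return (observation, reward, done)."""
        obs = self.tools[action.name](**action.kwargs)
        done = action.name == "submit_answer"
        reward = 1.0 if done and action.kwargs.get("answer") == self.answer else 0.0
        return obs, reward, done

    def search_papers(self, query: str) -> str:
        # A real environment would call a retrieval backend here.
        return f"Top results for '{query}': ..."

    def submit_answer(self, answer: str) -> str:
        return f"Submitted: {answer}"
```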
The environments we’ve created allow agents to do over a dozen tasks, from engineering proteins to summarizing literature to molecular cloning. We focused on three tasks in our preprint and showed we could scale training compute and inference compute to reach superhuman scores. It is also remarkable that Claude 3.5 Sonnet is able to do very well in these environments, despite the many complex tools and the multi-step nature of the tasks!
This is a major milestone for FutureHouse. We now have a repeatable recipe for iterative improvement of agents on scientific tasks. We are also sharing the code for defining environments, agents, new benchmarks, and our process for training.
The combination of natural language and tool use is also found in crows, clever birds that both speak human language and have mastered tools for thriving in urban environments. Hence, our new framework for building agents is called Aviary, and we call the trained agents that use tools Crows. We look forward to seeing what crows the open-source community can build 🐦⬛🐦⬛🐦⬛!
Defining an agent
A major challenge we had to solve first was how to frame the language agent and its environment, and to decide which components are learnable. We chose to group everything that could be optimized into a compute graph that defines the agent; the tools, software, and tasks are then part of the environment. The compute graph takes in an observation from the environment and emits an action - the classic definition of a policy from reinforcement learning. The main difference compared with normal deep learning models was relaxing what the nodes and edges of the compute graph could be. This definition allowed us to test many optimizers and agent types. In the end, the simplest approach worked best: repeatedly attempting tasks and fine-tuning on the successes was the most scalable strategy. At inference time, we found strong gains from majority voting/consensus sampling.
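Below is a minimal sketch of this framing, reusing the hypothetical `ToolCall`/environment interface from the earlier example. The `agent` callable stands in for the compute graph (a policy mapping observations to actions), `collect_sft_data` captures the repeated-attempts-then-fine-tune-on-successes recipe, and `majority_vote` shows inference-time consensus sampling. None of this is the ldp implementation; the names and signatures are assumptions for illustration.

```python
# Sketch of the agent-environment loop plus the two strategies described above.
# `agent(observation) -> ToolCall` stands in for the learnable compute graph.

from collections import Counter


def rollout(agent, env, max_steps: int = 10):
    """Run one episode; return the trajectory and whether it succeeded."""
    trajectory, obs, done, reward = [], env.reset(), False, 0.0
    for _ in range(max_steps):
        action = agent(obs)                    # policy: observation -> action
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory, reward > 0.0


def collect_sft_data(agent, make_env, n_attempts: int = 100):
    """Repeated attempts at the task; keep only successful trajectories for fine-tuning."""
    successes = []
    for _ in range(n_attempts):
        trajectory, solved = rollout(agent, make_env())
        if solved:
            successes.append(trajectory)       # (observation, action) pairs become training data
    return successes


def majority_vote(agent, make_env, n_samples: int = 5):
    """Inference-time consensus: sample several episodes and take the modal answer."""
    answers = []
    for _ in range(n_samples):
        trajectory, _ = rollout(agent, make_env())
        final_action = trajectory[-1][1]       # assumes the last action submits an answer
        answers.append(final_action.kwargs.get("answer"))
    return Counter(answers).most_common(1)[0][0]
```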
We find that having a clear framing of an agent and its learning problem has unlocked many new directions for exploration at FutureHouse.
Building an AI Scientist
This is only a milestone and not the end of our journey at FutureHouse. We have broken our roadmap for automating science into three stages: building environments, building agents that can use those tools (AI assistants), and building true AI scientists that can generate hypotheses and world models. See above, or read more in our masterplan. These AI scientists will operate at the next cognitive level, using the crows we’ve developed to complete open-ended research problems without needing to do the individual steps of cloning DNA or traversing the citation graph. We're still working on a cool name for them. Megacrow or cybercrow or something.
Ensuring model safety
The combination of AI models and synthetic biology has been identified as a potential risk area by organizations like the US and UK AI Safety Institutes. Currently, the point at which physical DNA is actually synthesized is generally recognized as where the risk of a harmful protein or biomacromolecule arises, and where screening for risk should be focused.
This work is based on LAB-Bench, which has been increasingly used as a measure of frontier large language models’ ability to do synthetic biology. For example, the US and UK AI Safety Institutes’ pre-deployment work uses LAB-Bench as one component for evaluating model capabilities in areas of potential risk. LAB-Bench is a “necessary” but not “sufficient” measure of the ability to do synthetic biology. Namely, being able to answer multiple-choice questions about open reading frames or translation of DNA sequences is necessary to do synthetic biology, but it is not sufficient to be capable of engineering proteins or viruses. In the Aviary work, we built agents that could do well on specific single LAB-Bench tasks. These are narrow models that do not generalize between tasks, and they use environments built specifically for those tasks. Therefore, the high LAB-Bench task scores should not be misconstrued as a measure of general ability to do synthetic biology.
Acknowledgements
This work was supported by everyone at FutureHouse, but especially Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White.
Work at FutureHouse is supported by the generosity of Eric and Wendy Schmidt. The results and models reported in this work used compute resources from the National AI Research Resource Pilot, with support from NVIDIA, including NVIDIA’s DGX Cloud product, which includes the NVIDIA AI Enterprise Software Platform.
- Read the preprint: arxiv.org/abs/2412.21154
- See the Aviary environment code: github.com/Future-House/aviary
- See the agent code: github.com/Future-House/ldp