tl;dr: Predictive Biology is a new life science discipline at the intersection of molecular biology & machine learning. Predictive Biology focuses on measuring mutual information between biological entities and argues that predicting the outcome of an unknown experiment is equivalent to understanding a system. The field’s new tools have unlocked previously intractable questions and led to the formation of new institutions. Unlike past life science disciplines, for-profit companies may lead the frontier of this new domain.
Describing someone as a biologist tells you surprisingly little about their skills, day-to-day work, or epistemic principles. Do they study the herding patterns of African elephants during the dry season, or the structural basis for regulation of TGF-beta ligand activity in a dark crystallography room?
Over the past century, biology has arborized into subfields that address distinct problems, mirroring physics and chemistry before it. Many of these subfields are distinct enough that they represent their own intellectual disciplines. Not only do they value different questions, but they approach problems using different cognitive tools. If you describe someone as a molecular biologist, it implies both a set of technical skills manipulating nucleic acids and a bottom-up, reductionist approach to epistemology.
Molecular biology’s historian laureate Horace Freeland Judson captures these cultural and intellectual divisions inimitably:
Molecular biology is no single province, marked off by natural boundaries from the rest of the realm. [...] Molecular biology is [...] a level of analysis, a kit of tools – which is to say it is unified by style as much as content1.
Fields are often born at the confluence of two ancestral disciplines. Molecular Biology emerged from physics and biochemistry. Systems Biology arose at the intersection of genomics and statistical mechanics.
Here, I propose that Predictive Biology is a new field that has emerged in the last five years with roots in molecular biology and machine learning2.
Predictive Biology is focused on inferring the outcomes of future experiments using quantitative models trained on a corpus of past data. Implicitly, Predictive Biologists hypothesize that biological systems contain a large amount of mutual information, so that the present and future state of one system (say, a cell’s shape) can be predicted from a description of another system (say, a cell’s gene expression profile).
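The mutual-information framing can be made concrete with a small sketch. The example below estimates I(X; Y) in bits from discrete observations; the pairing of expression levels and cell shapes is toy data invented purely for illustration.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    pxy = Counter(pairs)
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# Toy data: expression level of a gene vs. an observed cell shape.
# "High" expression tends to co-occur with "elongated" cells.
obs = [("high", "elongated")] * 40 + [("high", "round")] * 10 \
    + [("low", "elongated")] * 10 + [("low", "round")] * 40
print(mutual_information(obs))  # > 0: shape is partially predictable from expression
```

If shape were independent of expression, the estimate would collapse to zero; the Predictive Biologist’s bet is that many such pairs of measurements carry far more shared information than this toy.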
Where Molecular Biology is often reductionist, Predictive Biology is emergent, assuming that many complex biological phenomena cannot be explained absent the interactions of many components. Where Systems Biology argues that mapping the individual interactions within a system will yield understanding, Predictive Biology counters that predicting the future state of a system is understanding. Where Molecular Biology was enabled by nucleic acid biochemistry and Systems Biology by early computers, Predictive Biology is built on artificial intelligence tools that learn to explain biology from data.
Predictive Biology is not superior or inferior to the fields that came before it, but it is distinct. These distinctions have enabled scientists to ask new questions, build new institutions, and found new companies. For potentially the first time in biology’s history, this new frontier may be pioneered largely in for-profit ventures rather than traditional academic institutions.
I believe that these approaches will shape the future of biology, motivating an exploration of Predictive Biology’s origins, interests, and open problems.
Epistemic lineage
Molecular Biology & the beginning of modernity
Modern biomedicine traces its roots to the intersection of chemistry and physiology that birthed biochemistry. Biochemistry might be the first subfield dedicated to the study of living systems as complex but fundamentally physical entities, rather than “vital” elements with a wholly different set of governing principles. Beginning roughly in the 1930s, the discipline of Molecular Biology emerged from biochemistry as a distinct field. The roots of almost all modern biotechnology firms can be traced back to Molecular Biology in one form or another.
Molecular Biology is famously challenging to define. Francis Crick, the co-discoverer of DNA’s structure, once quipped:
Molecular Biology can be defined as anything that interests molecular biologists.
Alongside his clearer definition:
[Molecular Biology] is concerned with the very large, long-chain biological molecules – the nucleic acids and proteins and their synthesis. Biologically, this means genes and their replication and expression, genes and the gene products.
Molecular Biology is defined by a fundamentally reductionist approach to explaining living systems. Practitioners ask questions about the function of individual molecules and conversely, the molecules that explain a biological process.
Implicit in these questions is an underlying hypothesis – most molecules have a small number of functions, and most functions are controlled by a small number of molecules. For the reductionist approach to yield fruit, this hypothesis must hold true in at least some cases.
While it may seem overly simplistic, it’s amazing how far reductionism was able to take us! The reductionist hypothesis was sufficient to explain the molecular mechanisms of heredity and information propagation that compose the Central Dogma – DNA synthesis, transcription, and translation. Likewise, a large fraction of our knowledge about cell communication, organismal development, and pathobiology arose from picking a molecule, breaking it, and interpreting its role based on what happened.
Molecular Biology favored the reductionist approach as much by necessity as from a desire for epistemic parsimony. The technology available to early Molecular Biologists was still nascent. Fishing even a single protein out of the cytoplasmic soup of life was challenging enough!
Sequencing a single gene or protein was a years-long effort, worthy of a doctoral thesis. Interrogating the interactions of many genes or their products was intractable. Even if these interactions could be measured, interpreting their meaning would have presented considerable challenges. Biologists typically analyzed their data using the “eyeball test” to observe binary phenotypes, or by manual computation with pen and paper3.
Advances in both measurement and computation allowed a subsequent generation of biologists to begin probing at the phenomena that resist explanation by a handful of molecules.
Systems Biology & the limits of reductionism
Progress depends on the interplay of techniques, discoveries, and ideas, probably in that order of decreasing importance – Sydney Brenner4
Systems Biology is perhaps even more challenging to define than Molecular Biology. Historically, there is considerable tension between the two fields, with Sydney Brenner himself leading some critiques of early systems biologists5.
The largest contrast between the field and its predecessor is that Systems Biology is focused on emergent properties of complex biological systems that can’t be captured with reductionist experimental methods. Human biology provides a motivating example for why this approach is attractive.
Our bodies are absurdly complex, but there are only ~20,000 human genes. The basic idea of one gene mapping to one function breaks down quickly when you realize that there are far, far more functions than there are discrete genes! Clearly, there are interactions among these molecules that are greater than the sum of their parts.
Until the mid 1990s, biologists had little choice but to ignore these interactions. Even if you wanted to explore the non-linear logic of genes X, Y, and Z as they interact, the tools weren’t available to do so in a practical way. Automated DNA sequencing and synthesis sparked systems biology by providing the first tools to measure many molecules at the same time. Genomic, transcriptomic, and proteomic tools that emerged in this era allowed researchers to measure the sequences and abundance of all the genes in an organism simultaneously.
Systems Biologists try to understand systems by taking these unbiased data and building minimal models of a behavior of interest. If we imagine studying the cell cycle, a systems biologist might try to create a differential equation incorporating the abundances of many cell cycle genes to explain cell behavior. Parsimony and simplicity are often more important goals for these models than predictive performance. Systems Biologists hope to learn the mechanism of a complex process in terms of simple rules that can be written down on a napkin.
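A napkin-sized model of this kind can be sketched in a few lines. The example below integrates a hypothetical two-component negative-feedback loop (a component X that activates its own repressor Y) with forward Euler; all rate constants are invented for illustration, not drawn from any real cell cycle study.

```python
def simulate(steps=20_000, dt=0.01):
    """Forward-Euler integration of a toy activator-repressor feedback loop.

    dx/dt = a / (1 + y**n) - x   (X is repressed by Y)
    dy/dt = b * x - y            (Y is activated by X)
    All parameters are illustrative, not measured values.
    """
    a, b, n = 5.0, 1.0, 2.0
    x, y = 0.1, 0.1
    trajectory = []
    for _ in range(steps):
        dx = a / (1 + y ** n) - x  # compute both derivatives before updating
        dy = b * x - y
        x += dx * dt
        y += dy * dt
        trajectory.append((x, y))
    return trajectory

traj = simulate()
print(traj[-1])  # settles toward a steady state where repression balances production
```

The appeal for a Systems Biologist is that the steady state has a closed form: setting both derivatives to zero with b = 1 gives y^3 + y = a, so for a = 5 the system settles near x = y ≈ 1.52, the kind of result that fits on a napkin.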
One way to frame the long-term direction of the field is in terms of a causal graph. If we imagine all the nodes in a graph as biological molecules, systems biologists hope to measure and annotate all of the edges between nodes. By quantifying all these connections, Systems Biologists hope that one day we’ll be able to design systems from scratch in a sister field known as synthetic biology.
Predictive Biology & embracing emergence
The tools of systems biology have unfortunately failed to scale beyond the simplest interactions between a few molecules. There are few differential equations that can predict complex cellular behaviors like development, immunity, or drug responses with meaningful fidelity. While noble in aspiration, the program has proven difficult in practice: biologists have struggled to assemble a stack of simple rules at the micro level large enough to explain dramatic, macroscopic biology.
Predictive Biology defines prediction as the core task of a biological study, rather than cataloging the functions and relationships of molecules. Implicitly, both molecular and systems biology attempt to build from these cataloging primitives to the task of prediction. If we know the function of a gene and its relationships to all others, we can hopefully infer what will happen if we activate or repress it. Predictive Biologists are willing to eschew the intermediary catalogs in pursuit of the understanding that arises from predictive power.
Phrased differently, Predictive Biologists are more concerned with measuring the mutual information between two biological phenomena than they are with measuring direct causality. Where Molecular Biology takes inspiration from the epistemology of classical physics, Predictive Biology borrows the cognitive tools of computer science & information theory.
This approach has only been made possible by the advent of modern machine learning (ML) methods. Until roughly the 1990s, it was practically challenging to learn models from large, complex datasets. Increases in computational power thanks to Moore’s law and algorithmic improvements made performant models more accessible around this time.
This first generation of models allowed researchers to extract more insights from emerging high throughput experiments, but largely could not predict the outcomes of experiments based on their inputs alone. Early DNA sequence models allowed researchers to search for and align similar sequences, but could not predict the effect of a previously unobserved mutation6. Simple models of gene expression could infer cell types or cancer outcomes, but could not predict the effect of inhibiting a gene on cell functions7.
If ML has been around since the 1990s, why has Predictive Biology only arisen in this decade? Computational constraints prevented early models from capturing sufficient biological context, be that a long DNA sequence or high-resolution microscopy image. Absent this context, models were limited to making relatively local predictions, hindering applications to the most complex problems in biology.
Classical biochemistry offers an analogy. Linus Pauling and Max Perutz solved biochemical structures using precise, physical models of the underlying atoms. These tools were capable of revealing secondary structures like the protein alpha-helix and the double-helix of DNA, but failed to predict the more complex tertiary structures of proteins that required simulation of physical properties at a larger scale8.
Deep representation learning tools enabled by GPU computing broke through this second barrier in roughly the 2010s. It’s now possible for researchers to learn models that capture a rich input context – long sequences of life’s code, thousands of expression profiles and the covariates of paired drug treatments, images capturing hundreds of cells across a half-dozen different phenotypic dimensions.
By capturing a more detailed portrait of biological systems, a second generation of Predictive Biology models enable in silico hypothesis testing. In addition to extracting more insights from experiments performed in the world of atoms, these models allow researchers to perform many experiments in the world of bits.
These capabilities change both the questions Predictive Biologists explore and the experimental approaches they use to render new truths from a range of latent possibilities.
Unlocking larger questions
Biology is rife with hypothesis spaces that are too large to ever search exhaustively. Testing all possible 100bp DNA sequences for enhancer activity – the ability to promote expression of a gene – would require 4^100 ≈ 1.6 × 10^60 experiments. Testing even just all pairwise combinations of gene perturbations in a simple cell line would require C(20,000, 2) ≈ 2 × 10^8 experiments.
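These back-of-the-envelope counts are easy to check directly:

```python
import math

# All possible 100 bp DNA sequences (4 bases per position).
enhancer_space = 4 ** 100
print(f"{enhancer_space:.1e}")  # ~1.6e60 candidate sequences

# All unordered pairs of human genes for a 2-gene perturbation screen.
pairwise_space = math.comb(20_000, 2)
print(f"{pairwise_space:,}")  # 199,990,000 pairs, roughly 2e8
```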
The traditional tools of molecular and cell biology are insufficient to explore all of these possibilities by many, many orders of magnitude. Simple questions like “What is the strongest possible enhancer for the expression of a gene?” or “What pairs of genes are essential for a cell to divide?” are surprisingly inaccessible.
Molecular Biology and its immediate descendants have made progress in the face of these daunting numbers through local searches. Given that the full space of hypotheses is too large to search, researchers use their intuitions and prior knowledge to guess at which hypotheses are the most fruitful to test.
Naturally, this leads researchers to explore hypotheses that are, in an abstract sense, “close” to our existing knowledge. Perhaps we can’t test every 100bp DNA sequence for enhancer activity, but if we know several strong enhancers at about that size, a clever molecular biologist is likely to try testing mutants initialized from those promising starting points with a reasonable chance of success.
The very best researchers have a taste that allows them to guess correctly which hypotheses will be fruitful further from our prior knowledge. I was once taught that researchers do not actually improve in their analytical skills beyond the journeyman stage, but merely get better at selecting which hypotheses to test. However, if the space of known strong enhancers is actually quite far from the global optimum, even a tasteful Molecular Biologist is unlikely to find any sequence that comes close to the true strongest enhancer.
Predictive Biology models allow researchers to take a different approach. Rather than using intuitions to navigate a local hypothesis space, researchers can focus on gathering data to train models that enable a global search.
The experiments to do so might look quite different from those a traditional molecular or systems biologist would employ. Speaking loosely, a Predictive Biologist might allocate more of an experimental budget to gathering diverse data that spans the range of possibilities within a hypothesis space, in contrast to the Molecular Biologist above, who would take a greedy approach and focus on testing hypotheses close to the frontier of current knowledge9.
Picking up our example of the 100bp enhancer sequence, a Predictive Biologist might run an experiment to test the activity of thousands of random sequences to promote gene expression, then train a model to predict the activity from the sequence directly. They might then use this in silico model to search for optimal sequences across the full range of possibilities, predicting the global optimum. Using these tools, it’s quite possible the Predictive Biologist could find new, potent sequences far from the range of those previously known. While this example is stylized, real world experiments to design new proteins have achieved just such results10.
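A stylized version of this workflow fits in a short script. In the sketch below, a hidden ground-truth function stands in for the wet-lab assay, a simple per-position additive model is trained on random sequences, and the model is then searched globally in silico. Every name and number is invented for illustration, and the sequence is shortened to 20 bp to keep the toy fast.

```python
import random

random.seed(0)

BASES = "ACGT"
LENGTH = 20  # shortened from 100 bp so the toy runs instantly

# Hidden ground truth standing in for the wet-lab assay: each base at each
# position contributes additively to enhancer activity. All values invented.
TRUE_EFFECT = {(i, b): random.gauss(0, 1) for i in range(LENGTH) for b in BASES}

def assay(seq):
    """Pretend measurement of enhancer activity, with experimental noise."""
    return sum(TRUE_EFFECT[(i, b)] for i, b in enumerate(seq)) + random.gauss(0, 0.1)

# 1. Gather diverse data: assay a few thousand random sequences.
data = []
for _ in range(3000):
    seq = "".join(random.choice(BASES) for _ in range(LENGTH))
    data.append((seq, assay(seq)))

# 2. Train a simple additive model: mean observed activity per (position, base).
totals, counts = {}, {}
for seq, activity in data:
    for i, b in enumerate(seq):
        totals[(i, b)] = totals.get((i, b), 0.0) + activity
        counts[(i, b)] = counts.get((i, b), 0) + 1
model = {key: totals[key] / counts[key] for key in totals}

# 3. Global in silico search: under an additive model, the optimum is simply
# the best-scoring base at every position.
designed = "".join(max(BASES, key=lambda b: model[(i, b)]) for i in range(LENGTH))

best_observed = max(data, key=lambda d: d[1])
print("best assayed activity:", round(best_observed[1], 2))
print("designed activity:    ", round(assay(designed), 2))
```

In this toy, the designed sequence typically scores well above any sequence in the training set: the model lets the search escape the neighborhood of what was actually measured, which is the essence of the approach described above.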
Creating new institutions
Disciplines beget institutions in their image.
Molecular Biology led to the creation of the MRC Laboratory of Molecular Biology11, the Cold Spring Harbor Laboratory, and the original four horsemen of biotech – Genentech, Biogen, Genzyme, and Amgen.
Systems Biology spawned the Broad Institute, UW Genome Sciences12, Illumina, Millennium Pharmaceuticals13, and Myriad Genetics.
Predictive Biology’s institutions are still being rendered. Previous disciplines often germinated in academic centers, only then giving rise to commercial firms. Predictive Biology may offer an inverse example.
Few academic organizations are configured to explore this intersection today, but new institutes like Arc and the Schmidt Center offer examples of where the future may blossom. By contrast, a large number of techbio firms have already emerged across diagnostics (Freenome, GRAIL) and therapeutics (BigHat, Dyno, Enveda, Exscientia, Generate, Recursion, Xaira).
Growth in the private sector outpacing traditional academic environments may reflect the distinct resource requirements of Predictive Biology. Unlike Molecular Biology problems that can often be addressed by a single investigator with a modest budget, Predictive Biology is most productive when data can be generated at scale and compute is abundant.
These conditions are often easier to achieve in a for-profit endeavor. Predictive Biology has the potential to be the first biological discipline truly driven by industrial rather than academic scientists14.
Coda
I feel privileged to be living through a phase transition in my field. From the dawn of early biotech, scientists have dreamed of manipulating biology to craft a better world. We have extended lives & grown wonders once difficult to imagine, but we have yet to tame disease or design our environment.
Even the simplest cell is more complex than our most sophisticated computers. There are far more layers of abstraction than a human mind can conceive. Predictive Biology’s promise is that perhaps we need not be limited by the human mind’s ability to connect nodes on a causal graph, but rather by our ability to observe patterns sufficient to guide our search and our will to do so with vigor.
Footnotes

1. From The Eighth Day of Creation.
2. Predictive Biology has previously been used to describe related but distinct ideas by others. Forgive me for redefining the phrase here. Prior uses of Predictive Biology as a noun include: Liu 2005, Lopatkin 2020, Covert 2021. I believe each of these uses is distinct from the definition provided here.
3. The epochal paper from Luria & Delbrück that established the random nature of genetic mutations famously employed a simplified statistical test to “simplify the calculation sufficiently to permit numerical computation.” They were computing by hand!
4. See a wonderful eulogy from Brenner’s former postdoc and my own valued mentor, Cynthia Kenyon.
5. See for example: Brenner 2010.
6. Hidden Markov Models were one of the first popular machine learning methods used to model DNA sequences. See Sean Eddy’s excellent contemporaneous review for details.
7. See an early example of how simple ML models can stratify cancer patients from Todd Golub’s group.
8. Descriptions of these models and their limitations are captured in Freeland Judson’s aforementioned opus, The Eighth Day of Creation.
9. This difference often elicits critique of Predictive Biology from predecessor disciplines that deride this form of experimentation as a “fishing expedition.”
10. See (1) protein binders designed with RFDiffusion that are qualitatively distinct from known binders, (2) novel proteins designed with Chroma models that are distinct from simple compositions of known domains, and (3) the demonstration that ESM3 was able to find a functional green fluorescent protein (esmGFP) about as distant from known proteins as other, new proteins discovered in nature. This design does appear to have high local homology to known proteins, but the combination of these local regions is novel.
11. See The Eighth Day of Creation and Gene Machine for a history and analysis of the LMB’s pivotal role in the history of modern biology.
12. See Luke Timmerman’s biography of Leroy Hood, Hood, for an excellent history of the department and its emergence as a Systems Biology institution.
13. See Wulf & Waggoner 2010 for a case study on Millennium. Thank you to Chloe Hsu for introducing me to this series.
14. Physics underwent a similar transition from distributed problems with modest resource requirements to centralized problems with high barriers to entry in the mid-twentieth century. The advent of nuclear and particle physics drove the creation of large consortia to continue advancing the science. The same forces may lead Predictive Biology to concentrate within a small number of well-resourced institutions where agglomeration effects are pronounced.
Interesting and thought-provoking. I agree with the author that one of the key distinguishing questions of predictive biology is:
Can the outcome of an experiment Y be predicted from observable features X?
However, if this is the question that drives Predictive Biologists, then the next statement cannot be true:
"Predictive Biologists are more concerned with measuring the mutual information between two biological phenomena than they are with measuring direct causality."
Please let me explain why.
If I have two molecules A and B that have high mutual information, and I perform 2 experiments where I separately perturb A and B, there are four potential outcomes:
1. A changes when B is perturbed, but B does not change when A is perturbed.
2. B changes when A is perturbed, but A does not change when B is perturbed.
3. A does not change when B is perturbed, and B does not change when A is perturbed.
4. A changes when B is perturbed, and B changes when A is perturbed.
I think you would agree that predictions based on mutual information alone cannot distinguish among these 4 outcomes. But I would claim that predictions based on combining mutual information with causal information can.
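This asymmetry can be made concrete with a toy simulation. In the sketch below (a hypothetical system invented for illustration, where A causes B), A and B share high mutual information, yet only perturbations of A propagate (outcome 2 in the list above).

```python
import random

random.seed(1)

def sample_system(intervene_a=None, intervene_b=None, n=10_000):
    """Generate (A, B) pairs from a toy system where A causes B: B = A + noise.

    `intervene_a` / `intervene_b` clamp a variable to a fixed value,
    mimicking a perturbation experiment on that molecule alone.
    """
    pairs = []
    for _ in range(n):
        a = random.gauss(0, 1) if intervene_a is None else intervene_a
        b = a + random.gauss(0, 0.1) if intervene_b is None else intervene_b
        pairs.append((a, b))
    return pairs

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Observationally, A and B are tightly coupled (high mutual information):
coupling = mean(abs(a - b) for a, b in sample_system())
print(round(coupling, 2))    # ~0.08: B tracks A closely

# Perturbing A propagates downstream to B...
b_under_a5 = mean(b for _, b in sample_system(intervene_a=5.0))
print(round(b_under_a5, 2))  # ~5.0

# ...but perturbing B leaves A untouched: nothing flows upstream.
a_under_b5 = mean(a for a, _ in sample_system(intervene_b=5.0))
print(round(a_under_b5, 2))  # ~0.0 (no upstream effect)
```

Mutual information estimated from the observational data is identical whichever variable causes the other, so it cannot, on its own, tell these experimental outcomes apart.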
What is causal information? It turns out the systems biology wiring diagrams assembled from those arduously obtained molecular biology experiments provide precisely the causal assumptions needed to distinguish among the 4 potential outcomes.
In other words, without the causal assumptions encoded in those systems biology models, data-driven machine learning alone is insufficient to succeed in predicting the outcome of an unknown experiment.
Therefore, I would suggest that predicting the outcome of an unknown experiment is fundamentally a causal estimation problem, not a machine learning prediction problem.
Do you think there are any limits to predictive biology (i.e. subfields of molecular biology that are fundamentally not tractable to this approach)? Similarly, do you have any predictions for subfields that will soon be transformed by predictive biology but haven't been yet?