Generative and Synthetic Genomics

Overview

The Generative and Synthetic Genomics Programme combines large-scale data generation and artificial intelligence to lay the foundations for predictive and programmable molecular biology and the routine synthesis and engineering of genomes. Our goal is to generate the data and models that will allow biology to be engineered as easily as software, electronics and cars, and to greatly accelerate the understanding of genomes and the development of therapeutics.

Research Focus

Our initial focus is on phenotypes directly encoded by DNA and RNA sequence: the properties and regulation of proteins and RNAs. We see these as tractable problems that will lay the foundations for understanding higher-level phenotypes, including generative models for engineering gene circuits and pathways, cells, and ultimately tissues and organisms.

The goal is to achieve a level of predictive performance such that the effects of any individual perturbation (such as a single DNA base change) or combination of perturbations (such as many DNA changes) can be accurately predicted and, moreover, the properties, activities, regulation and expression of proteins and RNAs can be designed de novo. We seek not only to accurately predict and engineer, but also to understand how, mechanistically, this prediction and engineering works.

To achieve our initial goals, our research will focus on three core areas:

  1. Single nucleotide resolution genetics
    – to understand, predict and engineer the effects of editing every nucleotide in every genome.
  2. Generative Biology
    – drawing upon Sanger’s capabilities in data generation to produce the foundational datasets and computational models to make molecular biology predictive and programmable.
  3. Synthetic Genomics
    – developing the technologies to write and edit genomes at scale and speed.

The world’s first programme for Generative and Synthetic Genomics

The Problem

Genomics and molecular biology have given us an extensive description of the components of biological systems and a reasonable conceptual understanding of how they function and how their failure causes disease. However, despite this phenomenal progress, we still largely struggle to predict how biological systems respond to perturbations (even simple ones, like single DNA base changes), and engineering biology remains extremely slow and difficult.

A key problem is that, despite decades of research, the core ‘encoding’ problems of molecular biology – how sequence determines the properties and regulation of proteins and RNAs – remain practically unsolved. That these foundational problems remain unresolved suggests a new approach is required.

Our Solution

The Generative and Synthetic Genomics Programme at the Wellcome Sanger Institute aims to practically solve these problems, drawing upon our existing capabilities and strengths in genomic data generation and large-scale data analysis.

Laboratory-based experimental genomic and biophysical data generation

The revolution that has taken place in DNA sequencing, synthesis and editing technologies now allows millions of highly quantitative perturbation experiments to be performed in parallel. Put simply, using synthesis/selection/sequencing experiments, a single PhD student today can perform more perturbations of a protein, pathway or cell than the entire global research effort could a decade ago.

Machine Learning and Artificial Intelligence to create generative and predictive models

The current paradigm shift in the application and abilities of machine learning and artificial intelligence (AI) now allows highly predictive and generative models to be developed for complex tasks, when sufficient data exist. Well-known examples include generative large language models such as ChatGPT and the landmark achievement of accurate protein structure prediction from multiple sequence alignments by AlphaFold. Put simply, when data of sufficient scale and diversity exist, highly predictive and remarkably generative AI models can be developed.

We believe harnessing the power of these two technologies will enable us to solve the core encoding problems of molecular biology practically and lay the foundations for rapid and cheap biological engineering.

Our Approach

The pioneering knowledge and experience in computation and experimentation needed to deliver our ambitious research goals have been dispersed across the world. By drawing together a critical mass of leading computational and experimental groups from around the globe at the Wellcome Sanger Institute, the Generative and Synthetic Genomics Programme is creating the world’s first synergistic and focussed programme to predict and engineer biology.

Our approach will be to combine overwhelming experimental data generation – primarily massively parallel perturbation experiments using DNA synthesis or editing, selection and sequencing – with artificial intelligence to build predictive, mechanistic and generative models of biology. The programme will use sequencing-based assays to quantify in parallel the biophysical properties of hundreds of thousands of protein, RNA and DNA variants. Applied at scale, these approaches will allow us to generate reference atlases of mutational effects for clinical genetics.
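To illustrate how a synthesis, selection and sequencing experiment can yield a quantitative readout per variant, the following is a minimal sketch. It assumes read counts for each variant before and after selection and uses a log2 enrichment ratio normalised to the wild-type sequence. The function name, the pseudocount and the example counts are hypothetical and are not taken from the programme's own analysis pipeline.

```python
import math

def enrichment_scores(pre_counts, post_counts, wild_type, pseudocount=0.5):
    """Return log2 enrichment of each variant relative to wild type.

    pre_counts / post_counts: dicts mapping variant name -> read count
    before and after selection (hypothetical input format).
    """
    def log_ratio(variant):
        # The pseudocount avoids division by zero and log(0) for dropouts.
        pre = pre_counts[variant] + pseudocount
        post = post_counts.get(variant, 0) + pseudocount
        return math.log2(post / pre)

    wt = log_ratio(wild_type)
    # Subtracting the wild-type ratio cancels differences in sequencing depth.
    return {variant: log_ratio(variant) - wt for variant in pre_counts}

# Hypothetical counts for three variants
pre = {"WT": 1000, "A10G": 1000, "C25T": 1000}
post = {"WT": 2000, "A10G": 2000, "C25T": 250}
scores = enrichment_scores(pre, post, wild_type="WT")
```

In this toy example a variant that behaves like wild type scores zero, and a variant depleted by selection scores below zero. Real analyses would also need replicate handling and error estimates, which are omitted here.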

More fundamentally, by generating datasets of sufficient size and diversity, the Programme will be able to directly tackle the fundamental problems of molecular biology by applying artificial intelligence. Our initial focus will be on DNA-sequence-to-activity, including the prediction of changes in the following attributes of macromolecules from sequence:

  • expression
  • stability
  • affinity
  • specificity
  • aggregation
  • dynamics
  • allostery.

Explore both local and deep sequence space

In addition, we will explore both local and deep sequence variations to understand how function is encoded from DNA by looking at the effects of:

  • Single DNA changes – to create reference atlases for clinical genetics.
  • Double DNA changes – to measure and understand the biophysical properties at scale.
  • Combinatorial DNA changes – to understand evolutionary change and to enable rapid protein engineering.
  • De novo design – to enable the generation of new proteins and regulatory elements for industry and medicine.
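The difference in scale between these classes of variant can be made concrete with some simple counting. This is a minimal sketch, assuming a DNA sequence of length n with four possible bases at each position. The function names and the 300-base example are illustrative only.

```python
from math import comb

def n_single(n, alphabet=4):
    """Number of sequences differing from the reference at exactly one position."""
    return n * (alphabet - 1)

def n_double(n, alphabet=4):
    """Number of sequences differing from the reference at exactly two positions."""
    return comb(n, 2) * (alphabet - 1) ** 2

def n_all(n, alphabet=4):
    """Total number of possible sequences of length n."""
    return alphabet ** n

length = 300  # illustrative sequence length in bases
singles = n_single(length)   # 900
doubles = n_double(length)   # 403650
total = n_all(length)        # 4**300, far too many to enumerate
```

Single and double changes are few enough to measure exhaustively for a short sequence, while the full space grows exponentially with length. This is why deep exploration depends on models that generalise beyond the measured variants.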

The long-term goal of this endeavour is not only to make biology predictive but also to make it programmable: Generative and Synthetic Genomics will expand biotechnology beyond making minor modifications to existing molecules into the vast untapped sequence space of proteins and RNAs that can exist beyond the Earth Biogenome.

Creating a global hub for Generative and Synthetic Genomics

A key objective is to bring together computational labs developing and applying AI models with experimental labs designing assays and generating data. Indeed, we believe that co-locating modellers and experimentalists is crucial to making rapid progress. There is also huge translational potential for Generative and Synthetic Genomics, and we expect the programme to become a global hub for biological design and engineering.

Opportunities 

Generative and Synthetic Genomics is a new research programme at the Wellcome Sanger Institute with core funding from Wellcome. We are seeking three Group Leaders with ambition and vision to help us deliver our science.

To find out more and apply, please visit our job site.