Largest ever DNA sequencing dataset on UK child development studies available 

For the first time, large-scale DNA sequence data on three UK long-term birth cohorts have been released, creating a unique resource to explore the relationship between genetic and environmental factors in child health and development.

The first resource containing high-resolution DNA sequencing data for over 37,000 children and parents collected over multiple decades from across the UK is now available to researchers worldwide.

The data release is led by the Wellcome Sanger Institute, the Children of the 90s study (also known as ALSPAC), the Millennium Cohort Study (MCS), and Born in Bradford (BiB) [1], and supported by the Medical Research Council (MRC) and the Economic and Social Research Council (ESRC).

This work is supported by the ongoing efforts of Population Research UK, a UK-wide initiative led by teams at the University of Bristol and University College London, which aids longitudinal population studies by working to coordinate and connect the current research landscape.

Now available on the European Genome-phenome Archive (EGA), these high-quality genomic data can be used in combination with the existing longitudinal health and survey information provided by participating families. These combined data resources offer the scientific community the opportunity to gain valuable insights in areas ranging from population genetics to the social sciences.

For example, they could be used to investigate the impact of genetic variation on neurodevelopmental conditions or childhood obesity, and how these are influenced by environmental factors.

Longitudinal research follows large numbers of participants over multiple years, repeatedly examining them at regular time points through, for example, blood tests, body measurements, and health questionnaires, to detect changes over time.

Previously, large DNA sequence datasets have typically focused on children with rare conditions or adult population cohorts. This new data release focuses on sequencing ‘birth cohorts’, which are population-based cohorts of people followed from birth through to adolescence or early adulthood.

To produce this latest data release, researchers at the Sanger Institute sequenced the protein-coding regions of all of the roughly 20,000 genes in the human genome, a technique known as exome sequencing [2], in samples from 8,436 children and 3,215 parents from the Children of the 90s study, 7,667 children and 6,925 parents from the MCS, and 8,784 children and 2,875 parents from BiB.
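As a quick check on how these cohort figures relate to the headline number of over 37,000 participants, the counts can be tallied in a few lines of Python. This is only an illustrative sketch: the counts are those quoted above, while the variable and function names are hypothetical and not part of the data release.

```python
# Participant counts as quoted in the article.
# The dictionary layout and names are illustrative assumptions.
cohort_counts = {
    "ALSPAC": {"children": 8436, "parents": 3215},
    "MCS": {"children": 7667, "parents": 6925},
    "BiB": {"children": 8784, "parents": 2875},
}


def cohort_totals(counts):
    """Return the number of sequenced participants per cohort."""
    return {name: sum(groups.values()) for name, groups in counts.items()}


totals = cohort_totals(cohort_counts)
total_children = sum(groups["children"] for groups in cohort_counts.values())
total_parents = sum(groups["parents"] for groups in cohort_counts.values())
grand_total = total_children + total_parents

for name, n in totals.items():
    print(f"{name}: {n:,} participants")
print(f"Children: {total_children:,}")
print(f"Parents: {total_parents:,}")
print(f"All cohorts: {grand_total:,}")
```

Summed this way, the three cohorts contribute 24,887 children and 13,015 parents, or 37,902 participants in total, which is consistent with the "over 37,000" figure given earlier.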

These three UK longitudinal birth cohort studies are internationally recognised, and data from these cohorts have already been used to study the contribution of common genetic variants to phenotypes ranging from childhood obesity [3,4] to parental nurturing behaviours [5] and anxiety and depression [6].

For example, using Children of the 90s data, researchers found that a genetic variant in a gene called MC4R is associated with increased weight across childhood [4]. Studies like this could help design effective weight management interventions and change the way society views obesity [7]. That specific study used targeted DNA sequencing of the MC4R gene, whereas the new exome sequencing data reported here will allow similar investigations of other genes in the human genome. This will help drive more discoveries and research that could benefit human health.

The team has made the anonymised data as accessible as possible to approved researchers, including drafting a data note (available on Wellcome Open Research [8]) and other materials to help support its use by those who are less familiar with large-scale sequencing data.

In coming months, this DNA sequence data resource will be expanded to encompass all participants in these cohorts as well as additional cohorts. The value of these data will be enhanced by harmonising the data across the different cohorts, providing a more powerful resource than could be achieved by one study in isolation.

“Longitudinal population studies from the UK have already had a huge impact on biomedical research worldwide. This significant addition of whole exome sequencing data will further transform our understanding of the development of complex traits and diseases across the life course.”

Dr Carl Anderson, Interim Head of Human Genetics at the Wellcome Sanger Institute

“The UK’s cohorts and longitudinal population studies are an extraordinary national asset, made possible by the participation of a diverse range of people. The rich data and samples from these studies, when combined with whole exome sequencing, can unlock new research questions and insights into human society, development, health and ageing. MRC’s funding is part of our overall investment in understanding the drivers of disease to enable precision prevention and personalised treatments, and maximising existing infrastructure to ensure real value for money. This work aligns perfectly with a new exciting national resource that is supported by MRC and ESRC, Population Research UK, which is all about coordinating and leveraging UK cohorts.”

Dr Richard Evans, Interim Head of Population Health Sciences at the Medical Research Council

“The success of this initiative shows that coordination across cohort studies can be incredibly powerful and I’m excited to see the research that will come out of this fantastic new genetic data resource. We hope that this encourages other researchers to conduct long-term research studies, and solidifies their importance in UK and global research.”

Professor Nicholas Timpson, Co-Director of Population Research UK and Principal Investigator of the Children of the 90s study at the University of Bristol

“This represents one of the largest sequenced datasets collected at birth from the general population, creating a unique resource for the research community with multiple uses. By combining the pre-existing health and lifestyle data from the initial studies with genomic information, we have started to get a more complete view of how small DNA changes can subtly influence certain complex traits across childhood and adolescence. In the future, we plan to continue to sequence samples in other longitudinal studies, to build an evolving and high-quality resource that can benefit researchers in multiple areas of the biological and social sciences.”

Dr Hilary Martin, Group Leader at the Wellcome Sanger Institute

“Great science is built on collaboration and this release would not have been possible without the engagement of the families themselves, the hard work of teams managing these longitudinal studies, sustained investment in these cohorts, especially from Wellcome and the Medical Research Council, the sequencing and data analysis power of the Wellcome Sanger Institute, and the support of Population Research UK. We aim to continue to build on this resource and provide high-quality, accessible genomic data for researchers worldwide. This initiative further exemplifies the vast potential of bringing together the UK’s life science assets including committed research participants, researchers, governmental and charitable funding agencies, and genomic and computational capabilities.”

Professor Matthew Hurles, Director of the Wellcome Sanger Institute

More information

The data are available to approved researchers worldwide via the European Genome-phenome Archive (EGA). To access the data, please visit https://ega-archive.org/. The EGA study and dataset accession numbers are:

  • ALSPAC (study: EGAS00001005273): dataset EGAD00001015371
  • MCS (study: EGAS00001007789): dataset EGAD00001015372
  • BiB (study: EGAS00001006978): dataset EGAD00001015370

  1. All three internationally-recognised long-term studies have followed the participants and their families for years, and include participants from across the UK. The ALSPAC study, based at the University of Bristol, has followed participants for over three decades. The MCS at University College London includes people born across England, Scotland, Wales and Northern Ireland and began in 2000. The BiB study focuses on those living in Bradford and started tracking the health and well-being of children and their parents in 2007.
  2. Exome sequencing, also known as whole exome sequencing, is a technique that sequences all the protein-coding regions of genes in a genome. In total, all of the protein-coding regions, known collectively as the exome, represent less than 2 per cent of the genome but contain around 85 per cent of known disease-related variants.
  3. T.A. Bond, R.C. Richmond, V. Karhunen, et al. (2022) Exploring the causal effect of maternal pregnancy adiposity on offspring adiposity: Mendelian randomisation using polygenic risk scores. BMC Med. DOI: 10.1186/s12916-021-02216-w
  4. K.H. Wade, B.Y.H. Lam, A. Melvin, et al. (2021) Loss-of-function mutations in the melanocortin 4 receptor in a UK birth cohort. Nat Med. DOI: 10.1038/s41591-021-01349-y
  5. J. Wertz, T.E. Moffitt, L. Arseneault, et al. (2023) Genetic associations with parental investment from conception to wealth inheritance in six cohorts. Nat Hum Behav. DOI: 10.1038/s41562-023-01618-5
  6. C.A. Dennison, J. Martin, A. Shakeshaft, et al. (2024) Stratifying early-onset emotional disorders: using genetics to assess persistence in young people of European and South Asian ancestry. J Child Psychol Psychiatry. DOI: 10.1111/jcpp.13862
  7. Body weight is not a choice. University of Bristol article, available at: https://www.bristol.ac.uk/alspac/participants/discoveries/body-weight/
  8. Corresponding data note available on Wellcome Open Research: https://wellcomeopenresearch.org/articles/9-390/v2

Funding:

The sequencing was funded by the Wellcome Sanger Institute, with further funding from the Medical Research Council and the Economic and Social Research Council.