A User's Guide to the Encyclopaedia of DNA Elements

International effort to decode human genes releases massive dataset and instructions for use

Email newsletter

News and blog updates

Sign up

Selected ENCODE and other gene and transcript annotations in the region of the human TP53 gene on chromosome 17 GENCODE annotated isoforms of TP53 RNAs are shown in red at the top of the figure, along with annotation of the neighbouring WRAP53 gene. Below these are two mRNA transcripts, located in the first 10 kb intron of the p53 gene (U58658/AK097045, in black), from GenBank. The bottom two tracks show the structure of transcripts detected in nuclear polyadenylated RNAs from GM12878 and K562 cells, displayed according to the strand of origin (i.e. + and -). Signals are present at each of the detected p53 exons. Signals are also evident at the U58658 and AK097045 regions: such transcripts are reported to be induced during differentiation of myeloid leukaemia cells but are seen in both GM12878 and K562 cell lines. Finally the p53 isoform observed in K562 cells has a longer 3’UTR region than the isoform seen in the GM12878 cell line.

The international team of the ENCODE, or Encyclopedia Of DNA Elements, Project, have created an overview of their ongoing large-scale efforts to interpret the human genome sequence.

The April 19 publication of “A User’s Guide to the Encyclopedia of DNA Elements (ENCODE)” in the journal PLoS Biology provides a guide for using the vast amounts of high-quality data and resources produced so far by the project. All of the data, tools to study them, and the paper itself are freely available through multiple websites.

The Human Genome Project and subsequent large-scale genomic efforts were based on the belief that data from large genome projects should be made freely available, to further scientific discovery. ENCODE has accomplished this goal through their database at genome.ucsc.edu, and by creating and posting tools to facilitate data use at encodeproject.org.

“This project requires collaboration from multiple people all over the world at the cutting edge of their fields, working in a coordinated manner to figure out the function of our human genome. The importance extends beyond basic knowledge of who and what we are as humans and into understanding of human health and disease.”

Dr. Richard Myers President and director of the HudsonAlpha Institute for Biotechnology and one of the 25 principal investigators of the project

In their publication, the team demonstrates how the data can be immediately useful in interpreting associations between single nucleotides and disease. DNA variants upstream of the c-Myc proto-oncogene are associated with multiple cancers, but until recently the mechanism behind the association was unknown. ENCODE data confirm recent observations by other groups that the variants can change binding of transcription factor proteins to an enhancer region, leading to changes in expression of the c-Myc gene and therefore to oncogenesis. Similar studies are now possible for the thousands of variants identified in genome-wide association studies, addressing mechanistic questions of susceptibility to disease.

As part of the international effort, the Wellcome Trust Sanger Institute and its collaborators aim to develop accurate, evidence-based descriptions (annotations) for all gene features in the entire human genome through the GENCODE sub-project. Their aim is to deliver annotations for all protein-coding loci, non-protein-coding loci where there is evidence of gene activity, and pseudogenes. The team curates the annotations manually as well as using computation analysis and work in biology laboratories, incorporating the work of the Sanger HAVANA and Ensembl groups into the GENCODE gene set.

“Our commitments from the Human Genome Project included freely available data, high-quality genome tools and the most complete and accurate annotation that expert analysis could provide. GENCODE delivers on these commitments, providing the most accurate and complete gene annotation to ensure that the research of biomedical scientists around the world can proceed as rapidly and confidently as possible. The entire ENCODE Project is delivering data sets informing how our genome functions that builds on the foundations laid a decade ago.”

Dr Tim Hubbard who leads GENCODE at the Wellcome Trust Sanger Institute ​

The Gencode gene sets are used by the entire ENCODE consortium and by many other projects (such as the 1000 Genomes Project and the International Cancer Genome Consortium) as reference gene sets.

“We knew four years ago, from our publication of ENCODE techniques on 1 per cent of the genome, that we had an unprecedented view of how biology works on those regions. By extending our work to the entire genome, we see the immediate impact on the interpretation of noncoding variants identified in genome-wide association studies. These studies are disease-driven but have not always yielded clear next steps, and ENCODE data provide those scientists with some new paths to follow.”

Dr. Ewan Birney Senior team leader at the EMBL-European Bioinformatics Institute in the UK and another principal investigator

The current paper not only tells how to find the data, but also explains how to apply the data to interpret the human genome. One can think of determining the human DNA sequence alone as finding a new language, but without a key to interpret the letters within. The ENCODE project adds data such as where RNA is produced from our DNA, where proteins bind to DNA, and where parts of our DNA are augmented by additional chemical markers. These proteins and chemical additions are keys to understanding how different cells within our bodies are interpreting the language of DNA.

“ENCODE resources are already being used by scientists for discovery but they are also used in teaching classes now, to train students in all areas of biology. Our classes are using real data on genomic variation and function in their problem sets, shortly after our labs have generated them.”

Dr. Ross Hardison Professor at Penn State University and principal investigator

Scientists with the ENCODE Project are applying up to 20 different tests in 108 commonly used cell lines to compile these important data.

“Assays that are now fundamental to biology, such as chromatin immunoprecipitation and sequencing or ChIP-seq, were produced by the ENCODE Project. Widely used computational tools for processing and interpretation of large-scale functional genomic data have also been developed by the project. The depth, quality, and diversity of the ENCODE data are unprecedented.”

Dr. John Stamatoyannopoulos Assistant professor at the University of Washington and principal investigator

“It is now becoming commonplace to sequence genomes. The biggest challenge ahead is to understand the information that is generated. The ENCODE project is shedding light on the vast sea of information that is present in a human genome sequence including where the genes are and the sequences that control them.”

Dr. Michael Snyder Professor and chair at Stanford University and ENCODE principal investigator

More information

Funding

The ENCODE Project is funded by the National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.

Selected websites

Participating Centres

A list of participating centres can be found at http://www.genome.gov/26525220

Publications:

Loading publications...

Selected websites

  • The Wellcome Trust Sanger Institute

    The Wellcome Trust Sanger Institute, which receives the majority of its funding from the Wellcome Trust, was founded in 1992. The Institute is responsible for the completion of the sequence of approximately one-third of the human genome as well as genomes of model organisms and more than 90 pathogen genomes. In October 2006, new funding was awarded by the Wellcome Trust to exploit the wealth of genome data now available to answer important questions about health and disease.

  • The Wellcome Trust

    The Wellcome Trust is a global charitable foundation dedicated to achieving extraordinary improvements in human and animal health. We support the brightest minds in biomedical research and the medical humanities. Our breadth of support includes public engagement, education and the application of research to improve health. We are independent of both political and commercial interests.