TraceSearch - Seeking among sequences
The Wellcome Trust Sanger Institute launches today a search engine that speeds up the hunt for important DNA sequences more than 100-fold. Just as important, the engine scans all known sequence data from all available organisms, saving time for researchers, who, until the release of TraceSearch, had to inspect each organism in turn.
When studying genes, biomedical researchers often need to find related sequences in the public databases because those might hold clues to the function of the gene, or might identify a critical variant involved in disease. If the sequences are not complete, which is true for most large genomes, the key to finding the parts not in the assembled sequence is to search the original data – the sequence ‘traces’. The new engine makes that comparison much more efficient.
Until the development of the new search system, the only option was to query sequences from one species at a time: those queries were placed in a queue and the results returned after 10 minutes or longer. The new search system searches all species and works ‘live’, returning results in just 3-5 seconds.
“TraceSearch achieves its speed by indexing the DNA sequences, rather than performing a computationally expensive alignment against every sequence in the database: we only ever do that for the very best hits from the index. Classical methods of DNA alignment rely on dynamic programming, which slows down dramatically as the size of the database increases. Even conventional indexing methods produce a massive index which has to be accessed from a hard disk.
“We use the SSAHA algorithm, which exploits the fact that DNA has an alphabet of only four letters, to construct a small index which fits into the RAM of the computer. As the database grows the time to do an index look-up remains short and we only ever perform a few alignments.”
Dr Adam Spargo, Senior developer of TraceSearch
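The indexing idea described above can be illustrated with a minimal sketch. It is not the TraceSearch or SSAHA source code: the function names, the word length `k` and the toy sequences are assumptions made for illustration. It shows how a four-letter alphabet lets each k-letter word be packed into a small integer (two bits per base), how the database is indexed by those words, and how a query is matched by look-ups alone, leaving full alignment for the few best candidates.

```python
from collections import defaultdict

# Two bits per base: a k-letter word becomes an integer in [0, 4**k).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a k-mer into an integer; return None if it has a non-ACGT letter."""
    value = 0
    for base in kmer:
        code = BASE_CODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def build_index(sequences, k):
    """Map each encoded k-mer to the (sequence id, offset) pairs where it occurs.

    Database sequences are sampled at non-overlapping offsets 0, k, 2k, ...
    which keeps the index small (an assumption of this sketch).
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(0, len(seq) - k + 1, k):
            key = encode(seq[offset:offset + k])
            if key is not None:
                index[key].append((seq_id, offset))
    return index


def search(index, query, k):
    """Count k-mer hits per (sequence id, diagonal), best candidates first.

    The diagonal is the database offset minus the query offset; many hits on
    one diagonal suggest a region worth aligning properly.
    """
    votes = defaultdict(int)
    for q_pos in range(len(query) - k + 1):
        key = encode(query[q_pos:q_pos + k])
        if key is None:
            continue
        for seq_id, offset in index.get(key, ()):
            votes[(seq_id, offset - q_pos)] += 1
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    traces = {"t1": "ACGTACGTTTGACCA", "t2": "GGGGGGGGGGGG"}  # toy data
    idx = build_index(traces, k=4)
    print(search(idx, "ACGTTTGA", k=4))
```

In this sketch only the top-ranked candidates would be passed to a conventional alignment step, which is why the cost of a query depends on the number of look-ups rather than on the size of the database.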
However, even using the SSAHA method the index is large – currently 300 gigabytes – and the team decided to distribute it over a cluster of Linux machines, offering an inherently scalable and economical solution. The hashtable is hosted on 30 IBM LS20 blades.
This marks a transition from dynamic programming-based algorithms that run on a single large machine to a distributed index approach that runs on a cluster of commodity machines, with the final alignment phase being akin to the PageRank phase in Google, ordering the hits by relevance to the query.
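The distributed approach can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the Institute's implementation: the shards are plain in-memory dictionaries rather than separate machines, threads stand in for network calls, the word length `K` and all function names are hypothetical, and candidates are ranked by simple hit count where the real system performs an alignment. It shows the pattern of splitting the index, sending the query to every part, and merging the partial results into one ranked list.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import heapq

K = 4  # illustrative word length, not the value used by TraceSearch


def build_shard(sequences):
    """Index one shard: k-mer -> list of (sequence id, offset)."""
    index = {}
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - K + 1):
            index.setdefault(seq[offset:offset + K], []).append((seq_id, offset))
    return index


def query_shard(index, query):
    """Return a Counter of k-mer hits per sequence id for one shard."""
    counts = Counter()
    for pos in range(len(query) - K + 1):
        for seq_id, _offset in index.get(query[pos:pos + K], ()):
            counts[seq_id] += 1
    return counts


def partition(sequences, n_shards):
    """Deal sequences round-robin (in id order) into n_shards dictionaries."""
    shards = [{} for _ in range(n_shards)]
    for i, (seq_id, seq) in enumerate(sorted(sequences.items())):
        shards[i % n_shards][seq_id] = seq
    return shards


def distributed_search(shard_indexes, query, top_n=3):
    """Scatter the query to every shard, gather the counts, keep the best hits."""
    with ThreadPoolExecutor(max_workers=len(shard_indexes)) as pool:
        partials = list(pool.map(lambda idx: query_shard(idx, query), shard_indexes))
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    # Stand-in for the final phase: rank candidates by number of hits.
    return heapq.nlargest(top_n, merged.items(), key=lambda item: item[1])


if __name__ == "__main__":
    traces = {"a": "ACGTACGTAC", "b": "TTTTTTTTTT", "c": "ACGTTTTT"}  # toy data
    indexes = [build_shard(part) for part in partition(traces, n_shards=2)]
    print(distributed_search(indexes, "ACGTAC"))
```

Because each shard answers independently, adding sequence data means adding shards rather than enlarging a single machine, which is the scalability property the cluster design relies on.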
“The system not only finds absolute matches, but finds related matches. Variation in DNA sequence is vital in disease studies and we needed to build in the ability to find all sequences related to the query sequence.”
Dr Adam Spargo, Senior developer of TraceSearch
This remarkable performance is built on the Sanger Institute Trace Archive, a repository of almost one trillion DNA bases. Printed on A4 paper, the archive would fill 133 million sheets, enough to cover the City of London three times; stacked, the paper would stand 2.5 times the height of Mount Everest. TraceSearch, however, would find any piece of DNA text in a few seconds. The Sanger Institute Trace Archive collaborates with the NCBI Trace Archive to act as a global repository for all unassembled DNA sequence data. The Trace Archives complement the International Sequence Databases EMBL/GenBank/DDBJ, which are the repository for assembled and annotated sequence records.
“Speed and cross-species search are the two flashing points for TraceSearch. We had previously built SSAHA single hashtables for trace search, but we knew we would need dramatic improvements if we were to cope with growing sequence data and the demands of researchers for rapid access to sequence from all species.
“This is an exciting project. The new search engine is not just an improvement, but a dramatic reinvention of the earlier engine. The alignment quality has been significantly improved without a slowdown in match speed, given the size of the database. Since no genome has been 100% finished, TraceSearch users can find holes in the genome, and cross-species search may also play a role in comparative genomics.”
Dr Zemin Ning, Project leader responsible for the development team at the Institute
The system is distributed over 30 dual-CPU IBM LS20 blades; the central server runs on a 4-CPU HP DL385; all machines run Linux 2.6.13 and are connected with gigabit ethernet. The hashtable files are stored in a redundant fashion on the local disks of the blades.
“We and other organizations around the world are generating massive amounts of sequence data and those volumes are likely to grow with new sequencing technologies. Our own researchers and others that use our databases wanted a means to search more rapidly and across all species. That is what we have delivered.”
Martin Widlake, Head of database services at the Institute
Finding sequences among the world’s DNA information has been difficult, awkward and time-consuming. TraceSearch transforms that important task, giving live, accurate and useful information in real time.
More information
Sanger Institute IT
The Wellcome Trust Sanger Institute runs more than 2000 processors in its data centre and has over 350TB of disk storage. Its blade farm consistently ranks in the world’s top 500 supercomputer lists. The Institute’s sites receive more than 5M visits (Page Impressions) each week and users download over 200GB each week.
The Wellcome Trust Sanger Institute
The Wellcome Trust Sanger Institute, which receives the majority of its funding from the Wellcome Trust, was founded in 1992. The Institute is responsible for the completion of the sequence of approximately one-third of the human genome as well as genomes of model organisms and more than 90 pathogen genomes. In October 2006, new funding was awarded by the Wellcome Trust to exploit the wealth of genome data now available to answer important questions about health and disease.
The Wellcome Trust and Its Founder
The Wellcome Trust is the most diverse biomedical research charity in the world, spending about £450 million every year both in the UK and internationally to support and promote research that will improve the health of humans and animals. The Trust was established under the will of Sir Henry Wellcome, and is funded from a private endowment, which is managed with long-term stability and growth in mind.