Comparison of draft human sequence versions from the public and private domain

Now that the papers from the two teams can be inspected, we can for the first time judge the relative quality of the two versions and the strengths of the methods employed by the different initiatives: the mapping and finishing process (public) and pure whole genome shotgun (private).


Summary

The greater part of the data for Celera’s assemblies comes from the public Human Genome Project (HGP).

Despite this benefit, Celera’s assembly is only comparable with that of the HGP and is dependent upon it.

Far from “winning the race”, as they have claimed and many commentators have believed, their methodology has been found wanting.

The situation today

Now that the papers from the two teams can be inspected, we can for the first time judge the relative quality of the two versions and the strengths of the methods.

It would be expected that Celera’s version would be somewhat superior, because they had the opportunity to pick up the HGP data as it was continuously released and combine it with their own. The HGP could not do the reverse, because using Celera’s data would require agreeing not to redistribute it, which would defeat the HGP’s objective of the widest possible dissemination.

The sources of the data

We first note that Celera makes no attempt to assemble their data on its own.

They use a combination of sequence data derived from two sources: 5.1-fold coverage generated by Celera from random shotgun sequencing, and 7.5-fold coverage generated by the public HGP from clone-based sequencing, giving 12.6-fold total coverage.

In Celera’s use of HGP data for their assemblies the sequences were artificially broken up into so-called ‘faux reads’ – that is, a set of perfect, evenly spaced sequence reads with no gaps (page 10, column 3).

Celera’s paper then incorrectly counts the ‘faux reads’ from the HGP as being the equivalent of 2.9-fold random coverage. Although the number of ‘faux reads’ corresponds to 2.9-fold coverage, the ‘faux reads’ were not randomly chosen by Celera, which would have resulted in many gaps, but rather were a perfect tiling path, with no gaps, reflecting the full 7.5-fold coverage embodied in the HGP data.
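The difference between random and tiled reads can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the genome length, read length and random seed are hypothetical values chosen for illustration, and the expected uncovered fraction uses the standard idealised Poisson (Lander-Waterman) approximation, which is not taken from either paper.

```python
import math
import random

GENOME_LEN = 100_000  # toy genome length in bases (assumption)
READ_LEN = 500        # toy read length in bases (assumption)
COVERAGE = 2.9        # fold coverage attributed to the 'faux reads'


def covered_fraction(starts, read_len, genome_len):
    """Fraction of genome positions covered by at least one read."""
    covered = [False] * genome_len
    for start in starts:
        for pos in range(start, min(start + read_len, genome_len)):
            covered[pos] = True
    return sum(covered) / genome_len


def tiled_starts(genome_len, read_len, n_reads):
    """Evenly spaced read starts spanning the whole genome (a tiling path)."""
    step = (genome_len - read_len) / (n_reads - 1)
    return [round(i * step) for i in range(n_reads)]


def random_starts(genome_len, read_len, n_reads, rng):
    """Read starts drawn uniformly at random, as in shotgun sequencing."""
    return [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]


n_reads = round(COVERAGE * GENOME_LEN / READ_LEN)
rng = random.Random(0)  # fixed seed so the run is repeatable

tiled = covered_fraction(tiled_starts(GENOME_LEN, READ_LEN, n_reads),
                         READ_LEN, GENOME_LEN)
shotgun = covered_fraction(random_starts(GENOME_LEN, READ_LEN, n_reads, rng),
                           READ_LEN, GENOME_LEN)
expected_uncovered = math.exp(-COVERAGE)  # idealised random-sampling estimate

print(f"tiled reads cover   {tiled:.1%} of the toy genome")
print(f"random reads cover  {shotgun:.1%} of the toy genome")
print(f"expected uncovered at {COVERAGE}-fold random coverage: "
      f"{expected_uncovered:.1%}")
```

With the same number of reads, the tiled set leaves no gaps while the random set does, which is why counting the ‘faux reads’ as ordinary 2.9-fold random coverage understates the information they carry.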

Assembly methods

Celera claims that their method of “whole genome assembly” of the sequence data into a coherent product is superior to the HGP’s method of hierarchical assembly. The latter first breaks the genome randomly (“shotguns” it) into 150,000-base pieces called clones. These are ordered (mapped), and a suitable subset is sequenced individually by a second shotgun. Celera attempts to achieve the same goal without the intermediate step.

The results of three assembly methods are shown in a table below, entitled “Assembly comparisons”.

The first column shows pure whole genome assembly. Even at 12.6X coverage, it fails badly both in sequence continuity (number of gaps) and in long-range order (number of components requiring localisation to a specific place on the genome). Connecting all the pieces and filling in the gaps between them would be a formidable task.

The third column shows the HGP’s hierarchical shotgun assembly. At this (draft) stage it has 94% coverage and many gaps, as expected. Nevertheless it is already much better than the whole genome assembly, even though it is based on a more economical 7.5X coverage. All components are localised by the mapping step, which makes “finishing” straightforward.

The second column shows Celera’s compartmentalised assembly: an attempt to make greater use of HGP data, by importing its map information as well as its shotgun coverage. Since this assembly has 68% more shotgun data than HGP alone, it should give a much better assembly. Remarkably, it appears quite similar. However, the local ordering of pieces of sequence (“contigs”) has been improved, as expected at this stage when the HGP product is unfinished.
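The 68% figure follows directly from the coverage numbers quoted earlier. A minimal check of the arithmetic, with variable names chosen here for illustration:

```python
CELERA_FOLD = 5.1  # Celera's random shotgun coverage
HGP_FOLD = 7.5     # HGP clone-based coverage

total_fold = CELERA_FOLD + HGP_FOLD       # coverage used in Celera's assemblies
extra_over_hgp = CELERA_FOLD / HGP_FOLD   # additional data relative to HGP alone

print(f"{total_fold:.1f}-fold total, {extra_over_hgp:.0%} more than HGP alone")
```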

We note in passing that Celera’s whole genome assembly of Drosophila, which was claimed as a proof of its power, also had many problems that are only now being resolved by map-based finishing.

Finishing

Finishing is the process whereby additional sequence is obtained to achieve full coverage and continuity over very long distances by filling in the gaps. Only the HGP is undertaking finishing.

At this draft stage, with 94% coverage, 35% of the HGP sequence is already finished, and all will be done in the next year or two. The finished sequence, which includes the whole of chromosomes 21 and 22, is of much higher quality than any of the draft assemblies.

Conclusions

Celera’s whole genome shotgun is the method that was claimed to make methods based on clones and maps redundant and wasteful. Their compartmentalised assembly depends on HGP maps, and is really a version of HGP’s hierarchical shotgun assembly. It should be noted that Celera’s analysis uses the compartmentalised assembly exclusively.

It is difficult to escape the conclusion that pure whole genome shotgun has failed as far as generating the sequence of the human genome is concerned.

We conclude that on current showing both the mapping and the finishing process that HGP employs are essential for determining the sequence of the human and other large genomes.

Table: Assembly comparisons

| | Celera – 1 | Celera – 2 | Human Genome Project (HGP) |
|---|---|---|---|
| Assembly | ‘Whole Genome’ (WGA in Science paper) | Compartmentalised Shotgun (CSA in Science paper) | Clone-based shotgun |
| Sequence coverage used in assembly | 5.1-fold Celera + 7.5-fold HGP = 12.6-fold total | 5.1-fold Celera + 7.5-fold HGP = 12.6-fold total, + HGP localisation | 7.5-fold HGP, + HGP localisation |
| Genome sequence (bases whose sequence was determined) | 2.57 Billion (88%) | 2.65 Billion (90%) | 2.69 Billion (92%) |
| Fraction of the genome: 1. Covered by raw sequence | 99.9% | >99% | 94% |
| Fraction of the genome: 2. Successfully assembled | 88% | 90% | 92% |
| Fraction of the genome: 3. Unassembled | ~12% (26% of raw data not localized in genome) | ~10% (22% of raw data not localized in genome) | ~2% (<1% of raw data; all localized to individual clones) |
| Number of contigs (and hence gaps) | 221,036 | 170,033 | 149,821 |
| Number of scaffolds (connected sets of contigs) | 118,968, fully ordered internally | 53,591, fully ordered internally | 87,757, partially ordered internally |
| Largest contig | 1.2 Million | 2.0 Million | 28.5 Million |
| Proportion of genome in contigs >100kb | 31% | 49% | 46% |

* Note: all percentages above assume a nominal euchromatic genome size of 2.93 Billion bases (the euchromatic genome is the part that can be sequenced with current technology, and that contains almost all the genes). Although the correct euchromatic genome size is still not known exactly, any difference from this assumption will not change the relative numbers above.
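The assembled percentages can be reproduced from the base counts and the nominal genome size. The sketch below also shows why a different genome size would leave the comparison between assemblies unchanged. The alternative size of 3.1 Billion bases is a hypothetical value used only to demonstrate this, and the function and variable names are illustrative.

```python
NOMINAL_GENOME = 2.93e9  # nominal euchromatic genome size, in bases

# Bases whose sequence was determined in each assembly
assembled_bases = {
    "Celera whole genome (WGA)": 2.57e9,
    "Celera compartmentalised (CSA)": 2.65e9,
    "HGP clone-based": 2.69e9,
}


def percent_assembled(bases, genome_size=NOMINAL_GENOME):
    """Assembled bases as a percentage of the assumed genome size."""
    return 100 * bases / genome_size


for name, bases in assembled_bases.items():
    print(f"{name}: {percent_assembled(bases):.0f}%")

# A different genome size rescales every percentage by the same factor,
# so the ratio between any two assemblies does not change.
ALTERNATIVE_GENOME = 3.1e9  # hypothetical size, for illustration only
hgp = assembled_bases["HGP clone-based"]
wga = assembled_bases["Celera whole genome (WGA)"]
ratio_nominal = percent_assembled(hgp) / percent_assembled(wga)
ratio_alternative = (percent_assembled(hgp, ALTERNATIVE_GENOME)
                     / percent_assembled(wga, ALTERNATIVE_GENOME))
print(f"HGP / WGA ratio: {ratio_nominal:.3f} vs {ratio_alternative:.3f}")
```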
