CAF
Common assembly format (CAF).
About
CAF is acedb-compliant and extends the ace-file format used earlier, adding support for base quality measures and a more extensive description of the Sequence data. CAF was designed during the Sanger sequencing era; its modern-day successor is the SAM format, or its binary equivalents BAM and CRAM.
Downloads
Tools for manipulating CAF files can be downloaded from the Sanger ftp site:
caftools source – Source code for the ‘caftools’ package. This includes a number of tools for manipulating CAF files.
gap2caf source – Source code for the ‘gap2caf’ package. gap2caf converts a gap4 database to CAF format. Building this requires a copy of the Staden package source code.
mini phrap2gap – A collection of perl scripts that wrap the process of running a sequence assembly with phrap and converting it to a gap4 database. It requires a number of other tools, notably the caftools package above, and phrap itself.
Further information
CAF Overview
CAF is intended to be sufficiently comprehensive that any assembly engine or editor, such as Phrap, Consed, Gap, Acembly or FAK, can derive all the information it needs from the CAF file without reading any other data (except for trace information, which is still held in SCF files). The eventual aim is to be able to convert freely between CAF and any other format so that different assembly programs can be combined. This aim is still some way off because of incompatibilities between these programs. Currently it is possible to convert to and from Phrap and FAK, and into GAP (the reverse is possible with some loss of information).
CAF Version 2 supports some additional tags (Insert_size, Ligation_no) for reads. More importantly, it has support for describing partially-assembled groups of contigs. The basic idea is that an Is_assembly Sequence object is an ordered set of Is_group Contig Group objects, listed by the Group_order tag. In turn, each Contig group is an ordered set of contigs listed by the Contig_order tag. It is thus easy to manipulate groups of contigs (e.g. those that are believed to be adjacent) without having to use absolute coordinates.
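The two-level ordering described above can be sketched as follows. This is only an illustration: the group and contig names are hypothetical, and the dictionaries stand in for Group_order and Contig_order lines that a real tool would read from the CAF file.

```python
from typing import Dict, List, Tuple


def ordered_contigs(
    group_order: Dict[str, int],
    contig_order: Dict[str, Dict[str, int]],
) -> List[Tuple[str, str]]:
    """Return (group, contig) pairs in assembly order.

    group_order maps a group name to its position within the assembly
    (Group_order), and contig_order maps each group to the relative
    positions of its contigs (Contig_order).
    """
    layout = []
    for group in sorted(group_order, key=group_order.get):
        contigs = contig_order.get(group, {})
        for contig in sorted(contigs, key=contigs.get):
            layout.append((group, contig))
    return layout


# Hypothetical assembly with two groups and three contigs.
group_order = {"GroupB": 2, "GroupA": 1}
contig_order = {
    "GroupA": {"Contig7": 2, "Contig3": 1},
    "GroupB": {"Contig5": 1},
}
layout = ordered_contigs(group_order, contig_order)
```

Because only relative positions are stored, moving a whole group means changing one Group_order value rather than recomputing coordinates for every contig.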
In CAF files, a comment is any text preceded by //.
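A minimal sketch of comment stripping is shown below. It assumes that a // inside a double-quoted string belongs to the string rather than starting a comment; the text above does not say how CAF tools treat that case, so this is a cautious guess rather than documented behaviour. The function name is hypothetical.

```python
def strip_comment(line: str) -> str:
    """Remove a trailing // comment from one line of CAF text.

    Assumption: '//' inside double quotes is kept as part of the string.
    """
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and line.startswith("//", i):
            return line[:i].rstrip()
    return line.rstrip()
```

For example, `strip_comment('Asped "Date" // Reads only')` keeps only the tag and its value.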
CAF supports three object types, Sequence, DNA and BaseQuality. The Sequence type is the most complex. All base coordinates start at position 1 (NOT 0). These are the important Sequence attributes (others are available in the CAF acedb model but are not currently used by any processing module).
Sequence type
Sequence : "Name"    // Name of the Sequence
Is_read | Is_contig | Is_group | Is_assembly    // Type
Padded | Unpadded    // State
ProcessStatus "State" "Text"    // Reads only: Asp pass or failure, with reason.
    // "State" can be:
    // PASS,
    // SVEC (completely seq vector),
    // QUAL (poor trace quality),
    // CONT (contaminant, eg E.coli)
Asped "Date"    // Reads only: Date processed
Dye Dye_terminator | Dye_primer    // Reads only: Chemistry
SCF_File "Filename"    // Reads only: Name of SCF file containing trace data
Primer Unknown_primer | Universal_primer | Custom "Oligo"    // Reads only: primer type,
    // including oligo sequence if Custom
Template "Template"    // Reads only: Template name
Insert_size x1 x2    // Predicted range [x1,x2] of insert size
Ligation_no "Text"    // The Ligation number (ie library identifier) for the read
Strand Forward | Reverse    // Reads only: forward or reverse strand
Seq_vec "Type" x1 x2 "Text"    // Sequencing vector from position x1 to position x2 inclusive
    // "Type" is redundant, set to SVEC by default.
Clone_vec "Type" x1 x2 "Text"    // Cloning vector from position x1 to position x2 inclusive
    // "Type" is redundant, set to CVEC by default.
Clipping "Type" x1 x2 "Text"    // Clipping from position x1 to position x2 inclusive
GoldenPath "Read" x1 x2    // Contigs only: Phrap Golden path:
    // Use "Read" between contig coords x1, x2 inclusive
Tag "Type" x1 x2 "Text"    // General Tag on interval [x1,x2] (used in GAP)
Assembled_from "Read" s1 s2 r1 r2    // Contigs only: Alignment of Read to contig.
    // Interval [r1,r2] in the read aligns with [s1,s2] in contig.
    // If s1 > s2 then align the reverse complement of [r1,r2] with [s1,s2].
Align_to_SCF s1 s2 r1 r2    // Reads only: Alignment of Read to original SCF base-calls.
    // Similar to Assembled_from
Group_order Group p1    // Assemblies only: defines Group to be at group position p1 within assembly
Contig_order Contig q1    // Groups only: Defines Contig to be at relative position q1 within group
DNA type
DNA : "Name"    // Name of sequence
ACGTGCGG......    // The sequence: Use ACGT, N for unknowns, - for pads.
BaseQuality type
BaseQuality : "Name"    // Name of sequence
0 12 13 90 ...    // Base qualities. These must be integers
    // between 0 and 99 inclusive. If the BaseQuality is present
    // it must be the same length as the DNA
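The two constraints on BaseQuality (values between 0 and 99, and the same length as the DNA) are easy to check mechanically. The sketch below assumes the DNA and quality bodies have already been extracted as strings with comments removed; the function name is hypothetical and not part of caftools.

```python
from typing import List


def check_base_quality(dna: str, qualities: str) -> List[str]:
    """Return a list of problems found; an empty list means the pair is consistent."""
    # DNA may be wrapped over several lines, so drop all whitespace first.
    dna_length = len("".join(dna.split()))
    values = [int(token) for token in qualities.split()]

    problems = []
    if len(values) != dna_length:
        problems.append(
            f"length mismatch: {dna_length} bases but {len(values)} quality values"
        )
    out_of_range = [v for v in values if not 0 <= v <= 99]
    if out_of_range:
        problems.append(f"values outside 0..99: {out_of_range}")
    return problems
```

Pads ("-") count as positions in the DNA, so a padded sequence needs one quality value per pad as well.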
Sequence Attributes
Sequence objects in the CAF file are used to represent both reads and contigs (constructed from reads during sequence assembly).
Field | Description |
---|---|
Is_read | The sequence represents a reading |
Is_contig | The sequence represents a set of aligned reads |
Attributes common to all Sequence objects
An assembly can be in either a padded or unpadded state. In the padded state, padding characters are added to the contig sequence and the reads so that the reads align with one another perfectly. In the unpadded state, the alignment between reads and the contig is described fully.
Field | Description |
---|---|
Padded | The sequence is padded |
Unpadded | The sequence is unpadded |
Features identified within sequences are annotated using one of the following four fields. In each case, Method is either the name of a program, or some generic name describing the method. We currently use the name of the GAP4 (Bonfield et al, 1996) tag id. The feature is considered to cover bases From through To in the Sequence. Comment is optional. If more than one word, the comment should be surrounded by double quotes.
Field | Attributes | Description |
---|---|---|
Clone_vec | Method From To Comment | Location of Cloning vector. |
Seq_vec | Method From To Comment | Location of Sequencing vector. |
Clipping | Method From To | High quality bases. |
Tag | Method From To Comment | Some arbitrary feature. |
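The four feature fields share the layout Method From To with an optional Comment, so a single pattern can read all of them. The sketch below is an assumption-laden illustration: the `Feature` type and `parse_feature` function are hypothetical names, and the pattern expects an unquoted, single-word Method, as in the examples on this page.

```python
import re
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    field: str
    method: str
    start: int              # "From" (1-based, inclusive)
    end: int                # "To" (1-based, inclusive)
    comment: Optional[str]


_FEATURE_RE = re.compile(
    r'^(Clone_vec|Seq_vec|Clipping|Tag)\s+(\S+)\s+(\d+)\s+(\d+)(?:\s+(.+))?$'
)


def parse_feature(line: str) -> Optional[Feature]:
    """Parse one feature line; return None for any other kind of line."""
    match = _FEATURE_RE.match(line.strip())
    if match is None:
        return None
    field, method, start, end, comment = match.groups()
    # Multi-word comments are wrapped in double quotes; remove the wrapper.
    if comment and len(comment) >= 2 and comment[0] == '"' and comment[-1] == '"':
        comment = comment[1:-1]
    return Feature(field, method, int(start), int(end), comment)
```

Lines that are not one of the four feature fields, such as `Template hh26e2`, return `None` so the caller can pass them to other handlers.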
Attributes of reads only
Field | Attributes | Description |
---|---|---|
Sequencing_vector | Vector_name | The name of the sequencing vector used. |
Template | Template_name | The name of the DNA template from which the sequence was determined. |
Insert_size | From To | Minimum and maximum size estimate of insert of DNA template. |
SCF_File | File_name | The name of the SCF file for this reading. |
Base_caller | Base_caller_name | The name of the base calling software used to generate read. |
Stolen | Comment | The read was not originally sequenced for the clone |
Staden_id | Integer | The internal read number from the Staden GAP database (gap2caf). |
Asped | Date | The date the read was sequenced/pre-processed. |
ProcessStatus | Text | Preprocessing status (PASS or FAIL). |
Align_to_SCF | ReadFrom ReadTo SCFFrom SCFTo | The alignment between the read and the original base calls |
Information about the sequencing chemistry used to generate the read is included. One of the following must be specified.
Field | Description |
---|---|
Dye_primer | The read was sequenced using Dye primer chemistry |
Dye_terminator | The read was sequenced using Dye terminator chemistry |
Orientation of the read with respect to the template is specified by one of the following:
Field | Description |
---|---|
Forward | The read is on the Forward strand of the template |
Reverse | The read is on the Reverse strand of the template |
The location of the read with respect to the insert is specified by one of the following:
Field | Attributes | Description |
---|---|---|
Universal_primer | | Either a forward or reverse universal primer was used. |
Custom | Primer_name | A custom primer was used. The name is optional. |
Attributes of contigs only
The alignment of each reading to the contig is described by Assembled_from lines:
Field | Attributes | Description |
---|---|---|
Assembled_from | ContigFrom ContigTo ReadFrom ReadTo | The alignment between the contig and the read |
Sequence State: Padded vs Unpadded
The alignment of a reading to the contig can be Padded or Unpadded. Padded means that gaps (“-”) have been inserted where required in both contig and aligned readings so that there is a 1-1 correspondence between the aligned DNAs. In a Padded assembly there is exactly one Assembled_from line for each aligned read in a contig, and the DNA objects contain “-” padding characters.
In an unpadded alignment all the pads are removed from the DNA objects and there are multiple Assembled_from lines for each reading in a contig. However, within each Assembled_from line there is a 1-1 correspondence between the (possibly reverse-complemented) read interval [r1,r2] and the contig interval [s1,s2].
Some jobs (notably ones that need to compare reads to the contig consensus sequence) are easier with the padded alignment, while others are better done unpadded. The programs caf_pad and caf_depad allow one to move transparently between padded and unpadded states, without loss of information (well, almost – columns of pads are removed). Note that in a padded alignment with BaseQuality information it is necessary to attach a quality value to each pad to keep the lengths equal. caf_pad currently does this by interpolating the quality values of the surrounding bases.
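The quality handling described above can be sketched as follows. The exact interpolation used by caf_pad is not specified here, so this example assumes the integer mean of the nearest non-pad neighbours on each side; treat it as an illustration and not as a reimplementation of caf_pad. Both function names are hypothetical.

```python
from typing import List, Optional, Tuple


def pad_qualities(padded_dna: str, quals: List[int]) -> List[int]:
    """Expand unpadded quality values so there is one value per padded position."""
    n_bases = len(padded_dna) - padded_dna.count("-")
    if n_bases != len(quals):
        raise ValueError("quality count must equal the number of non-pad bases")

    values = iter(quals)
    slots: List[Optional[int]] = [
        None if base == "-" else next(values) for base in padded_dna
    ]

    result: List[int] = []
    for i, q in enumerate(slots):
        if q is not None:
            result.append(q)
            continue
        # Nearest real base on each side of the pad (None at the sequence ends).
        left = next((slots[j] for j in range(i - 1, -1, -1) if slots[j] is not None), None)
        right = next((slots[j] for j in range(i + 1, len(slots)) if slots[j] is not None), None)
        if left is not None and right is not None:
            result.append((left + right) // 2)  # assumed rule: integer mean
        elif left is not None:
            result.append(left)
        elif right is not None:
            result.append(right)
        else:
            result.append(0)
    return result


def depad(padded_dna: str, padded_quals: List[int]) -> Tuple[str, List[int]]:
    """Drop pads and their quality values, restoring the unpadded pair."""
    keep = [i for i, base in enumerate(padded_dna) if base != "-"]
    return "".join(padded_dna[i] for i in keep), [padded_quals[i] for i in keep]
```

Padding and then depadding returns the original values, because the interpolated entries sit only at pad positions and are discarded again.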
Example:
Sequence : hh26e2.s1 Is_read Unpadded SCF_File hh26e2.s1SCF Template hh26e2 Dye Dye_primer Primer Universal_primer Strand Forward ProcessStatus PASS Align_to_SCF 1 171 1 171 Align_to_SCF 172 595 173 596 Tag DONE 119 119 "AUTO-EDIT: replaced C by t at 119 (double, isolated, strong)" Tag DONE 145 145 "AUTO-EDIT: replaced N by t at 145 (double, compound, strong)" Tag DONE 146 146 "AUTO-EDIT: replaced T by g at 146 (double, compound, strong)" Tag DONE 171 171 "AUTO-EDIT: deleted G at 171 (double, isolated, strong)" Tag DONE 193 193 "AUTO-EDIT: replaced A by c at 193 (double, isolated, strong)" Seq_vec SVEC 1 24 "M13mp18" Clipping QUAL 56 117 Clipping ECLIP 46 241 DNA : hg02b9.s1 CGCTGCAGGTCGACTCTAGAGGCTCCCCTGAGCCGCTGTGGATTGAGGAGGTGAGGCGTG AGGAGGTGAGGAGTGAGAAGGTCAGGAGGGACGGAGGTGACGAGTGAGGAGGCGAGGTGA GGCGTTAGGAGGTGGGGAAGTCAGGAGGTGAGTCAGGACCTGAGGAGTCAGGGGGTGAGG AGTTAGGTGGTCAGGAGTCAGGAGGTGACGAGTTAGGAGGTGGGGCAAGTGAGGAGGTGA GGAGTGAGGACATGACGAGTGAGTAGGTGAGGAGTCACGGGGGTCAGGAACGTGACAAGG TTTCGACGTCCAATCCATCGTTCCAGGACCTTCCAGCTTGTGTCCTCTGACAGTGACCTC ACCTGCCAGGTCTGGCCCTCCTGGCAGGCAAGAGGGCCGGCCGTGGGGGCGGTGGAGGGG GTGGCCTCCCAGGGGTGAAGTCGGGGGTTGGGCTCCGACCGTCTGGCCACCGTTGGGGGT GAGCCCGGTGGGAGTGTTGGGGGGG BaseQuality : hg02b9.s1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 0 0 0 0 0 0 15 0 0 0 0 0 0 18 17 17 0 18 0 0 0 0 0 0 0 0 0 0 19 0 15 0 0 0 0 18 17 24 22 18 21 18 18 0 15 15 21 23 21 0 0 0 0 0 21 0 0 0 16 19 23 21 0 0 0 0 0 21 21 22 0 0 0 0 0 0 0 0 0 0 0 0 18 18 23 22 23 23 15 15 17 0 23 15 21 22 21 21 21 25 19 23 19 30 18 18 18 27 21 0 0 0 0 0 0 25 18 22 15 0 0 0 15 0 21 17 0 0 0 0 0 18 0 0 0 16 16 21 16 0 0 0 0 0 16 16 0 0 0 0 0 0 15 15 0 0 15 15 0 0 0 0 0 0 0 18 18 21 20 20 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 19 19 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 0 0
This is an unpadded reading, sequenced with standard dye primer chemistry and a universal primer. The base at position 172 in the SCF trace has been deleted, hence the Align_to_SCF lines map read position 171 -> trace position 171, and read position 172 -> trace position 173.
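The deleted base can be recovered from the Align_to_SCF lines alone by looking for trace positions that fall between consecutive segments. A minimal sketch, assuming each segment is given in the order (ReadFrom, ReadTo, SCFFrom, SCFTo) as in the example above; the function name is hypothetical:

```python
from typing import List, Tuple

# (read_from, read_to, scf_from, scf_to), all 1-based and inclusive
ScfSegment = Tuple[int, int, int, int]


def uncovered_trace_positions(segments: List[ScfSegment]) -> List[int]:
    """Return trace positions lying between segments, i.e. deleted base calls."""
    ordered = sorted(segments, key=lambda seg: seg[2])
    missing: List[int] = []
    for previous, following in zip(ordered, ordered[1:]):
        missing.extend(range(previous[3] + 1, following[2]))
    return missing


# The two Align_to_SCF lines from the example read.
segments = [(1, 171, 1, 171), (172, 595, 173, 596)]
deleted = uncovered_trace_positions(segments)
```

Here `deleted` contains the single trace position 172, matching the deletion recorded by the AUTO-EDIT tag at read position 171.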
Positions 1 to 31 are sequencing vector. Positions 56 to 117 are high quality, according to the base-quality measures. Positions 46 to 241 align well with the contig consensus.
The read has been autoedited (hence the deleted base) and tags attached to the edited bases.
The corresponding contig is:
Sequence : Contig3
Is_contig
Unpadded
Assembled_from hg82b3.s1 7635 7677 1 43
Assembled_from hg82b3.s1 7678 7697 45 64
Assembled_from hg82b3.s1 7698 7703 66 71
.
.
.
Assembled_from hg02b9.s1 7968 7878 1 91
Assembled_from hg02b9.s1 7877 7745 93 225
Assembled_from hg02b9.s1 7744 7694 227 277
Assembled_from hg02b9.s1 7693 7607 279 365
Assembled_from hg02b9.s1 7606 7603 367 370
Assembled_from hg02b9.s1 7601 7599 371 373
Assembled_from hg02b9.s1 7598 7584 375 389
Assembled_from hg02b9.s1 7583 7570 391 404
Assembled_from hg02b9.s1 7569 7470 406 505
.
.
.
The reverse-complemented read hg02b9.s1 positions [1,505] align to [7470,7968] (this is the full alignment, i.e. not clipped back). The individual unpadded subsections align as indicated, e.g. [1,91] with [7878,7968].
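To make the coordinate arithmetic concrete, the sketch below maps a read position to its contig position using the Assembled_from segments listed for hg02b9.s1. It follows the rule stated earlier: when s1 > s2 the read is reverse-complemented, so contig coordinates decrease as read coordinates increase. The function name is hypothetical.

```python
from typing import List, Optional, Tuple

# (s1, s2, r1, r2): contig interval [s1,s2] aligned with read interval [r1,r2]
Segment = Tuple[int, int, int, int]


def read_to_contig(segments: List[Segment], read_pos: int) -> Optional[int]:
    """Map a 1-based read position to a contig position, or None if unaligned."""
    for s1, s2, r1, r2 in segments:
        if r1 <= read_pos <= r2:
            offset = read_pos - r1
            # s1 > s2 means the reverse complement of the read is aligned.
            return s1 + offset if s1 <= s2 else s1 - offset
    return None


# Assembled_from lines for hg02b9.s1 from the unpadded Contig3 example.
hg02b9 = [
    (7968, 7878, 1, 91),
    (7877, 7745, 93, 225),
    (7744, 7694, 227, 277),
    (7693, 7607, 279, 365),
    (7606, 7603, 367, 370),
    (7601, 7599, 371, 373),
    (7598, 7584, 375, 389),
    (7583, 7570, 391, 404),
    (7569, 7470, 406, 505),
]
```

Read positions that fall between segments, such as 92, return `None`: they are bases in the read with no counterpart in the contig.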
After padding the alignment (with caf_pad) we have:
Sequence : hg02b9.s1 Is_read Padded SCF_File hg02b9.s1SCF Template hg02b9 Dye Dye_primer Primer Universal_primer ProcessStatus QUAL Align_to_SCF 1 3 1 3 Align_to_SCF 5 20 4 19 Align_to_SCF 22 37 20 35 Align_to_SCF 39 46 36 43 Align_to_SCF 48 48 44 44 Align_to_SCF 51 74 45 68 Align_to_SCF 76 80 69 73 Align_to_SCF 82 94 74 86 Align_to_SCF 96 108 87 99 Align_to_SCF 110 121 100 111 Align_to_SCF 123 128 112 117 Align_to_SCF 130 199 118 187 Align_to_SCF 201 267 188 254 Align_to_SCF 269 274 255 260 Align_to_SCF 276 278 261 263 Align_to_SCF 280 281 264 265 Align_to_SCF 283 284 266 267 Align_to_SCF 286 290 268 272 Align_to_SCF 292 292 273 273 Align_to_SCF 294 308 274 288 Align_to_SCF 311 312 289 290 Align_to_SCF 314 317 291 294 Align_to_SCF 319 334 295 310 Align_to_SCF 336 336 311 311 Align_to_SCF 338 371 312 345 Align_to_SCF 373 376 346 349 Align_to_SCF 378 398 350 370 Align_to_SCF 400 442 371 413 Align_to_SCF 444 444 414 414 Align_to_SCF 446 446 415 415 Align_to_SCF 448 468 416 436 Align_to_SCF 470 534 437 501 Align_to_SCF 536 539 502 505 Seq_vec SVEC 1 30 "M13mp18" Clipping QUAL 2 537 DNA hg02b9.s1 539 Strand Forward Clone 4B5 Asped 25-Mar-1996 DNA : hg02b9.s1 CGC-TGCAGGTCGACTCTAG-AGGCTCCCCTGAGCCG-CTGTGGAT-T--GAGGAGGTGA GGCGTGAGGAGGTG-AGGAG-TGAGAAGGTCAGG-AGGGACGGAGGTG-ACGAGTGAGGA G-GCGAGG-TGAGGCGTTAGGAGGTGGGGAAGTCAGGAGGTGAGTCAGGACCTGAGGAGT CAGGGGGTGAGGAGTTAGG-TGGTCAGGAGTCAGGAGGTGACGAGTTAGGAGGTGGGGCA AGTGAGGAGGTGAGGAGTGAGGACATG-ACGAGT-GAG-TA-GG-TGAGG-A-GTCACGG GGGTCAGG--AA-CGTG-ACAAGGTTTCGACGTC-C-AATCCATCGTTCCAGGACCTTCC AGCTTGTGTCC-TCTG-ACAGTGACCTCACCTGCCAGG-TCTGGCCCTCCTGGCAGGCAA GAGGGCCGGCCGTGGGGGCGGT-G-G-AGGGGGTGGCCTCCCAGGGGT-GAAGTCGGGGG TTGGGCTCCGACCGTCTGGCCACCGTTGGGGGTGAGCCCGGTGGGAGTGTTGGG-GGGG BaseQuality : hg02b9.s1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 0 0 0 0 0 0 15 0 0 0 0 0 0 18 17 17 0 18 0 0 0 0 0 0 0 0 0 0 0 0 19 0 15 0 0 0 0 18 17 24 1 22 18 21 18 18 0 15 15 21 23 21 0 0 0 0 0 0 21 0 0 0 
16 19 23 21 0 0 0 0 0 0 21 21 21 22 0 0 0 0 0 0 0 0 0 0 0 0 18 18 23 22 23 23 15 15 17 0 23 15 21 22 21 21 21 25 19 23 19 30 18 18 18 27 21 0 0 0 0 0 0 25 18 22 15 0 0 0 15 0 21 17 0 0 0 0 0 18 0 0 0 16 16 21 16 8 0 0 0 0 0 16 16 0 0 0 0 0 0 15 15 0 0 15 15 0 0 0 0 0 0 0 18 18 21 20 20 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 19 19 9 0 0 0 0 0 0 0 0 0 0 0 0 16 8 0 0 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 0 0 0
Note how the Align_to_SCF is now much more complex, since it has to take into account all the pads inserted into the DNA. The DNA contains pads “-” and the BaseQuality data has been interpolated at each pad position. The alignment of the read to the contig is now much simpler, however:
Sequence : Contig3
Is_contig
Padded
.
.
Assembled_from hg02b9.s1 8618 8080 1 539
.
.
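The long list of Align_to_SCF lines in the padded read follows directly from the pad positions: each run of non-pad bases becomes one segment. The sketch below derives those segments from a padded DNA string, assuming the unpadded read aligned one-to-one with the trace before padding. The function name is hypothetical, and this is not the caf_pad implementation.

```python
from typing import List, Tuple


def pad_segments(padded_dna: str) -> List[Tuple[int, int, int, int]]:
    """Return (padded_from, padded_to, unpadded_from, unpadded_to) per run of bases."""
    segments = []
    unpadded = 0            # count of non-pad bases seen so far
    run_start = None        # (padded position, unpadded position) where the run began
    for position, base in enumerate(padded_dna, start=1):
        if base == "-":
            if run_start is not None:
                segments.append((run_start[0], position - 1, run_start[1], unpadded))
                run_start = None
        else:
            unpadded += 1
            if run_start is None:
                run_start = (position, unpadded)
    if run_start is not None:
        segments.append((run_start[0], len(padded_dna), run_start[1], unpadded))
    return segments


# First 48 bases of the padded hg02b9.s1 DNA shown above.
prefix = "CGC-TGCAGGTCGACTCTAG-AGGCTCCCCTGAGCCG-CTGTGGAT-T"
first_segments = pad_segments(prefix)
```

For this prefix the result reproduces the first five Align_to_SCF lines of the padded read, from `1 3 1 3` through `48 48 44 44`.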
Further information
The CAF format was based on the AceDB dump format, so it has a corresponding AceDB model.