[DAS2] alignments

Andrew Dalke dalke at dalkescientific.com
Fri Apr 14 08:29:46 UTC 2006

I need a bit of help here.  I'm trying to hand-write an example of a
feature based on an alignment.  Let's assume the annotations are on fly
and the feature is aligned to human.  There's a hit from

fly chromosome 4
   range 100:200

to human chromosome 8
   range 200:300

Assume the CIGAR string of the match is
    51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical

Here's the best I can manage:

   <FEATURES xmlns="http://biodas.org/document/das2/">
    <FEATURE uri="feature/00094" type="type/alignment"
        title="Human genome alignment"
        range="100:200" cigar="?????"/>
   </FEATURES>

First question:
   Where do I put the object to which the alignment aligns?  Will
it be a segment or a feature?  Now, I could have this completely wrong
and DAS2 is not meant for genome/genome alignments like this.  If
that's the case please offer an example of how to write an alignment.

Second question:
   What's the format of the CIGAR string?  Lincoln's text pointed to

That documentation says:
> The format starts with the same 9 fields as sugar output (see above),  
> and is followed by a series of <operation, length> pairs where  
> operation is one of match, insert or delete, and the length describes  
> the number of times this operation is repeated.

However, it does not list the operation characters, nor whether there are
spaces between the fields.  I assume it is "M 51 I 3 M 24 D 3 M 25", though
without spaces.
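
To make the candidate spellings concrete, here is a minimal Python sketch
that renders the example alignment from above in each style.  The operation
letters M, I and D and both function names are my assumptions for
illustration; nothing in the documentation quoted here confirms them.

```python
# Example alignment from this message: 51 identical, 3 insertions,
# 24 identical, 3 deletions, 25 identical.
# ASSUMPTION: M = match, I = insert, D = delete (letters not confirmed by any spec here).
OPS = [("M", 51), ("I", 3), ("M", 24), ("D", 3), ("M", 25)]


def operation_first(ops, sep=" "):
    """Render as <operation, length> pairs, separated by `sep`."""
    return sep.join(f"{op}{sep}{length}" for op, length in ops)


def length_first(ops):
    """Render as run-length text with the length before the operation."""
    return "".join(f"{length}{op}" for op, length in ops)


print(operation_first(OPS))          # M 51 I 3 M 24 D 3 M 25
print(operation_first(OPS, sep=""))  # M51I3M24D3M25
print(length_first(OPS))             # 51M3I24M3D25M
```

The three outputs are the spellings I can't choose between without the
real documentation.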

The GFF3 documentation at http://song.sourceforge.net/gff3.shtml refers  
but I can find no relevant documentation there.

I then found a comment by Richard Durbin from two years ago, at

> 3) I'm not convinced by the format for the Align string.  This requires
> a character per aligned base.  There are a variety of run-length type
> encodings in common use that are much more compact.  e.g. Ensembl uses a
> string such as "60M1D8M3I15M" to mean "60 match, then 1 delete, then 8
> match, then 3 insert, then 15 match".  They call this CIGAR, but when I
> talked to Guy Slater, who invented CIGAR for exonerate, his version is
> subtly different: "M 60 D 1 M 8 I 3 M 15" for the same string (see
> http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html).
> Jim Kent also has something like this.  I'd prefer us to standardise on
> one of these formats, all of which are very short for ungapped matches.
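
Until that is settled, a reader could be lenient and accept both spellings
Richard mentions.  The following is only a sketch under that assumption:
parse_cigar is a hypothetical helper, it knows only the letters M, I and D,
and it does not handle the nine leading "sugar" fields that the exonerate
text describes.

```python
import re


def parse_cigar(text):
    """Return [(op, length), ...] for either spelling quoted above.

    Accepts length-first text such as "60M1D8M3I15M" and operation-first
    text such as "M 60 D 1 M 8 I 3 M 15".  Spaces are ignored.
    """
    compact = text.replace(" ", "")
    if re.fullmatch(r"(\d+[MID])+", compact):
        # Length-first: each token is digits followed by one letter.
        return [(op, int(n)) for n, op in re.findall(r"(\d+)([MID])", compact)]
    if re.fullmatch(r"([MID]\d+)+", compact):
        # Operation-first: each token is one letter followed by digits.
        return [(op, int(n)) for op, n in re.findall(r"([MID])(\d+)", compact)]
    raise ValueError(f"unrecognised CIGAR string: {text!r}")


# Both strings from the quoted text describe the same alignment.
assert parse_cigar("60M1D8M3I15M") == parse_cigar("M 60 D 1 M 8 I 3 M 15")
```

That works only because the two forms are unambiguous once you look at
whether the string starts with a digit or a letter; it is no substitute for
DAS2 picking one.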

Which is the CIGAR string format DAS2 supports?  Where is the
documentation for it?

