[DAS2] alignments

Fri Apr 14 08:29:46 UTC 2006

I need a bit of help here.  I'm trying to hand-write an example of a
feature based on an alignment.  Let's assume the annotations are on
fly and the feature is aligned to human.  There's a hit from

fly chromosome 4
   http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4
   range 100:200

to human chromosome 8
   http://www.ensembl.org/Homo_sapiens/Chr1
   range 200:300

Assume the CIGAR string of the match is
    51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical

Here's the best I can manage:

   <FEATURES xmlns="http://biodas.org/document/das2/">
    <FEATURE uri="feature/00094" type="type/alignment"
             title="Human genome alignment">
     <LOC segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
          range="100:200" cigar="?????"/>
    </FEATURE>
   </FEATURES>

First question:
   Where do I put the object to which the alignment aligns?  Will
it be a segment or a feature?  Now, I could have this completely wrong
and DAS2 is not meant for genome/genome alignments like this.  If
that's the case please offer an example of how to write an alignment.

Second question:
   What's the format of the CIGAR string?  Lincoln's text pointed to
      http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html

That documentation says:
> The format starts with the same 9 fields as sugar output (see above),  
> and is followed by a series of <operation, length> pairs where  
> operation is one of match, insert or delete, and the length describes  
> the number of times this operation is repeated.

However, it does not list the operation characters, nor does it say
whether there are spaces between the fields.  I assume it is
"M 51 I 3 M 24 D 3 M 25", though perhaps without spaces.
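
To make the two guesses concrete, here is a minimal Python sketch that
spells out the alignment described earlier both ways.  The function name
and the (operation, length) tuple layout are my own inventions for
illustration, and the letters M/I/D are the assumption stated above, not
something I found in the exonerate or DAS2 documentation.

```python
# Hypothetical helper: not part of exonerate, GFF3, or DAS2.
def format_cigar(runs, sep=" "):
    """Join (operation, length) pairs as 'M 51 I 3 ...'.

    Pass sep="" to get the variant without spaces ('M51I3...').
    """
    fields = []
    for op, length in runs:
        # Assumed operation letters: M = match, I = insert, D = delete.
        if op not in ("M", "I", "D"):
            raise ValueError("unknown operation: %r" % (op,))
        if length <= 0:
            raise ValueError("run length must be positive: %r" % (length,))
        fields.append(op)
        fields.append(str(length))
    return sep.join(fields)

# 51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical
runs = [("M", 51), ("I", 3), ("M", 24), ("D", 3), ("M", 25)]
with_spaces = format_cigar(runs)
without_spaces = format_cigar(runs, sep="")
```

Which of the two outputs (if either) a DAS2 server should emit is exactly
what I don't know.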

The GFF3 documentation at http://song.sourceforge.net/gff3.shtml refers to
  http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl
but I can find no relevant documentation there.

I then found a comment by Richard Durbin from February 2003, at

http://portal.open-bio.org/pipermail/bioperl-l/2003-February/011234.html

> 3) I'm not convinced by the format for the Align string.  This requires
> a character per aligned base.  There are a variety of run-length type
> encodings in common use that are much more compact.  e.g. Ensembl uses a
> string such as "60M1D8M3I15M" to mean "60 match, then 1 delete, then 8
> match, then 3 insert, then 15 match".  They call this CIGAR, but when I
> talked to Guy Slater, who invented CIGAR for exonerate, his version is
> subtly different: "M 60 D 1 M 8 I 3 M 15" for the same string (see
> http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html).
> Jim Kent also has something like this.  I'd prefer us to standardise on
> one of these formats, all of which are very short for ungapped matches.
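
For comparison, the two spellings Richard describes carry the same
information, and a short Python sketch can read either one into the same
list of runs.  The function names are hypothetical and the regular
expressions are only my reading of his two examples; neither is taken
from Ensembl or exonerate documentation.

```python
import re

def parse_ensembl_style(text):
    """Parse length-then-operation runs, e.g. '60M1D8M3I15M'."""
    if not re.fullmatch(r"(\d+[MID])+", text):
        raise ValueError("not a length-operation string: %r" % (text,))
    return [(op, int(n)) for n, op in re.findall(r"(\d+)([MID])", text)]

def parse_exonerate_style(text):
    """Parse operation-then-length runs, e.g. 'M 60 D 1 M 8 I 3 M 15'.

    Whitespace between fields is treated as optional, since I don't
    know whether the real format requires it.
    """
    if not re.fullmatch(r"\s*([MID]\s*\d+\s*)+", text):
        raise ValueError("not an operation-length string: %r" % (text,))
    return [(op, int(n)) for op, n in re.findall(r"([MID])\s*(\d+)", text)]

# Both of Richard's example strings describe the same five runs.
ensembl_runs = parse_ensembl_style("60M1D8M3I15M")
exonerate_runs = parse_exonerate_style("M 60 D 1 M 8 I 3 M 15")
```

So converting between them is trivial; the problem is only knowing which
one a DAS2 client should expect.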

Which is the CIGAR string format DAS2 supports?  Where is the
documentation for it?

					Andrew
					dalke at dalkescientific.com