[DAS2] alignments

Mon Apr 17 13:46:23 UTC 2006

I didn't realize there were multiple things called CIGAR. I think we should
use the Ensembl CIGAR format (the compact "60M1D8M3I15M" style quoted below).

The target of the alignment should be a segment, and not another feature.
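
To make the difference between the two spellings concrete, here is a rough
sketch in Python that reads either one and writes the other. The function
names are made up for illustration and are not part of DAS2, Ensembl or
exonerate. It assumes only the three operations M, I and D, and that every
operation carries an explicit count.

```python
import re

# Ensembl style: count followed by operation letter, no separators,
# e.g. "60M1D8M3I15M".  Assumption: every operation has an explicit count.
_ENSEMBL_PAIR = re.compile(r"(\d+)([MID])")
_OPERATIONS = {"M", "I", "D"}


def parse_ensembl_cigar(cigar):
    """Return a list of (operation, length) pairs from an Ensembl-style string."""
    pairs = []
    pos = 0
    while pos < len(cigar):
        match = _ENSEMBL_PAIR.match(cigar, pos)
        if match is None:
            raise ValueError(f"bad CIGAR at offset {pos}: {cigar!r}")
        pairs.append((match.group(2), int(match.group(1))))
        pos = match.end()
    return pairs


def parse_exonerate_cigar(cigar):
    """Return (operation, length) pairs from a spaced string like "M 60 D 1"."""
    tokens = cigar.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"odd number of fields: {cigar!r}")
    pairs = []
    for op, count in zip(tokens[0::2], tokens[1::2]):
        if op not in _OPERATIONS or not count.isdigit():
            raise ValueError(f"bad pair {op!r} {count!r} in {cigar!r}")
        pairs.append((op, int(count)))
    return pairs


def to_ensembl(pairs):
    """Write pairs in the compact count-then-letter form."""
    return "".join(f"{count}{op}" for op, count in pairs)


def to_exonerate(pairs):
    """Write pairs in the spaced letter-then-count form."""
    return " ".join(f"{op} {count}" for op, count in pairs)
```

With this, Andrew's example (51 identical, 3 insertions, 24 identical,
3 deletions, 25 identical) would be written "51M3I24M3D25M" in the compact
form, which is what would go in the cigar attribute.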

Best,

Lincoln

On Friday 14 April 2006 04:29, Andrew Dalke wrote:
> I need a bit of help here.  I'm trying to hand-write an example of a
> feature based on an alignment.  Let's assume these are annotations on
> fly and it's aligned to human.  There's a hit from
>
> fly chromosome 4
>    http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4
>    range 100:200
>
> to human chromosome 8
>    http://www.ensembl.org/Homo_sapiens/Chr1
>    range 200:300
>
> Assume the CIGAR string of the match is
>     51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical
>
> Here's the best I can manage:
>
>    <FEATURES xmlns="http://biodas.org/document/das2/">
>     <FEATURE uri="feature/00094" type="type/alignment"
>              title="Human genome alignment">
>      <LOC segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
>           range="100:200" cigar="?????"/>
>     </FEATURE>
>    </FEATURES>
>
>
> First question:
>    Where do I put the object to which the alignment aligns?  Will
> it be a segment or a feature?  Now, I could have this completely wrong
> and DAS2 is not meant for genome/genome alignments like this.  If
> that's the case please offer an example of how to write an alignment.
>
>
> Second question:
>    What's the format of the CIGAR string?  Lincoln's text pointed to
>       http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html
>
> That documentation says:
> > The format starts with the same 9 fields as sugar output (see above),
> > and is followed by a series of <operation, length> pairs where
> > operation is one of match, insert or delete, and the length describes
> > the number of times this operation is repeated.
>
> However, it does not list the operation characters nor whether there are
> spaces between the fields.  I assume it is "M 51 I 3 M 24 D 3 M 25",
> though perhaps without spaces.
>
> The GFF3 documentation at http://song.sourceforge.net/gff3.shtml refers to
>   http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl
> but I can find no relevant documentation there.
>
> I then found a comment by Richard Durbin from three years ago, at
>
> http://portal.open-bio.org/pipermail/bioperl-l/2003-February/011234.html
>
> > 3) I'm not convinced by the format for the Align string.  This requires
> > a character per aligned base.  There are a variety of run-length type
> > encodings in common use that are much more compact.  e.g. Ensembl uses
> > a string such as "60M1D8M3I15M" to mean "60 match, then 1 delete, then 8
> > match, then 3 insert, then 15 match".  They call this CIGAR, but when I
> > talked to Guy Slater, who invented CIGAR for exonerate, his version is
> > subtly different: "M 60 D 1 M 8 I 3 M 15" for the same string (see
> > http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html).
> > Jim Kent also has something like this.  I'd prefer us to standardise on
> > one of these formats, all of which are very short for ungapped matches.
>
> Which is the CIGAR string format DAS2 supports?  Where is the
> documentation for it?
>
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu