[DAS2] best practices / DAS2 format examples

Helt,Gregg Gregg_Helt at affymetrix.com
Mon Sep 11 20:07:05 UTC 2006


> -----Original Message-----
> From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke
> Sent: Monday, September 11, 2006 10:53 AM
> To: DAS/2
> Subject: [DAS2] best practices / DAS2 format examples
> 
> das2-teleconf-2006-03-16.txt
> > [A] Lincoln will provide use cases/examples of these features
> > scenarios:
> > - three or greater hierarchy features
> > - multiple parents
> > - alignments
> 
> I really would like some real-world examples of these.  I don't know
> enough to make decent examples for the documentation and I think it
> would be very useful so others can see how to model existing data
> in DAS2 XML.

I found a previous post from Lincoln with attached alignment examples:

> -----Original Message-----
> From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Lincoln Stein
> Sent: Monday, June 05, 2006 7:32 AM
> To: Andrew Dalke
> Cc: DAS/2
> Subject: [DAS2] Example alignments
> 
> Hi Andrew,
> 
> I'm truly sorry about how long it has taken me to get these examples to you.
> I hope that the example alignments in the enclosure make sense to you.
> 
> Unfortunately I found that I had to add a new "target" attribute to <LOC>
> in order to make the cigar string semantics unambiguous. Otherwise you
> wouldn't be able to tell how to interpret the gaps.
> 
> Lincoln
> 

CASE #1. A SIMPLE PAIRWISE ALIGNMENT.

A simple alignment is one in which the alignment is represented as a
single feature with no subfeatures. This is the preferred
representation to be used when the entire alignment shares the same
set of properties.

This is an alignment between Chr4 (the reference) and EST23 (the
target). Both aligned sequences are in the forward (+) direction. We
represent this as a single alignment:

Chr4       100 CAAGACCTAAA-CTGGAATTCCAATCGCAACTCCTGGACC-TATCTATA 147
               |||||||X||| ||||| |||||||       ||||X||| ||||||||
EST23        1 CAAGACCAAAATCTGGA-TTCCAAT-------CCTGCACCCTATCTATA  41

This has a CIGAR gap string of M11 I1 M5 D1 M7 D7 M8 I1 M8:

     M11  match 11 bp
     I1   insert 1 gap into the reference sequence
     M5   match 5 bp
     D1   insert 1 gap into the target sequence
     M7   match 7 bp
     D7   insert 7 gaps into the target
     M8   match 8 bp
     I1   insert 1 gap into the reference
     M8   match 8 bp
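
As a rough illustration of the semantics listed above, the following
Python sketch tallies how many reference and target bases a gap string
covers. The function names parse_gap and consumed_lengths are made up
for this example and are not part of DAS2; the only assumption is the
M/I/D meaning given in the list.

```python
import re

# One operation is a letter followed by a count, e.g. "M11".
_OP = re.compile(r"([MID])(\d+)")

def parse_gap(gap):
    """Split a gap string such as "M11 I1 M5" into (op, count) pairs."""
    ops = []
    for token in gap.split():
        m = _OP.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognized gap token: {token!r}")
        ops.append((m.group(1), int(m.group(2))))
    return ops

def consumed_lengths(gap):
    """Return (reference_bases, target_bases) covered by the gap string."""
    ref = tgt = 0
    for op, n in parse_gap(gap):
        if op in "MD":   # M and D advance along the reference
            ref += n
        if op in "MI":   # M and I advance along the target
            tgt += n
    return ref, tgt

print(consumed_lengths("M11 I1 M5 D1 M7 D7 M8 I1 M8"))  # (47, 41)
```

A check like this is a cheap way to catch a gap string that does not
agree with the sequence lengths being aligned.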

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES
     xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/Jun2006/">

<FEATURE uri="./Alignment1" type="./expressed_sequence_match">
  <LOC
       segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
       range="100:147:1" />
  <LOC
       segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
       target="1"
       range="1:41:1"
       gap="M11 I1 M5 D1 M7 D7 M8 I1 M8" />
  <PROP key="est2genomescore" value="180" />
</FEATURE>
    
</FEATURES>

NOTE: I've had to introduce a new <LOC> attribute named "target" in
order to distinguish the reference sequence from the target
sequence. This is necessary for the CIGAR string concepts to work.

Perhaps it would be better to have a "role" attribute whose value is
either "ref" or "target"?
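
To make the intent of the "target" attribute concrete, here is a small
Python sketch of how a client might separate the reference <LOC> from
the target <LOC> in a CASE #1 style document. It uses only the standard
library; the helper name split_locations is hypothetical, and the
embedded document is a well-formed copy of the example with <LOC>
written as an empty element.

```python
import xml.etree.ElementTree as ET

NS = {"das": "http://biodas.org/documents/das2"}

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<FEATURES
     xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/Jun2006/">
<FEATURE uri="./Alignment1" type="./expressed_sequence_match">
  <LOC
       segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
       range="100:147:1" />
  <LOC
       segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
       target="1"
       range="1:41:1"
       gap="M11 I1 M5 D1 M7 D7 M8 I1 M8" />
  <PROP key="est2genomescore" value="180" />
</FEATURE>
</FEATURES>
"""

def split_locations(feature):
    """Return (reference_loc, target_loc) attribute dicts for one FEATURE.

    Assumption: the target is the <LOC> carrying target="1"; any other
    <LOC> is treated as the reference.
    """
    ref = tgt = None
    for loc in feature.findall("das:LOC", NS):
        if loc.get("target") == "1":
            tgt = dict(loc.attrib)
        else:
            ref = dict(loc.attrib)
    return ref, tgt

root = ET.fromstring(DOC)
feature = root.find("das:FEATURE", NS)
ref, tgt = split_locations(feature)
print(ref["range"], tgt["range"], tgt["gap"])
```

Without the attribute, the two <LOC> elements would be
indistinguishable and the client could not tell which sequence the I
and D operations refer to.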

<!-------------------------------------------------------------------->

CASE #2. A COMPLEX PAIRWISE ALIGNMENT.

The complex pairwise alignment is used when the alignment is the
composite of two different alignments, each of which has its own set
of properties. An example of this is BLAST, in which each "BLAST hit"
is composed of multiple aligned segments called "HSPs".

We extend the previous example by adding another aligned segment to
the alignment.

BLAST hit: align Chr4:100:300 with EST23:1:58

HSP 1:

Chr4       100 CAAGACCTAAA-CTGGAATTCCAATCGCAACTCCTGGACC-TATCTATA 147
               |||||||X||| ||||| |||||||       ||||X||| ||||||||
EST23        1 CAAGACCAAAATCTGGA-TTCCAAT-------CCTGCACCCTATCTATA  41

BLAST score = 80

CIGAR gap string M11 I1 M5 D1 M7 D7 M8 I1 M8:


HSP 2:

Chr4       211 TCAAACTGATAATGGGGT 228
               ||||||||||| ||||||
EST23       42 TCAAACTGATA-TGGGGT  58

BLAST score = 85

CIGAR gap string M11 D1 M6
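
For anyone who wants to verify a gap string against the picture, this
Python sketch rebuilds the two gapped rows from the ungapped sequences.
The function name gapped_rows is invented for the example; the I/D
semantics are the ones described under CASE #1.

```python
def gapped_rows(ref_seq, tgt_seq, gap):
    """Rebuild the gapped (reference_row, target_row) from a gap string."""
    ref_row, tgt_row = [], []
    i = j = 0  # next unread position in ref_seq / tgt_seq
    for token in gap.split():
        op, n = token[0], int(token[1:])
        if op == "M":    # aligned columns: take n bases from both
            ref_row.append(ref_seq[i:i + n])
            tgt_row.append(tgt_seq[j:j + n])
            i += n
            j += n
        elif op == "I":  # gap in the reference, bases only in the target
            ref_row.append("-" * n)
            tgt_row.append(tgt_seq[j:j + n])
            j += n
        elif op == "D":  # gap in the target, bases only in the reference
            ref_row.append(ref_seq[i:i + n])
            tgt_row.append("-" * n)
            i += n
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(ref_seq) or j != len(tgt_seq):
        raise ValueError("gap string does not cover both sequences")
    return "".join(ref_row), "".join(tgt_row)

# HSP 2, with the gap characters removed from the sequences.
ref_row, tgt_row = gapped_rows("TCAAACTGATAATGGGGT",
                               "TCAAACTGATATGGGGT",
                               "M11 D1 M6")
print(ref_row)  # TCAAACTGATAATGGGGT
print(tgt_row)  # TCAAACTGATA-TGGGGT
```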

We represent this as an "expressed_sequence_match" feature relating
Chr4 100:300 to EST23 1:58. The feature contains two subparts, one
corresponding to HSP 1 and the other to HSP 2.

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES
     xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/Jun2006/">

  <!-- A feature for the entire BLAST hit -->

   <FEATURE uri="./Alignment2" type="./expressed_sequence_match">
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
          range="100:300:1" />
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
          target="1"
          range="1:58:1" />
     <PART uri="./Alignment2.1" />
     <PART uri="./Alignment2.2" />
   </FEATURE>

  <!-- HSP 1 -->
   <FEATURE uri="./Alignment2.1" type="./match_part">
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
          range="100:147:1" />
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
          target="1"
          range="1:41:1"
          gap="M11 I1 M5 D1 M7 D7 M8 I1 M8" />
     <PARENT uri="./Alignment2" />
     <PROP key="blastscore" value="80" />
   </FEATURE>
    
  <!-- HSP 2 -->
   <FEATURE uri="./Alignment2.2" type="./match_part">
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
          range="211:228:1" />
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
          target="1"
          range="42:58:1"
          gap="M11 D1 M6" />
     <PARENT uri="./Alignment2" />
     <PROP key="blastscore" value="85" />
   </FEATURE>

</FEATURES>
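
A client reading a document like this one may want to sanity-check the
hierarchy. The Python sketch below is one possible check, not something
the DAS2 spec prescribes: it confirms that every <PART> has a matching
<PARENT> back-link and that each HSP range lies inside the parent range
on the same segment. The names check_hierarchy and parse_range are
hypothetical, URIs are compared as plain strings without resolving them
against xml:base, and the embedded document is an abbreviated,
well-formed copy of the example (segment URIs shortened to "seg/...").

```python
import xml.etree.ElementTree as ET

NS = {"das": "http://biodas.org/documents/das2"}

DOC = b"""<FEATURES xmlns="http://biodas.org/documents/das2">
  <FEATURE uri="./Alignment2" type="./expressed_sequence_match">
    <LOC segment="seg/4" range="100:300:1" />
    <LOC segment="seg/EST23" target="1" range="1:58:1" />
    <PART uri="./Alignment2.1" />
    <PART uri="./Alignment2.2" />
  </FEATURE>
  <FEATURE uri="./Alignment2.1" type="./match_part">
    <LOC segment="seg/4" range="100:147:1" />
    <LOC segment="seg/EST23" target="1" range="1:41:1"
         gap="M11 I1 M5 D1 M7 D7 M8 I1 M8" />
    <PARENT uri="./Alignment2" />
  </FEATURE>
  <FEATURE uri="./Alignment2.2" type="./match_part">
    <LOC segment="seg/4" range="211:228:1" />
    <LOC segment="seg/EST23" target="1" range="42:58:1" gap="M11 D1 M6" />
    <PARENT uri="./Alignment2" />
  </FEATURE>
</FEATURES>
"""

def parse_range(text):
    """Split "start:end:strand" into three integers."""
    start, end, strand = text.split(":")
    return int(start), int(end), int(strand)

def check_hierarchy(root):
    """Return a list of problems found in PART/PARENT links and ranges."""
    features = {f.get("uri"): f for f in root.findall("das:FEATURE", NS)}
    problems = []
    for uri, feat in features.items():
        parent_locs = {loc.get("segment"): parse_range(loc.get("range"))
                       for loc in feat.findall("das:LOC", NS)}
        for part in feat.findall("das:PART", NS):
            child_uri = part.get("uri")
            child = features.get(child_uri)
            if child is None:
                problems.append(f"{uri}: part {child_uri} not found")
                continue
            back = [p.get("uri") for p in child.findall("das:PARENT", NS)]
            if uri not in back:
                problems.append(f"{child_uri}: no PARENT link to {uri}")
            for loc in child.findall("das:LOC", NS):
                outer = parent_locs.get(loc.get("segment"))
                start, end, _ = parse_range(loc.get("range"))
                if outer is None or start < outer[0] or end > outer[1]:
                    problems.append(f"{child_uri}: range outside {uri}")
    return problems

print(check_hierarchy(ET.fromstring(DOC)))  # []
```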



