[Bioperl-l] sanity check my understanding of bioperl's location terminology?

Heikki Lehvaslaiho heikki at sanbi.ac.za
Sat Oct 18 06:47:52 UTC 2008


George,

You've got it all right expect the last bit.

Bio::LocatableSeq::column_from_residue_number() is a special case because it's 
input is in original sequence coordinates, of which the LocatableSeq in 
question is part of.

Look at the tests. I just added a few on column_from_residue_number() that 
show that if you take a revcom() on a LocatebleSeq the outcome of this method 
remains the same! The reason is that within a same alignment, a revcomed 
sequence is not a the same one any more. You can not put it back into the same 
alignment.

The following demonstrates it by taking two sequences that happen to have 
almost identical in the wrong strand (I hope I did not mess this up by doing 
it by hand):

seq1  ttaccta
seq2  atgctat

      1234567890123
seq1  --atg---gtaa-  -1
seq2  --atg---ctat-   1


is $seq1->column_from_residue_number(5),5;
is $seq1->column_from_residue_number(4),9;

is $seq2->column_from_residue_number(5),10;
is $seq2->column_from_residue_number(4),9;

Maybe Ewan can be dragged from his bioperl retirement to point us to an old 
document somewhere that explains all the logic behind the way strand is used 
in bioperl?

   -Heikki


On Saturday 18 October 2008 00:03:42 George Hartzell wrote:
> Hi All,
>
> I'm not sure that I'm understanding Bioperl's location terminology and
> how it's carried through into some basic technology like the
> LocatableSeqs.  Hopefully this is more about communicating in the
> shared bioperl language and I'm not just demonstrating how much
> biology I've forgotten.  This is being driven by my gmap SearchIO
> parser (which hopefully will get committed at some point), which
> currently returns GenericHSPs in GenericHits and from which I can
> retrieve SimpleAlign's.  I'm just not sure that I'm translating from
> gmap-speak to bioperl-speak correctly (assumptions, it's always
> assumptions...).
>
> There's the basic truism from e.g. Bio::Range:
>
>   length = end - start + 1
>   end >= start
>   strand = (-1 | 0 | +1)
>
> So if I have seq_id => foo
>
>   5' AACTGTTTGG  3'
>      1   5    1
>               0
>
> so
>
>    -start => 3, -end => 6, -strand => +1 would be: CTGT
>
> and
>
>    -start => 4, -end => 4, -strand => +1 would be: T
>
> Things get goofier when strand is -1, but I'm pretty confident that
> one would say
>
>    -start => 4, -end => 4, -strand => -1 would be: A
>
> (but I'm worried that I should say it's T, or the reverse compliment
>  of T or something complicated)

You are taking the reverse compliment  of T , that is A.

> and slightly less confident that one would say
>
>    -start => 3, -end => 6, -strand => -1 would be: ACAG
>
> (in other words, always spelling the sequence out 5' to 3' from the
>  reverse strand in that range).

Yes.

> Where I really get shaky about how I understand how things are
> supposed to be said is when LocatableSeqs get involved.
>
> If I have a simple align that contains a row w/
>
>   -seq_id => foo, -start => 3, -end => 6, -strand => 1
>
> then I think that the row might look like this in an alignment:
>
> moose CTCT
> foo   CTGT
> bar   C-GT
>
> and that if
>
> moose AGAG
> foo   ACAG
> bar   AC-G
>
> were the rows in the alignment then it's info would be
>
>   -seq_id => foo, -start => 3, -end => 6, -strand => -1.
>
> Now here's where I really loose faith in my understanding.  There's
> code in Bio::LocatableSeq::column_from_residue_number (around line
> 301) that tests the strand and
>
>   if strand == -1 then it counts back from the -end coordinate towards
>     the -start
>   but if it's +1 then it counts up from 0 towards the -end.

and it adds one to the final result.

> That only makes sense if the sequence string that's contained in the
> locatable seq that's part of the simple alignment is actually the
> forward strand, in which case the alignment would *look* really goofy
> visually.  If on the other hand the string in the alignment were the
> 5' to 3' writing of reverse strand (which would look like it aligns
> with the other strings in the alignment because the bases would match)
> then the counting seems messed up.
>
> If someone wants to round out my understanding and squash any bugs in
> it, I can try to use it to seed a Location HowTo or something.
>
> Thanks,
>
> g.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

-- 
______ _/      _/_____________________________________________________
      _/      _/
     _/  _/  _/  Heikki Lehvaslaiho    heikki at_sanbi _ac _za
    _/_/_/_/_/  Senior Scientist    skype: heikki_lehvaslaiho
   _/  _/  _/  SANBI, South African National Bioinformatics Institute
  _/  _/  _/  University of Western Cape, South Africa
     _/      Phone: +27 21 959 2096   FAX: +27 21 959 2512
___ _/_/_/_/_/________________________________________________________




More information about the Bioperl-l mailing list