[Bioperl-l] sanity check my understanding of bioperl's location terminology?

Fri Oct 17 22:03:42 UTC 2008

Hi All,

I'm not sure that I'm understanding Bioperl's location terminology and
how it's carried through into some basic technology like the
LocatableSeqs.  Hopefully this is more about communicating in the
shared bioperl language and I'm not just demonstrating how much
biology I've forgotten.  This is being driven by my gmap SearchIO
parser (which hopefully will get committed at some point), which
currently returns GenericHSPs in GenericHits and from which I can
retrieve SimpleAlign's.  I'm just not sure that I'm translating from
gmap-speak to bioperl-speak correctly (assumptions, it's always
assumptions...).

There's the basic truism from e.g. Bio::Range:

  length = end - start + 1
  end >= start
  strand = (-1 | 0 | +1)

So if I have seq_id => foo

  5' AACTGTTTGG  3'
     1   5    1
              0

so

   -start => 3, -end => 6, -strand => +1 would be: CTGT

and

   -start => 4, -end => 4, -strand => +1 would be: T

Things get goofier when strand is -1, but I'm pretty confident that
one would say

   -start => 4, -end => 4, -strand => -1 would be: A

(but I'm worried that I should say it's T, or the reverse compliment
 of T or something complicated)

and slightly less confident that one would say

   -start => 3, -end => 6, -strand => -1 would be: ACAG

(in other words, always spelling the sequence out 5' to 3' from the
 reverse strand in that range).

Where I really get shaky about how I understand how things are
supposed to be said is when LocatableSeqs get involved.

If I have a simple align that contains a row w/

  -seq_id => foo, -start => 3, -end => 6, -strand => 1 

then I think that the row might look like this in an alignment:

moose CTCT
foo   CTGT
bar   C-GT

and that if 

moose AGAG
foo   ACAG
bar   AC-G

were the rows in the alignment then it's info would be

  -seq_id => foo, -start => 3, -end => 6, -strand => -1.

Now here's where I really loose faith in my understanding.  There's
code in Bio::LocatableSeq::column_from_residue_number (around line
301) that tests the strand and

  if strand == -1 then it counts back from the -end coordinate towards
    the -start
  but if it's +1 then it counts up from 0 towards the -end.

That only makes sense if the sequence string that's contained in the
locatable seq that's part of the simple alignment is actually the
forward strand, in which case the alignment would *look* really goofy
visually.  If on the other hand the string in the alignment were the
5' to 3' writing of reverse strand (which would look like it aligns
with the other strings in the alignment because the bases would match)
then the counting seems messed up.

If someone wants to round out my understanding and squash any bugs in
it, I can try to use it to seed a Location HowTo or something.

Thanks,

g.