[Bioperl-l] alignIO::fasta bug

Mon Jan 26 03:10:08 UTC 2009

Chris--
Yes, I see what you mean. Trickier than I thought. Sounds like its time to 
'mature' LocatableSeq.
Wiki's a great place for that-
MAJ
----- Original Message ----- 
From: "Chris Fields" <cjfields at illinois.edu>
To: "Mark A. Jensen" <maj at fortinbras.us>
Cc: "Jason Stajich" <jason at bioperl.org>; "BioPerl List" <bioperl-l at bioperl.org>
Sent: Saturday, January 24, 2009 12:25 AM
Subject: Re: [Bioperl-l] alignIO::fasta bug


>
> On Jan 22, 2009, at 10:54 PM, Mark A. Jensen wrote:
>
>> couple of thoughts...
>>
>> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
>> >
>> To: "Jason Stajich" <jason at bioperl.org>
>> Cc: "BioPerl List" <bioperl-l at bioperl.org>
>> Sent: Thursday, January 22, 2009 10:54 PM
>> Subject: Re: [Bioperl-l] alignIO::fasta bug
>>
>>> On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:
>>>
>>>> FYI - it appears that if the last sequence in a FASTA MSA is all   gaps 
>>>> (which happens in some whole genome synteny+multiple  alignment  chunking 
>>>> like Mercator) no alignment is returned.
>>>>
>>>> yuck. It basically comes down to this bit of code where $end  would  equal 
>>>> sequence length.
>>>>
>>>> #  If $end <= 0, we have either reached the end of
>>>> #  file in <> or we have encountered some other error
>>>> if ( $end <= 0 &&  ) {
>>>> undef $aln;
>>>> return $aln;
>>>> }
>> [haven't looked at the code, but] On the surface, this looks like a  bit more
>> responsibility than $end should be expected to handle, so I like  Jason's
>> solution below better, which is only masquerading as a kludge.
>>
>>>>
>>>> This start/end requirement of locatable seq is nice but kind of a   pain 
>>>> where I am managing the map of sequences outside of  alignment  chunk.
>>>
>>> I faced this issue with Mauve parsing (Bio::SeqIO::xmfa).  I set it  up  so 
>>> LocatableSeqs can now have all gaps, but the alphabet needs  to be  set (get 
>>> a warning otherwise) and start and end need to be  initiated  to 0, which is 
>>> how I believe Mauve defined these.   However, should 0-0  be a valid 
>>> start/end for such a sequence?   Should we change that to  automatically 
>>> allow start = end = X (any  position including 0) if a  sequence is all gaps 
>>> or empty?
>>
>> I don't like any old start==end implying zero length, even under the 
>> condition
>> that the underlying sequence is empty, since in the "1-origin,  endpoints" 
>> model
>> that pervades BP (as opposed to the "0-origin, length" model of,  say, 
>> substr),
>> the pair ($start, $end) has the strong connotation of "the residue  at 
>> $start"
>> if $start==$end. (At least, it does now that we've fixed  LocatableSeq...)
>
> It could just as easily be undef (no start/end).  However, see below...
>
>> What if we consider 0 to be special, the 'sequence anchor', that  takes up no
>> real space? I'm thinking of 'point' and 'mark' in emacs, that  actually point 
>> at
>> the interstices between characters, and not the characters  themselves. Or \G
>> for something perly. Then $start means not just the coordinate, but  the
>> 'space' before the residue at $start,  and $end means the 'space'  after the
>> residue at $end. If $start==$end > 0, how many residues between  $start and 
>> $end?
>> One. If $start == $end == 0, how many residues? None, because the  anchor
>> is special, it doesn't take up residues.
>>
>> If a sequence is all gaps, what's its length? It has an understood  anchor at 
>> 0,
>> then the gap symbols are removed, so its length is the length of the  anchor
>> alone, which is zero, and $start = 0 is the 'space' before the  anchor, and
>> $end = 0 is the 'space' after the anchor.
>
> Where this becomes sticky is taking alignment sections, a problem  encountered 
> when attempting to create interleaved output, for  instance.  Let's say (for 
> example) you have a long alignment that has  large gaps (think genome 
> alignment ala MUMmer or Mauve).  If you take  a subsection of that alignment, 
> you could very possibly end up with a  sequence that is all gaps.  Here is an 
> example from Rfam (note  AE013109.1):
>
> AL596170.1/181975-181872 
> AG................................................
> BA000016.3/1419453-1419329 
> UU..........................................AAUUAU
> AE001437.1/2206791-2206643 
> AGGAAUUAUUUUGAAGAACUUUAUAGACGUAGAA..........GAAAAA
> BA000016.3/1419328-1419245 
> GG................................................
> AE013183.1/26-124 
> AG.............................................GAU
> AE013109.1/12950-12813 
> GAGA........................................AUAUAG
> AE013109.1 / 13583 -13495 ..................................................
> AF044978.1/176-335 
> AAUC........................................AUGCAA
> X84262.1/148-332 
> CAGCUCAAAAUCCAUAUAUAA.......................ACAAGA
> CR954253.1/898444-898260 
> UCUCUCAAAGUCUA..............................CUUAAA
>
> So let's say we take the slice above from the alignment and create a  new 
> SimpleAlign.  We know the original coordinates for each  LocatableSeq; should 
> we override them and mark the start = end = 0?   Or would we want to have a 
> sequence that is indicated as between the  end of the last known position and 
> the beginning of the next  (500^501)?  This also doesn't take into account 
> start/end gaps; maybe  for beginning gaps start = end = 0 and end gaps start = 
> end = length.
>
> For no sequence at all (undef) start = end = undef as well.
>
>> Now this would mean say, $s->subseq(0, $n) eq $s->subseq(1, $n). I  don't
>> think this is too troubling, since 1) it's consistent with the  concept 
>> above, and
>> 2) zeros would only show up when the empty sequences are encountered.
>
> Okay.
>
>>> If we come up with some rough ideas of how to handle this we can  add  some 
>>> examples to the test suite and try getting LocatableSeq  to do the  right 
>>> thing.  We can always mark them as TODO.
>>>
>>>> Why not just check to see that the number of seq characters is 0  -  an 
>>>> all-gapped sequence as the last sequence of the file should  still  be 
>>>> legal.
>>>>
>>>> Instead:
>>>> if ( length($seqchar) == 0 ) {
>>>> undef $aln;
>>>> return $aln;
>>>> }
>>>>
>>>> Although that would invalidate an empty alignment like this -- do  we  want 
>>>> to still permit these?
>>>> >A
>>>>
>>>> >B
>>>>
>>>> >C
>>>
>>> These could be zero-length, empty seqs (start = end = undef).  I   thought 
>>> something was added to PrimarySeq recently for empty seqs.
>>
>> Maybe the above concept would encompass empty sequences (gapped
>> or ungapped) without recourse to undef.
>>
>> cheers, Mark
>
> From this discussion I think it's worth coming up with some kind of  idea on 
> what we expect LocatableSeq to do under various circumstances;  the wiki is a 
> good place to start something up.  I don't think that  refactoring everything 
> is necessary, but it might be worth putting out  some variations for 
> consideration.
>
> For instance I have had the idea of making a new RangeI-based  Bio::SeqI that 
> would just delegate the RangeI methods to a 'source'  SeqFeature containing a 
> Bio::LocationI.  The PrimarySeqI would be a  simplified LocatableSeq capable 
> of dealing with subseq issues as  discussed previously, but simplified. In 
> fact, it should probably  recognize any PrimarySeqI.
>
> Advantages: can add annotation and features specific for the sequence,  and 
> (beyond simple decorated methods used to make access simple) it is  a Bio::Seq 
> and could be persisted in a BioSQL.  Disadvantages: lots of  objects and thus 
> slower.
>
> chris
>
>