[Bioperl-l] alignIO::fasta bug

Mon Jan 26 03:51:33 UTC 2009

Not to mention the wiki's malleable and editable by anyone (so we can  
all work on ideas for the interface and tests, then start coding  
towards a decent solution).

Jason, any additional ideas?  I would like to hammer out the specifics  
on how to deal with various symbols, how we extract a subsequence via  
subseq(), etc.

-c

On Jan 25, 2009, at 9:10 PM, Mark A. Jensen wrote:

> Chris--
> Yes, I see what you mean. Trickier than I thought. Sounds like its  
> time to 'mature' LocatableSeq.
> Wiki's a great place for that-
> MAJ
> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu 
> >
> To: "Mark A. Jensen" <maj at fortinbras.us>
> Cc: "Jason Stajich" <jason at bioperl.org>; "BioPerl List" <bioperl-l at bioperl.org 
> >
> Sent: Saturday, January 24, 2009 12:25 AM
> Subject: Re: [Bioperl-l] alignIO::fasta bug
>
>
>>
>> On Jan 22, 2009, at 10:54 PM, Mark A. Jensen wrote:
>>
>>> couple of thoughts...
>>>
>>> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
>>> >
>>> To: "Jason Stajich" <jason at bioperl.org>
>>> Cc: "BioPerl List" <bioperl-l at bioperl.org>
>>> Sent: Thursday, January 22, 2009 10:54 PM
>>> Subject: Re: [Bioperl-l] alignIO::fasta bug
>>>
>>>> On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:
>>>>
>>>>> FYI - it appears that if the last sequence in a FASTA MSA is  
>>>>> all   gaps (which happens in some whole genome synteny+multiple   
>>>>> alignment  chunking like Mercator) no alignment is returned.
>>>>>
>>>>> yuck. It basically comes down to this bit of code where $end   
>>>>> would  equal sequence length.
>>>>>
>>>>> #  If $end <= 0, we have either reached the end of
>>>>> #  file in <> or we have encountered some other error
>>>>> if ( $end <= 0 &&  ) {
>>>>> undef $aln;
>>>>> return $aln;
>>>>> }
>>> [haven't looked at the code, but] On the surface, this looks like  
>>> a  bit more
>>> responsibility than $end should be expected to handle, so I like   
>>> Jason's
>>> solution below better, which is only masquerading as a kludge.
>>>
>>>>>
>>>>> This start/end requirement of locatable seq is nice but kind of  
>>>>> a   pain where I am managing the map of sequences outside of   
>>>>> alignment  chunk.
>>>>
>>>> I faced this issue with Mauve parsing (Bio::SeqIO::xmfa).  I set  
>>>> it  up  so LocatableSeqs can now have all gaps, but the alphabet  
>>>> needs  to be  set (get a warning otherwise) and start and end  
>>>> need to be  initiated  to 0, which is how I believe Mauve defined  
>>>> these.   However, should 0-0  be a valid start/end for such a  
>>>> sequence?   Should we change that to  automatically allow start =  
>>>> end = X (any  position including 0) if a  sequence is all gaps or  
>>>> empty?
>>>
>>> I don't like any old start==end implying zero length, even under  
>>> the condition
>>> that the underlying sequence is empty, since in the "1-origin,   
>>> endpoints" model
>>> that pervades BP (as opposed to the "0-origin, length" model of,   
>>> say, substr),
>>> the pair ($start, $end) has the strong connotation of "the  
>>> residue  at $start"
>>> if $start==$end. (At least, it does now that we've fixed   
>>> LocatableSeq...)
>>
>> It could just as easily be undef (no start/end).  However, see  
>> below...
>>
>>> What if we consider 0 to be special, the 'sequence anchor', that   
>>> takes up no
>>> real space? I'm thinking of 'point' and 'mark' in emacs, that   
>>> actually point at
>>> the interstices between characters, and not the characters   
>>> themselves. Or \G
>>> for something perly. Then $start means not just the coordinate,  
>>> but  the
>>> 'space' before the residue at $start,  and $end means the 'space'   
>>> after the
>>> residue at $end. If $start==$end > 0, how many residues between   
>>> $start and $end?
>>> One. If $start == $end == 0, how many residues? None, because the   
>>> anchor
>>> is special, it doesn't take up residues.
>>>
>>> If a sequence is all gaps, what's its length? It has an  
>>> understood  anchor at 0,
>>> then the gap symbols are removed, so its length is the length of  
>>> the  anchor
>>> alone, which is zero, and $start = 0 is the 'space' before the   
>>> anchor, and
>>> $end = 0 is the 'space' after the anchor.
>>
>> Where this becomes sticky is taking alignment sections, a problem   
>> encountered when attempting to create interleaved output, for   
>> instance.  Let's say (for example) you have a long alignment that  
>> has  large gaps (think genome alignment ala MUMmer or Mauve).  If  
>> you take  a subsection of that alignment, you could very possibly  
>> end up with a  sequence that is all gaps.  Here is an example from  
>> Rfam (note  AE013109.1):
>>
>> AL596170.1/181975-181872  
>> AG................................................
>> BA000016.3/1419453-1419329  
>> UU..........................................AAUUAU
>> AE001437.1/2206791-2206643  
>> AGGAAUUAUUUUGAAGAACUUUAUAGACGUAGAA..........GAAAAA
>> BA000016.3/1419328-1419245  
>> GG................................................
>> AE013183.1/26-124 AG.............................................GAU
>> AE013109.1/12950-12813  
>> GAGA........................................AUAUAG
>> AE013109.1 / 13583  
>> -13495 ..................................................
>> AF044978.1/176-335 AAUC........................................AUGCAA
>> X84262.1/148-332 CAGCUCAAAAUCCAUAUAUAA.......................ACAAGA
>> CR954253.1/898444-898260  
>> UCUCUCAAAGUCUA..............................CUUAAA
>>
>> So let's say we take the slice above from the alignment and create  
>> a  new SimpleAlign.  We know the original coordinates for each   
>> LocatableSeq; should we override them and mark the start = end =  
>> 0?   Or would we want to have a sequence that is indicated as  
>> between the  end of the last known position and the beginning of  
>> the next  (500^501)?  This also doesn't take into account start/end  
>> gaps; maybe  for beginning gaps start = end = 0 and end gaps start  
>> = end = length.
>>
>> For no sequence at all (undef) start = end = undef as well.
>>
>>> Now this would mean say, $s->subseq(0, $n) eq $s->subseq(1, $n).  
>>> I  don't
>>> think this is too troubling, since 1) it's consistent with the   
>>> concept above, and
>>> 2) zeros would only show up when the empty sequences are  
>>> encountered.
>>
>> Okay.
>>
>>>> If we come up with some rough ideas of how to handle this we can   
>>>> add  some examples to the test suite and try getting  
>>>> LocatableSeq  to do the  right thing.  We can always mark them as  
>>>> TODO.
>>>>
>>>>> Why not just check to see that the number of seq characters is  
>>>>> 0  -  an all-gapped sequence as the last sequence of the file  
>>>>> should  still  be legal.
>>>>>
>>>>> Instead:
>>>>> if ( length($seqchar) == 0 ) {
>>>>> undef $aln;
>>>>> return $aln;
>>>>> }
>>>>>
>>>>> Although that would invalidate an empty alignment like this --  
>>>>> do  we  want to still permit these?
>>>>> >A
>>>>>
>>>>> >B
>>>>>
>>>>> >C
>>>>
>>>> These could be zero-length, empty seqs (start = end = undef).   
>>>> I   thought something was added to PrimarySeq recently for empty  
>>>> seqs.
>>>
>>> Maybe the above concept would encompass empty sequences (gapped
>>> or ungapped) without recourse to undef.
>>>
>>> cheers, Mark
>>
>> From this discussion I think it's worth coming up with some kind  
>> of  idea on what we expect LocatableSeq to do under various  
>> circumstances;  the wiki is a good place to start something up.  I  
>> don't think that  refactoring everything is necessary, but it might  
>> be worth putting out  some variations for consideration.
>>
>> For instance I have had the idea of making a new RangeI-based   
>> Bio::SeqI that would just delegate the RangeI methods to a  
>> 'source'  SeqFeature containing a Bio::LocationI.  The PrimarySeqI  
>> would be a  simplified LocatableSeq capable of dealing with subseq  
>> issues as  discussed previously, but simplified. In fact, it should  
>> probably  recognize any PrimarySeqI.
>>
>> Advantages: can add annotation and features specific for the  
>> sequence,  and (beyond simple decorated methods used to make access  
>> simple) it is  a Bio::Seq and could be persisted in a BioSQL.   
>> Disadvantages: lots of  objects and thus slower.
>>
>> chris
>>
>