[Bioperl-l] alignIO::fasta bug

Mark A. Jensen maj at fortinbras.us
Fri Jan 23 04:54:28 UTC 2009


couple of thoughts...

----- Original Message ----- 
From: "Chris Fields" <cjfields at illinois.edu>
To: "Jason Stajich" <jason at bioperl.org>
Cc: "BioPerl List" <bioperl-l at bioperl.org>
Sent: Thursday, January 22, 2009 10:54 PM
Subject: Re: [Bioperl-l] alignIO::fasta bug


>
> On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:
>
>> FYI - it appears that if the last sequence in a FASTA MSA is all  gaps (which 
>> happens in some whole genome synteny+multiple alignment  chunking like 
>> Mercator) no alignment is returned.
>>
>> yuck. It basically comes down to this bit of code where $end would  equal 
>> sequence length.
>>
>> #  If $end <= 0, we have either reached the end of
>> #  file in <> or we have encountered some other error
>> if ( $end <= 0 &&  ) {
>> undef $aln;
>> return $aln;
>> }
[haven't looked at the code, but] On the surface, this looks like a bit more
responsibility than $end should be expected to handle, so I like Jason's
solution below better, which is only masquerading as a kludge.

>>
>> This start/end requirement of locatable seq is nice but kind of a  pain where 
>> I am managing the map of sequences outside of alignment  chunk.
>
> I faced this issue with Mauve parsing (Bio::SeqIO::xmfa).  I set it up  so 
> LocatableSeqs can now have all gaps, but the alphabet needs to be  set (get a 
> warning otherwise) and start and end need to be initiated  to 0, which is how 
> I believe Mauve defined these.  However, should 0-0  be a valid start/end for 
> such a sequence?  Should we change that to  automatically allow start = end = 
> X (any position including 0) if a  sequence is all gaps or empty?

I don't like any old start==end implying zero length, even under the condition
that the underlying sequence is empty, since in the "1-origin, endpoints" model
that pervades BP (as opposed to the "0-origin, length" model of, say, substr),
the pair ($start, $end) has the strong connotation of "the residue at $start"
if $start==$end. (At least, it does now that we've fixed LocatableSeq...)

What if we consider 0 to be special, the 'sequence anchor', that takes up no
real space? I'm thinking of 'point' and 'mark' in emacs, that actually point at
the interstices between characters, and not the characters themselves. Or \G
for something perly. Then $start means not just the coordinate, but the
'space' before the residue at $start,  and $end means the 'space' after the
residue at $end. If $start==$end > 0, how many residues between $start and $end?
One. If $start == $end == 0, how many residues? None, because the anchor
is special, it doesn't take up residues.

If a sequence is all gaps, what's its length? It has an understood anchor at 0,
then the gap symbols are removed, so its length is the length of the anchor
alone, which is zero, and $start = 0 is the 'space' before the anchor, and
$end = 0 is the 'space' after the anchor.

Now this would mean say, $s->subseq(0, $n) eq $s->subseq(1, $n). I don't
think this is too troubling, since 1) it's consistent with the concept above, 
and
2) zeros would only show up when the empty sequences are encountered.

>
> If we come up with some rough ideas of how to handle this we can add  some 
> examples to the test suite and try getting LocatableSeq to do the  right 
> thing.  We can always mark them as TODO.
>
>> Why not just check to see that the number of seq characters is 0 -  an 
>> all-gapped sequence as the last sequence of the file should still  be legal.
>>
>> Instead:
>> if ( length($seqchar) == 0 ) {
>> undef $aln;
>> return $aln;
>> }
>>
>> Although that would invalidate an empty alignment like this -- do we  want to 
>> still permit these?
>> >A
>>
>> >B
>>
>> >C
>
> These could be zero-length, empty seqs (start = end = undef).  I  thought 
> something was added to PrimarySeq recently for empty seqs.

Maybe the above concept would encompass empty sequences (gapped
or ungapped) without recourse to undef.

cheers, Mark

>
>> Also I've locally implemented possibility of parsing start/end from  the 
>> header line that is part of Mercator output and I think a  variant of UCSC 
>> headers that
>> look like this for MFA.
>> >Cp scaffold_1.3086:644980-660265+
>
> Is there a way we can have a callback option for parsing out the  data?  e.g. 
> pass in everything after '>' and the LocatableSeq  instance, the callback 
> parses the string for whatever info and sets  the LocatableSeq attributes 
> accordingly.  We could default to some  built-in coderefs for common regexes 
> if a callback isn't defined.
>
>> I can commit this to main-trunk so as to not interfere with branch  release. 
>> Do any of the LocatableSeq/AlignIO::Fasta maintainers want  to comment?
>>
>> -jason
>
> +1.  I think we should incorporate this into 1.6.1 along with any  other 
> LocatableSeq fixes and tests.
>
> chris
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> 




More information about the Bioperl-l mailing list