[Bioperl-l] alignIO::fasta bug
Mark A. Jensen
maj at fortinbras.us
Mon Jan 26 03:10:08 UTC 2009
Chris--
Yes, I see what you mean. Trickier than I thought. Sounds like its time to
'mature' LocatableSeq.
Wiki's a great place for that-
MAJ
----- Original Message -----
From: "Chris Fields" <cjfields at illinois.edu>
To: "Mark A. Jensen" <maj at fortinbras.us>
Cc: "Jason Stajich" <jason at bioperl.org>; "BioPerl List" <bioperl-l at bioperl.org>
Sent: Saturday, January 24, 2009 12:25 AM
Subject: Re: [Bioperl-l] alignIO::fasta bug
>
> On Jan 22, 2009, at 10:54 PM, Mark A. Jensen wrote:
>
>> couple of thoughts...
>>
>> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
>> >
>> To: "Jason Stajich" <jason at bioperl.org>
>> Cc: "BioPerl List" <bioperl-l at bioperl.org>
>> Sent: Thursday, January 22, 2009 10:54 PM
>> Subject: Re: [Bioperl-l] alignIO::fasta bug
>>
>>> On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:
>>>
>>>> FYI - it appears that if the last sequence in a FASTA MSA is all gaps
>>>> (which happens in some whole genome synteny+multiple alignment chunking
>>>> like Mercator) no alignment is returned.
>>>>
>>>> yuck. It basically comes down to this bit of code where $end would equal
>>>> sequence length.
>>>>
>>>> # If $end <= 0, we have either reached the end of
>>>> # file in <> or we have encountered some other error
>>>> if ( $end <= 0 && ) {
>>>> undef $aln;
>>>> return $aln;
>>>> }
>> [haven't looked at the code, but] On the surface, this looks like a bit more
>> responsibility than $end should be expected to handle, so I like Jason's
>> solution below better, which is only masquerading as a kludge.
>>
>>>>
>>>> This start/end requirement of locatable seq is nice but kind of a pain
>>>> where I am managing the map of sequences outside of alignment chunk.
>>>
>>> I faced this issue with Mauve parsing (Bio::SeqIO::xmfa). I set it up so
>>> LocatableSeqs can now have all gaps, but the alphabet needs to be set (get
>>> a warning otherwise) and start and end need to be initiated to 0, which is
>>> how I believe Mauve defined these. However, should 0-0 be a valid
>>> start/end for such a sequence? Should we change that to automatically
>>> allow start = end = X (any position including 0) if a sequence is all gaps
>>> or empty?
>>
>> I don't like any old start==end implying zero length, even under the
>> condition
>> that the underlying sequence is empty, since in the "1-origin, endpoints"
>> model
>> that pervades BP (as opposed to the "0-origin, length" model of, say,
>> substr),
>> the pair ($start, $end) has the strong connotation of "the residue at
>> $start"
>> if $start==$end. (At least, it does now that we've fixed LocatableSeq...)
>
> It could just as easily be undef (no start/end). However, see below...
>
>> What if we consider 0 to be special, the 'sequence anchor', that takes up no
>> real space? I'm thinking of 'point' and 'mark' in emacs, that actually point
>> at
>> the interstices between characters, and not the characters themselves. Or \G
>> for something perly. Then $start means not just the coordinate, but the
>> 'space' before the residue at $start, and $end means the 'space' after the
>> residue at $end. If $start==$end > 0, how many residues between $start and
>> $end?
>> One. If $start == $end == 0, how many residues? None, because the anchor
>> is special, it doesn't take up residues.
>>
>> If a sequence is all gaps, what's its length? It has an understood anchor at
>> 0,
>> then the gap symbols are removed, so its length is the length of the anchor
>> alone, which is zero, and $start = 0 is the 'space' before the anchor, and
>> $end = 0 is the 'space' after the anchor.
>
> Where this becomes sticky is taking alignment sections, a problem encountered
> when attempting to create interleaved output, for instance. Let's say (for
> example) you have a long alignment that has large gaps (think genome
> alignment ala MUMmer or Mauve). If you take a subsection of that alignment,
> you could very possibly end up with a sequence that is all gaps. Here is an
> example from Rfam (note AE013109.1):
>
> AL596170.1/181975-181872
> AG................................................
> BA000016.3/1419453-1419329
> UU..........................................AAUUAU
> AE001437.1/2206791-2206643
> AGGAAUUAUUUUGAAGAACUUUAUAGACGUAGAA..........GAAAAA
> BA000016.3/1419328-1419245
> GG................................................
> AE013183.1/26-124
> AG.............................................GAU
> AE013109.1/12950-12813
> GAGA........................................AUAUAG
> AE013109.1 / 13583 -13495 ..................................................
> AF044978.1/176-335
> AAUC........................................AUGCAA
> X84262.1/148-332
> CAGCUCAAAAUCCAUAUAUAA.......................ACAAGA
> CR954253.1/898444-898260
> UCUCUCAAAGUCUA..............................CUUAAA
>
> So let's say we take the slice above from the alignment and create a new
> SimpleAlign. We know the original coordinates for each LocatableSeq; should
> we override them and mark the start = end = 0? Or would we want to have a
> sequence that is indicated as between the end of the last known position and
> the beginning of the next (500^501)? This also doesn't take into account
> start/end gaps; maybe for beginning gaps start = end = 0 and end gaps start =
> end = length.
>
> For no sequence at all (undef) start = end = undef as well.
>
>> Now this would mean say, $s->subseq(0, $n) eq $s->subseq(1, $n). I don't
>> think this is too troubling, since 1) it's consistent with the concept
>> above, and
>> 2) zeros would only show up when the empty sequences are encountered.
>
> Okay.
>
>>> If we come up with some rough ideas of how to handle this we can add some
>>> examples to the test suite and try getting LocatableSeq to do the right
>>> thing. We can always mark them as TODO.
>>>
>>>> Why not just check to see that the number of seq characters is 0 - an
>>>> all-gapped sequence as the last sequence of the file should still be
>>>> legal.
>>>>
>>>> Instead:
>>>> if ( length($seqchar) == 0 ) {
>>>> undef $aln;
>>>> return $aln;
>>>> }
>>>>
>>>> Although that would invalidate an empty alignment like this -- do we want
>>>> to still permit these?
>>>> >A
>>>>
>>>> >B
>>>>
>>>> >C
>>>
>>> These could be zero-length, empty seqs (start = end = undef). I thought
>>> something was added to PrimarySeq recently for empty seqs.
>>
>> Maybe the above concept would encompass empty sequences (gapped
>> or ungapped) without recourse to undef.
>>
>> cheers, Mark
>
> From this discussion I think it's worth coming up with some kind of idea on
> what we expect LocatableSeq to do under various circumstances; the wiki is a
> good place to start something up. I don't think that refactoring everything
> is necessary, but it might be worth putting out some variations for
> consideration.
>
> For instance I have had the idea of making a new RangeI-based Bio::SeqI that
> would just delegate the RangeI methods to a 'source' SeqFeature containing a
> Bio::LocationI. The PrimarySeqI would be a simplified LocatableSeq capable
> of dealing with subseq issues as discussed previously, but simplified. In
> fact, it should probably recognize any PrimarySeqI.
>
> Advantages: can add annotation and features specific for the sequence, and
> (beyond simple decorated methods used to make access simple) it is a Bio::Seq
> and could be persisted in a BioSQL. Disadvantages: lots of objects and thus
> slower.
>
> chris
>
>
More information about the Bioperl-l
mailing list