[Biopython-dev] Releasing Biopython 1.62 this week?

Wed Aug 28 12:12:24 UTC 2013

Hi Peter, everyone,

On Tue, Aug 27, 2013 at 9:27 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Aug 27, 2013 at 7:45 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> Sounds good. Mind if I sneak in a quick update to the Phylo chapter of the
>> Tutorial to mention CDAO support?
>
> Go for it - I need to retest the DSSP unit test tomorrow anyway.
>
>> Also, has anything else noteworthy been added since the beta that we can
>> announce in the NEWS file?
>
> Minor bug fixes and more tests? Perhaps the PDB occupancy change?
>
> Peter

I don't like to believe in coincidences, but just last night a user
emailed me about an issue in SearchIO's exonerate parser which I feel
should be mentioned here (exchange attached with his permission). He
stumbled on an error where an exonerate output file is unparseable
because of split codon alignments. In short, I feel we should not lift
the BiopythonExperimentalWarning for the 1.62 release.

The issue is caused by protein-to-genome alignments in exonerate (in
the protein2genome alignment mode) that have split codons in them. When
split codons are present, SearchIO splits these HSPs into fragments,
each of which is basically a single contiguous sequence alignment.
These fragments have their own Seq objects (representing hit and query
sequences). The problem is that these Seq objects have to be full
sequences, and the query sequence fragment (protein) does not represent
a full sequence here (since the underlying codon is split).
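
To make the failure concrete, here is a minimal sketch of the kind of
three-letter to one-letter conversion involved. This is not the actual
SearchIO code: the THREE_TO_ONE table, the three_to_one function and the
example strings are all made up for illustration, and the real exonerate
output marks split codons differently.

```python
# Illustrative only; this is NOT the SearchIO implementation.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(aln):
    """Convert a three-letter protein string to one-letter codes."""
    # A whole protein fragment must have a length that is a multiple of
    # three; a fragment that ends inside a split codon breaks this.
    assert len(aln) % 3 == 0, \
        "length %d is not a multiple of three" % len(aln)
    return "".join(THREE_TO_ONE[aln[i:i + 3]]
                   for i in range(0, len(aln), 3))


print(three_to_one("MetLysGly"))  # whole residues only: prints MKG

try:
    # Hypothetical fragment whose last residue continues after an intron.
    three_to_one("MetLysG")
except AssertionError as err:
    print("cannot convert fragment: %s" % err)
```

The second call fails because the trailing "G" is only part of a
residue, which is the situation a fragment ending on a split codon ends
up in.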

Currently, SearchIO raises an AssertionError when this type of
alignment is found and simply says it cannot deal with it. This
should not remain the case, though. A test case was actually put up
for this (https://github.com/biopython/biopython/blob/master/Tests/Exonerate/exn_22_m_protein2genome.exn#L173).
However, since I have yet to find a way to properly represent these
fragments with Seq objects, the actual test has not been written (and
I missed this when doing the last review).

I have thought of several alternatives:

* I saw a ThreeLetterProtein Alphabet in
https://github.com/biopython/biopython/blob/master/Bio/Alphabet/__init__.py#L136,
maybe we could use this to create Seq objects that allow partial
codons?

* Change HSPFragment to not use full Seq objects anymore (which may
require some rework on the HSP objects as well)

But I have not explored them thoroughly. I should note that Zheng Ruan's
GSoC project on Codon alignments
(http://zruanweb.com/category/gsoc.html) may prove useful here as
well.
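
As a rough illustration of the second alternative, the sketch below
stores plain strings on each fragment and only converts the query once
the pieces are joined back together. Everything in it is hypothetical:
PlainFragment and join_query do not exist in Bio.SearchIO, the lookup
table is a small subset, and the example sequences are invented.

```python
# Hypothetical sketch; none of these names exist in Bio.SearchIO.
THREE_TO_ONE = {"Met": "M", "Lys": "K", "Gly": "G", "Thr": "T"}  # subset


class PlainFragment(object):
    """Stand-in for a fragment that keeps raw strings, not Seq objects."""

    def __init__(self, query, hit):
        self.query = query  # three-letter protein text, may end mid-residue
        self.hit = hit      # nucleotide text


def join_query(fragments):
    """Join fragment queries, then convert to one-letter codes."""
    joined = "".join(frag.query for frag in fragments)
    if len(joined) % 3 != 0:
        raise ValueError("joined query still ends inside a residue")
    return "".join(THREE_TO_ONE[joined[i:i + 3]]
                   for i in range(0, len(joined), 3))


# "Gly" (GGA) is split across the two fragments: "G" + "ly", "G" + "GA".
frags = [PlainFragment("MetLysG", "ATGAAAG"),
         PlainFragment("lyThr", "GAACC")]
print(join_query(frags))  # prints MKGT
```

The trade-off is that each fragment on its own no longer carries a
valid protein sequence, so anything that expects a Seq per fragment
would need to work at the HSP level instead.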

While I don't expect the issue to pop up often (it shows up only when
exonerate is used with the protein2genome mode, out of the many modes
it has, and when the alignment hits a split codon), I feel it should
be discussed (or at least mentioned) here first, since dealing with
the issue may require some more reworking.

So I'm sorry for the late warning and for missing this. I hope this is
not too late :).

Best,
Bow
-------------- next part --------------
On Wed, Aug 28, 2013 at 10:31 AM, Wibowo Arindrarto <w.arindrarto at gmail.com> wrote:
> Hi Somak,
>
>> Do you have any idea whether Bioperl based Exonerate parser can handle such cases?
>> I'm yet to try Bioperl.
>
> I tried your file with Bioperl's parser, and while it can parse the
> entire file without errors, I don't know whether all the information
> in the file (sequence, sequence coordinates) is parsed properly. But
> maybe that's just me being less familiar with Bioperl. I suggest
> posting to their mailing list
> (http://lists.open-bio.org/pipermail/bioperl-l/) or searching the list
> archive if you have any questions regarding this. The library also
> has an active community behind it.
>
>> And please feel free to forward this mail to Biopythonlist or any other discussion forum you
>> think is appropriate,
>
> Ok, thanks :).
>
>> Thanks again
>>
>> Somak Ray
>
> Best,
> Bow
>
>> ________________________________________
>> From: w.arindrarto at gmail.com [w.arindrarto at gmail.com] on behalf of Wibowo Arindrarto [bow at bow.web.id]
>> Sent: Tuesday, August 27, 2013 8:01 PM
>> To: Ray, Somak
>> Subject: Re: On parsing of exonerate output
>>
>> Hi Somak,
>>
>>> Dear Dr. Arindrarto,
>>>
>>> I came across your blog about parsing outputs from Exonerate. I have
>>> generated some files using Exonerate's protein2dna model. However, when
>>> running your scripts on them I'm getting an assertion error in Python 2.7.
>>> I'm attaching two such Exonerate outputs. The "Result_goodfile.txt" can
>>> be parsed by the parser whereas "Result_badfile.txt" can't be parsed.
>>>
>>> Please let me know if there's any solution to the problem.
>>>
>>> Thanks in advance
>>
>> Hmm... looking at the files, it seems that this is caused by a split
>> codon in the alignment (Results_badfile.txt, line 25). The problem is,
>> the three-letter amino acid sequence needs to be translated into a
>> single-letter amino acid sequence since Biopython cannot create Seq
>> objects with three-letter amino acid codes. However, this conversion
>> means that codons that span introns (such as the one on line 25) cannot
>> be dealt with properly since a single fragment expects a full Seq
>> object (hence the error you're seeing; it expects the three-letter
>> amino acid sequence length to be a multiple of three).
>>
>> So the short answer is no, there is not yet an immediate solution to this issue.
>>
>> I should mention that this came at an appropriate time, though, so
>> thanks for the email :). I am reviewing known SearchIO issues and this
>> was apparently an issue that I had lost track of (looking at the
>> test suite, there is a test case for this but it has not actually been
>> included in the tests).
>>
>> Do you mind if I forward this email to the Biopython list
>> (http://biopython.org/wiki/Mailing_lists)? I think other developers /
>> users may be interested in this.
>>
>> Best,
>> Bow