[Biopython-dev] SeqIO Abi Parser

Sat Jul 30 07:42:04 UTC 2011

Hi Peter,

I've made some more improvements to the code:

- I've written the check and a unit test for the file handle mode. An ABI file
now has to be opened in 'rb' mode; otherwise the parser raises an error. While
opening in 'r' mode works in Python 2 on Linux, 'rb' is required on Windows
and/or in Python 3 for the file to be read correctly, so I decided that
enforcing 'rb' is the best option. Because of this, I changed
'test_SeqIO.py:503' to pass the mode argument when opening the file.

- I've also checked the output against seqret in test_Emboss.py, after adding
the abi format to it. My EMBOSS version is 6.4.0. There was a slight problem
with this test: for some reason the ID returned by seqret is always
"EMBOSS_001". Something might be wrong with my EMBOSS installation, since when
I previously tested against 6.1.0 the ID was correct (although the quality
values were not, so I had to upgrade). As expected, if I comment out the code
that tests the sequence ID ('test_Emboss.py:168-172'), the tests pass. Could
you try testing it as well and see whether your EMBOSS also returns the
default ID instead of the sample name?

- Finally, I made some small cosmetic changes to the code (typos, etc.).
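
A minimal sketch of the kind of handle-mode check described in the first
point above, assuming Python 3. The helper name `require_binary_handle` and
the error messages are hypothetical and are not taken from the code in the
fork; the real check may well differ.

```python
import io


def require_binary_handle(handle):
    """Return handle unchanged, or raise ValueError if it is a text-mode handle.

    Hypothetical helper for illustration only (assumes Python 3).
    """
    # Regular file objects expose the mode string they were opened with.
    mode = getattr(handle, "mode", None)
    if isinstance(mode, str) and "b" not in mode.lower():
        raise ValueError(
            "ABI files must be opened in binary mode, e.g. open(path, 'rb'); "
            "got mode %r" % mode
        )
    # In-memory handles such as io.BytesIO / io.StringIO have no mode
    # attribute, so also check what a zero-length read returns.
    if not isinstance(handle.read(0), bytes):
        raise ValueError("ABI files must be read from a binary handle")
    return handle


if __name__ == "__main__":
    # A binary in-memory handle is accepted.
    require_binary_handle(io.BytesIO(b"ABIF"))
    # A text handle is rejected.
    try:
        require_binary_handle(io.StringIO("ABIF"))
    except ValueError as err:
        print("rejected:", err)
```

A parser could call a helper like this before reading the first bytes, so
that a handle opened in text mode fails early with a clear message rather
than yielding corrupted data.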

All changes have been pushed to my GitHub fork. I still have time over the
weekend to improve whatever needs improving :).

Regards,
---
Wibowo Arindrarto (bow)
http://bow.web.id

On Fri, Jul 29, 2011 at 18:20, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Hi again,
>
> I had a bit of time this afternoon so I looked at this.
>
> On Fri, Jul 29, 2011 at 1:14 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> > On Fri, Jul 29, 2011 at 12:34 PM, Wibowo Arindrarto wrote:
> >> Hi Peter,
> >> Thanks for explaining. I understand why we should stick to the stored
> >> sequence id. In this case, we can use the filename as SeqRecord.name as
> >> well. Regarding BioPerl, I don't have it installed myself -- but I took a
> >> quick look at their source and it seems they also use the stored sequence ID
> >> as their main identifier instead of the filename. If the stored sequence ID
> >> is not present, it's "(unknown)" in their case.
> >
> > OK good, that means Biopython, BioPerl and EMBOSS should be
> > consistent :)
>
> I've made that switch.
>
> >> I'll look at test_SeqIO.py over the weekend. I think it'll have
> >> something to do with some ambiguous DNA bases stored in the abi files.
> >> Regards,
> >
> > Some of the alphabet stuff is a bit nasty - so please feel free to ask
> > or get me to help.
>
> I've done enough to get the test_SeqIO.py unit test to pass.
>
> We probably need a check (like in SFF) to make sure the user hasn't given
> a handle opened in text mode. That should probably have a unit test
> too.
>
> I still haven't cross checked the sequence and PHRED scores from
> your code and EMBOSS.
>
> Anyway - I'll leave the code for you to work on for now...
>
> Peter
>