[Biopython-dev] SeqIO Abi Parser

Wibowo Arindrarto w.arindrarto at gmail.com
Thu Jul 7 01:16:13 UTC 2011


Hi everyone,

This is my first post in the dev mailing list, so greetings :).

I've been using Biopython for a few months in total (spread over ~1.5
years), and Python for ~0.5 years before that. Most of the time I work
with Sanger sequencing results, and at one point I was a bit disappointed
that I couldn't find any (Bio)Python module for reading .ab1 files. That
compelled me to write my first Python module, which reads those files and
extracts the useful information from them. In the process I became more
interested in Python itself and finally thought it might be neat if
Biopython could do this out of the box.

So I forked the main repo and adapted my module into a parser for the
SeqIO submodule that reads Abi files. It's not 100% cooked yet, but if
anyone is interested in seeing/commenting on/criticizing the code, I'd
appreciate that very much. Here's the link:
https://github.com/bow/biopython/blob/seqio-abif/Bio/SeqIO/AbiIO.py

Some features to note:
- I've included a method to trim the sequence based on its quality scores.
- The parser does not extract all of the metadata in the trace files, only
the items I consider important for further analysis/annotation. Of course,
this could be changed if the community thinks some other data should be
included/excluded.
- For those of you already familiar with the Abi format, I deliberately
chose the 'PBAS2' tag for the sequence information, which holds the unedited
bases as called by the sequencing program's base-caller.
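
To make the quality-trimming idea concrete, here is a minimal sketch. This
post does not say which trimming method the parser uses, so the sketch
assumes a Mott-style approach: score each base as a cutoff minus its error
probability, then keep the highest-scoring contiguous stretch. The function
names (mott_trim, trim) and the 0.05 cutoff are hypothetical and are not
taken from AbiIO.py.

```python
def mott_trim(quals, cutoff=0.05):
    """Return (start, end) of the best-scoring slice of Phred qualities.

    Assumption: Mott-style scoring, i.e. each base contributes
    cutoff - P(error), where P(error) = 10 ** (-Q / 10).
    """
    best_sum = 0.0
    best_start = best_end = 0
    running = 0.0
    start = 0
    for i, q in enumerate(quals):
        # Positive when the base's error probability is below the cutoff.
        running += cutoff - 10 ** (-q / 10.0)
        if running <= 0:
            # The current stretch is not worth keeping; restart after this base.
            running = 0.0
            start = i + 1
        elif running > best_sum:
            best_sum = running
            best_start, best_end = start, i + 1
    return best_start, best_end


def trim(seq, quals, cutoff=0.05):
    """Trim a sequence and its qualities to the best-scoring stretch."""
    start, end = mott_trim(quals, cutoff)
    return seq[start:end], quals[start:end]
```

For a read with low-quality ends, e.g. trim("ACGTACGT", [5, 5, 40, 40, 40,
40, 5, 5]), this keeps only the four middle bases. A read with no
acceptable stretch comes back empty.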

Some things that I'm doing right now:
- writing unit tests
- making sure it's compatible with Python 3 (thanks Peter :)! )
- completing the docs
- making sure it's compatible with most Abi format versions. Currently I've
only tested it with files from the 310, 3100, and 3700 machines. Does anyone
have some other versions that I can test this with?

As I understand it, Abi is not the only Sanger sequencing trace format
out there (SCF is another). I would be glad to learn more and write a
parser for the SCF format as well. The problem is, I'm not sure this would
be useful in the long run: I've personally never seen anyone use an SCF
file, so I've never had a chance to play around with one. If anyone has
an SCF file lying around and thinks SCF support would be beneficial, I'd be
happy to take it :).

I guess that's all for now. Thanks for reading!

---
Wibowo Arindrarto (bow)
http://bow.web.id


