[Biopython-dev] Bio.Sequencing

Mon Jun 29 14:11:15 UTC 2009

Hi Cymon,

I've already checked in part of your patch on Bug 2865,
the part recording the per-letter-annotation. I had been
planning to do that but hadn't got round to it yet, so thank you:
http://bugzilla.open-bio.org/show_bug.cgi?id=2865

This means with the latest code you can now use Biopython
to convert a PHD output file into a FASTQ file (or a QUAL
file) which could be handy for doing meta assemblies.
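
As a rough illustration, the conversion might look something like
the sketch below. The helper name and any file names you pass in are
made up for the example; it relies only on Bio.SeqIO.parse and
Bio.SeqIO.write with the "phd" and "fastq" format names.

```python
def phd_to_fastq(phd_path, fastq_path):
    """Convert a PHD file to FASTQ; returns the number of records written.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining this helper does not itself
    # require Biopython to be installed.
    from Bio import SeqIO

    with open(phd_path) as in_handle, open(fastq_path, "w") as out_handle:
        # Each PHD record carries its PHRED scores as per-letter-annotation.
        records = SeqIO.parse(in_handle, "phd")
        # Use "qual" instead of "fastq" here to write a QUAL file.
        return SeqIO.write(records, out_handle, "fastq")
```

Something like phd_to_fastq("reads.phd", "reads.fastq") would then
write one FASTQ entry per read in the PHD file.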

I recently updated the Bio.SeqIO Ace parser to record the
qualities, but there is an issue here. Only the nucleotides
are given quality scores; the insertions (gaps, shown as "*"
in the Ace file consensus sequence) are not.
Currently the Bio.SeqIO parser gives the gapped sequence.
This means that to record the quality scores, we need to give
some null value to the gap characters (I used None).

What I am wondering about is making the Bio.SeqIO Ace
parser just return the ungapped sequence (and the
associated PHRED quality scores). This means we could
then convert Ace files into FASTQ or QUAL files, and also
a simple Ace to FASTA conversion would give something
useful for downstream analysis (the ungapped consensus).
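
To make the idea concrete, here is a minimal pure-Python sketch of
the ungapping step. It is not the parser code itself; the function
name and the example data are invented, and it assumes the gapped
consensus uses "*" for gaps with None as the matching quality.

```python
def ungap_consensus(gapped_seq, gapped_quals, gap_char="*"):
    """Drop gap characters and their placeholder quality values.

    Hypothetical helper: takes the gapped consensus string and a
    same-length list of qualities (None at gap positions) and returns
    the ungapped sequence with its list of PHRED scores.
    """
    if len(gapped_seq) != len(gapped_quals):
        raise ValueError("Sequence and quality list differ in length")
    letters = []
    quals = []
    for base, qual in zip(gapped_seq, gapped_quals):
        if base == gap_char:
            # Gaps have no quality score, so skip both entries.
            continue
        letters.append(base)
        quals.append(qual)
    return "".join(letters), quals


# Made-up example: one gap in the middle of the consensus.
seq, quals = ungap_consensus("AC*GT", [30, 20, None, 40, 10])
```

After this, every remaining letter has an integer quality, which is
what the FASTQ and QUAL writers would need.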

The gaps *are* important if you want to see how the
consensus was built up - in which case it makes sense to
think about each Ace contig as a kind of multiple sequence
alignment. See this earlier discussion with David Winter:
http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html
http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html

Any thoughts?

Peter