[Biopython-dev] SeqIO and qual: Question about reading and writing qual files

Wed Mar 25 10:01:45 UTC 2009

On Tue, Mar 24, 2009 at 3:33 PM, Sebastian Bassi
<sbassi at clubdelarazon.org> wrote:
> On Tue, Mar 24, 2009 at 12:13 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> ....
>> characters using a DNA alphabet. What would you expect to get if you
>> used Bio.SeqIO to write out the file in FASTA format?  To my mind there
>> are two sensible options - write out the file using the "NNN....N"
>> sequence, or raise an error.
>
> "N" is OK (with the same length of the qual file), that is what ABI
> does when the QV is low. This is not the same case but I always think
> of "N" as "unknown".
> Raising an error is not bad because I don't see the need to go from a
> non-sequence qual to a fasta (it doesn't make sense). But that I don't
> see the need doesn't mean someone else may not have a reason.
> Best,

I've filed an enhancement request to add an UnknownSeq object, perhaps as
part of the Bio.Seq module, as Bug 2799:
http://bugzilla.open-bio.org/show_bug.cgi?id=2799

I've done an initial patch (which I plan to upload to Bugzilla), which is
available now on GitHub on a new branch:
http://github.com/peterjc/biopython/tree/bug2799-UnknownSeq

Note this doesn't do anything special (yet) when writing output files, so
they will by default record a string of whatever unknown sequence character
was used.

It would make sense for GenBank/EMBL in SeqIO to also take advantage of
the UnknownSeq object, because here the sequence is essentially optional
(consider files with just a CONTIG line), but should always have a length.
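
To make the idea concrete, here is a rough sketch of a sequence-like object
that stores only a length and an "unknown" character. This is NOT the code
on the branch: the class name LengthOnlySeq and the helper to_fasta are
made up for illustration, and the real UnknownSeq may well behave
differently (e.g. with respect to alphabets).

```python
class LengthOnlySeq:
    """Hypothetical stand-in for the proposed UnknownSeq (illustration only).

    Stores a length and a single placeholder character, not the letters.
    """

    def __init__(self, length, character="N"):
        if length < 0:
            raise ValueError("length must be non-negative")
        if len(character) != 1:
            raise ValueError("character must be a single letter")
        self._length = length
        self._character = character

    def __len__(self):
        return self._length

    def __str__(self):
        # The full string is only built on demand, e.g. when writing a file.
        return self._character * self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice of an unknown sequence is a shorter unknown sequence.
            new_length = len(range(*index.indices(self._length)))
            return LengthOnlySeq(new_length, self._character)
        if not -self._length <= index < self._length:
            raise IndexError("sequence index out of range")
        return self._character


def to_fasta(name, seq, width=60):
    """Hypothetical helper: format a record as FASTA text, wrapped at width."""
    text = str(seq)
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))


if __name__ == "__main__":
    seq = LengthOnlySeq(100)
    print(len(seq[10:20]))
    print(to_fasta("example", seq[:5]), end="")
```

With something like this, a qual-only record can still report its length and
be sliced, and a writer that just calls str() on the sequence would emit the
run of placeholder characters described above.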

Sebastian - could you have a quick play with this github code (using the new
UnknownSeq class), and the current CVS code (using None), and make sure
both support the slicing operations you were trying earlier?  Thanks.

Peter