[Biopython-dev] sff file

Thu Oct 29 14:09:07 UTC 2009

On Thu, Oct 29, 2009 at 2:02 PM, Sebastian Bassi
<sbassi at clubdelarazon.org> wrote:
>
> On Tue, Oct 27, 2009 at 8:50 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> As part of the polishing in anticipation of merging the SFF support
>> into the trunk, I've just made some big additions to the docstring
>> (with doctest examples) on the branch - it would be great if you
>> could read over this at some point.
>> http://github.com/peterjc/biopython/tree/sff-seqio
>
> I've read it (you mean the code in SffIO.py). Regarding your questions:

I meant the docstrings in Bio/SeqIO/SffIO.py (i.e. the comments
which get exposed as the API help).

>> What do you think of the current rather pragmatic way I'm
>> handling trimming the SeqRecord objects? i.e. SeqIO file format
>> "sff" gives the full data and supports reading and writing, while
>> SeqIO format "sff-trim" only supports reading and gives trimmed
>> sequences without the flow data. This is a bit of a hack, and the
>> "sff-trim" format could be left out - but then we would need a nice
>> way to trim the full length SeqRecord objects...
>
> sff-trim is OK for me, but I am not familiar with this format. I see
> there are some mixed upper and lower case DNA sequences, why?
> Are the lower case bases of lower quality (like both ends of a
> standard read)?

Yes, they are in mixed case, and this is linked to the quality and
adaptor sequences. I tried to explain this in the SffIO docstring
(near the top of Bio/SeqIO/SffIO.py) with examples and the
following text:

>> ... Notice that the sequence is given in mixed case, [there] the
>> central upper case region corresponds to the trimmed sequence.
>> This matches the output of the Roche tools (and the 3rd party tool
>> sff_extract) for SFF to FASTA.

I think I need to remove the word "there" from that paragraph ;)
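
To make the mixed case convention concrete, here is a minimal sketch in
plain Python (no Biopython needed) that pulls the central upper case
region out of a mixed case read. The function name and the example read
are made up for illustration, and it assumes there is a single upper case
run with lower case flanks; it is not how SffIO itself does the trimming.

```python
import re


def trim_to_upper_region(seq):
    """Return (start, end, trimmed) for the first upper case run in seq.

    Hypothetical helper: it infers the trimmed region purely from the
    letter case, assuming one central upper case block.
    """
    match = re.search(r"[A-Z]+", seq)
    if match is None:
        # No upper case bases at all, so nothing survives trimming.
        return 0, 0, ""
    return match.start(), match.end(), match.group()


# Made-up read: lower case flanks around an upper case centre.
read = "tcagGGTCTACATGTTGGTTAACCnnnn"
start, end, trimmed = trim_to_upper_region(read)
print(start, end, trimmed)
```

The start and end values are Python style slice coordinates, so
read[start:end] gives back the same trimmed string.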

Peter