[Bioperl-l] Announcing Bio::SFF

Mon Dec 19 19:44:22 UTC 2011

I'm joining this a little late, but I think the best option would be a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures.  Barring that, a very simple class for storing the data would do.  We've found BioPerl objects/classes pretty heavy.

(For an example of this, see Heng Li's readfq parser on GitHub, which has some stats for FASTQ/FASTA parsing.)

Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.  

For instance, someone who wanted a faster parser could use the low-level layer; everyone else could use the higher-level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp and non-bp code in separate dists.
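
To make the split concrete, here is a minimal sketch in Python, using FASTA because it is short to parse. The names parse_fasta, SeqRecord and read_records are made up for illustration and are not part of BioPerl, Biopython or readfq. The low-level layer yields plain dicts; the object layer is a thin wrapper that can be swapped out or skipped entirely.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Low-level layer: yield one plain dict per record, no objects."""
    name, desc, chunks = None, "", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield {"name": name, "desc": desc, "seq": "".join(chunks)}
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line.strip())
    # Emit the final record once the input is exhausted.
    if name is not None:
        yield {"name": name, "desc": desc, "seq": "".join(chunks)}


@dataclass
class SeqRecord:
    """Higher-level layer: a hypothetical, deliberately small record class."""
    name: str
    desc: str
    seq: str

    def __len__(self) -> int:
        return len(self.seq)


def read_records(lines: Iterable[str]) -> Iterator[SeqRecord]:
    """Object layer built on top of the parser, not inside it."""
    for rec in parse_fasta(lines):
        yield SeqRecord(**rec)
```

Callers who only need speed iterate over parse_fasta() directly; callers who want objects use read_records(), and either layer can be optimized without touching the other.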

Chris

On Dec 19, 2011, at 11:05 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
> <l.m.timmermans at students.uu.nl> wrote:
>> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>> 
>>> Could you add a link to your /corpus/README.txt file pointing
>>> back to the Biopython original for acknowledgement and
>>> future reference?
>> 
>> I forgot about that, I will add it to the next release.
> 
> Thanks.
> 
>>> Are you doing just SFF parsing for now? Not writing?
>> 
>> 
>> I haven't written the writer yet (haven't needed it so far). I'd rather
>> release working code early instead of waiting until everything is complete.
> 
> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
> 
>>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>>> format name "sff" to mean the full read sequence (with mixed
>>> case, upper case for the good sequence, lower case for any
>>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>>> the trimmed sequences. I would encourage you to do the
>>> same, as part of the general aim of having consistent
>>> sequence format names between BioPerl, Biopython, and
>>> EMBOSS, where possible.
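
The difference between the two format names can be sketched in a few lines of Python. This is only an illustration of the convention described above, not Biopython's implementation: full_read and trimmed_read are hypothetical helpers, and the clip bounds are assumed to be 0-based and half-open, which is a simplification and not necessarily how the SFF clip fields are encoded.

```python
def full_read(seq: str, left: int, right: int) -> str:
    """"sff"-style output: whole read, clipped ends lower case.

    left and right are assumed to be 0-based, half-open bounds of the
    good region (an assumption of this sketch).
    """
    return seq[:left].lower() + seq[left:right].upper() + seq[right:].lower()


def trimmed_read(seq: str, left: int, right: int) -> str:
    """"sff-trim"-style output: only the good region, upper case."""
    return seq[left:right].upper()


if __name__ == "__main__":
    read = "TCAGACGTACGTNN"
    print(full_read(read, 4, 12))
    print(trimmed_read(read, 4, 12))
```
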
>> 
>> I agree, consistency is good.
> 
> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
> 
> Peter