[Bioperl-l] Announcing Bio::SFF
Fields, Christopher J
cjfields at illinois.edu
Mon Dec 19 19:44:22 UTC 2011
Kinda joining this a little late, but I think the best option would be a low-level parser/writer that generically parses the data into simple (possibly hash-based) data structures. Barring that, a very simple class for storing the data would do. We've found BioPerl objects/classes pretty heavy.
(For an example of this, see Heng Li's readfq parser on GitHub, which has some stats for FASTQ/FASTA parsing.)
Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.
For instance, if someone wanted a faster parser, they could use the low-level one; otherwise they could use the higher-level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp and non-bp code in separate dists.
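
A minimal sketch of that layering, in Python for brevity and using four-line FASTQ rather than SFF to keep it short. All names here (`parse_fastq`, `format_fastq`, `SeqRecord`, `read_records`) are hypothetical and do not come from BioPerl, Bio::SFF, or readfq. The low-level layer yields plain dicts and can write them back out; the object layer is a thin wrapper that callers only pay for if they ask for it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


def parse_fastq(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Low-level layer: yield one plain dict per four-line FASTQ record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"malformed record {header!r}")
        name, _, desc = header[1:].partition(" ")
        yield {"id": name, "desc": desc, "seq": seq, "qual": qual}


def format_fastq(rec: Dict[str, str]) -> str:
    """Low-level writer: rebuild the four-line text from a parsed dict.

    The text after '+' is not stored, so only files with a bare '+'
    separator line are reproduced exactly.
    """
    header = rec["id"] + (" " + rec["desc"] if rec["desc"] else "")
    return f"@{header}\n{rec['seq']}\n+\n{rec['qual']}\n"


@dataclass
class SeqRecord:
    """Higher-level layer (hypothetical class), built only on request."""
    id: str
    desc: str
    seq: str
    qual: str

    def phred(self) -> List[int]:
        # Assumes Sanger-style quality encoding (ASCII offset 33).
        return [ord(c) - 33 for c in self.qual]


def read_records(lines: Iterable[str]) -> Iterator[SeqRecord]:
    """Object API: wraps each dict produced by the low-level parser."""
    return (SeqRecord(**d) for d in parse_fastq(lines))
```

Someone who only needs speed would iterate over `parse_fastq` directly, while `read_records` gives the richer interface on top of the same parser, so the two layers can be optimized (or shipped) separately.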
Chris
On Dec 19, 2011, at 11:05 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:
> On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
> <l.m.timmermans at students.uu.nl> wrote:
>> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>>
>>> Could you add a link to your /corpus/README.txt file pointing
>>> back to the Biopython original for acknowledgement and
>>> future reference?
>>
>> I forgot about that, I will add it to the next release.
>
> Thanks.
>
>>> Are you doing just SFF parsing for now? Not writing?
>>
>>
>> I haven't written the writer yet (haven't needed it so far). I'd rather
>> release working code early instead of waiting until everything is complete.
>
> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
>
>>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>>> format name "sff" to mean the full read sequence (with mixed
>>> case, upper case for the good sequence, lower case for any
>>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>>> the trimmed sequences. I would encourage you to do the
>>> same, as part of the general aim of having consistent
>>> sequence format names between BioPerl, Biopython, and
>>> EMBOSS, where possible.
>>
>> I agree, consistency is good.
>
> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
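
The "sff" versus "sff-trim" convention described in the quoted text can be illustrated with a short Python sketch. The format names are the ones given in the message; the function names and the clip coordinates (0-based, half-open slice bounds of the good region) are assumptions made for this example and are not the SFF header's own coordinate convention or any library's API.

```python
def full_read(seq: str, clip_left: int, clip_right: int) -> str:
    """The 'sff' view: good region in upper case, clipped flanks in lower case.

    clip_left and clip_right are hypothetical 0-based, half-open bounds
    of the good (unclipped) region.
    """
    return (
        seq[:clip_left].lower()
        + seq[clip_left:clip_right].upper()
        + seq[clip_right:].lower()
    )


def trimmed_read(seq: str, clip_left: int, clip_right: int) -> str:
    """The 'sff-trim' view: only the good region, in upper case."""
    return seq[clip_left:clip_right].upper()
```

Both views come from the same stored read and clip values, so a parser that keeps the full sequence plus the clip positions can serve either format name.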
>
> Peter
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l