[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent

Sat Apr 27 11:20:57 UTC 2013

On Sat, Apr 27, 2013 at 1:52 AM, Zhigang Wu <zhigang.wu at email.ucr.edu> wrote:
> Hi Peter,
>
> I am interested in implementing the lazy-loading sequence parsers.
> I know the time is pretty tight for me to write a proposal on it. But even
> if I cannot contribute under the umbrella of GSoC, and assuming nobody has
> implemented it yet, I am still interested in implementing this (I just want
> to have something nice on my CV while contributing to the open source
> software community as well). At this moment, I don't have a very clear
> picture of how to do it. Can you point me to somewhere I can start to get a
> sense of how this can be implemented? As far as I know, samtools (view) may
> use similar techniques. Thanks.
>
>
> Zhigang

Hi Zhigang,

It isn't too late to write up a proposal for GSoC 2013, but please
also introduce yourself on the NESCent Phyloinformatics
Summer of Code community on Google Plus:
https://plus.google.com/communities/105828320619238393015

The GSoC program is a great chance to spend a few months
focussed just on one programming project - which can be
really fun. However, the fact that you're interested in making
contributions outside of GSoC is great.

I wrote some more about the lazy-loading sequence parsers
and indexing idea on the biopython-dev mailing list last month:
http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html

However, lazy-parsing can also be done separately from the
indexing. This is something I was trying in my experimental
SAM/BAM parser mentioned on this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010492.html

The basic idea here was that the raw data for each record was
loaded into memory as a (bytes) string, but not all of it was
parsed into the individual fields right away. For example, the
tags get turned into a dictionary only if the user tried to use
the tag values. Similarly for many of the BAM fields, the binary
string was only decoded if needed.
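
As a rough sketch of that idea for a text SAM line (the class and
attribute names here are made up for illustration, and this is not
the code from the experimental parser), the raw bytes are kept and
the tags only become a dictionary on first access:

```python
class LazySamRecord:
    """Hypothetical record wrapping one raw SAM line (illustration only)."""

    def __init__(self, raw):
        self._raw = raw      # the unparsed line, kept as bytes
        self._tags = None    # filled in on first use of .tags

    @property
    def qname(self):
        # Cheap: only split off the first tab separated column.
        return self._raw.split(b"\t", 1)[0].decode("ascii")

    @property
    def tags(self):
        if self._tags is None:
            fields = self._raw.rstrip(b"\r\n").split(b"\t")
            tags = {}
            # SAM has 11 mandatory columns; the rest are TAG:TYPE:VALUE.
            for field in fields[11:]:
                name, typecode, value = field.decode("ascii").split(":", 2)
                if typecode == "i":
                    value = int(value)
                elif typecode == "f":
                    value = float(value)
                # Other types are left as strings in this sketch.
                tags[name] = value
            self._tags = tags
        return self._tags
```

Code which only looks at the read names never pays for building the
tag dictionaries.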

I once tried something similar with the FASTQ parser. I wrote
a subclass to preserve the normal SeqRecord interface, but
only decode the ASCII encoded quality scores into a list of
integers if needed. This worked but that attempt did not seem
to make things any faster.
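
To illustrate the delayed quality decoding, here is a minimal
stand-alone sketch. It is not a SeqRecord subclass like my attempt
was, the names are hypothetical, and it assumes the Sanger style
FASTQ encoding with an ASCII offset of 33:

```python
from functools import cached_property


class LazyFastqRecord:
    """Hypothetical FASTQ record which decodes qualities on demand."""

    def __init__(self, name, sequence, raw_quality, offset=33):
        if len(sequence) != len(raw_quality):
            raise ValueError("Sequence and quality lengths differ")
        self.name = name
        self.sequence = sequence
        self._raw_quality = raw_quality  # bytes, left undecoded
        self._offset = offset

    @cached_property
    def phred_quality(self):
        # Runs once on first access; the list is then stored on the
        # instance, so later accesses do not decode again.
        return [code - self._offset for code in self._raw_quality]
```

A script which only uses the names and sequences never builds the
lists of integers.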

An example where I think there would be clear benefits to a
lazy parsing approach is EMBL/GenBank files where parsing
the features could be delayed (both the complex feature
location, and their dictionary of annotations).
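
A very simplified sketch of what I mean (hypothetical names, not our
existing GenBank parser): the record keeps the raw feature table
lines, and only splits them into keys, locations and qualifiers when
the features are first requested. It assumes the usual GenBank
layout with the feature key starting at column 6 and the location or
qualifier text at column 22, ignores continuation lines, and leaves
the location as an unparsed string:

```python
def parse_feature_lines(lines):
    """Turn raw feature table lines into a list of simple dictionaries."""
    features = []
    for line in lines:
        key = line[5:20].strip()
        text = line[21:].strip()
        if key:
            # A new feature; the location is kept as text in this sketch.
            features.append({"key": key, "location": text, "qualifiers": {}})
        elif text.startswith("/") and features:
            name, _, value = text[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        # Continuation lines are skipped here for brevity.
    return features


class LazyFeatureRecord:
    """Hypothetical record which delays parsing its feature table."""

    def __init__(self, name, feature_lines):
        self.name = name
        self._feature_lines = feature_lines  # raw text, not yet parsed
        self._features = None

    @property
    def features(self):
        if self._features is None:
            self._features = parse_feature_lines(self._feature_lines)
        return self._features
```

The complex location strings could be delayed in the same way, one
level further down.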

However, for this to be a successful GSoC project, you
would need to have a good understanding of Python and
how our existing parsers work to have a realistic chance
of completing it. It should be quite a technically exciting
project, with the hope of being able to show big speedups
via benchmarks.

Does that help? Is there a particular file format you'd be
interested in - perhaps something you are already using
in your projects or work?

Regards,

Peter