[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent

Sat Apr 27 19:22:07 UTC 2013

Peter,

Thanks for the detailed explanation. It's very helpful. I am not quite
sure about the goal of the lazy-loading parser.
Let me try to summarize what are the goals of lazy-loading and how
lazy-loading would work. Please correct me if necessary. Below I use
fasta/fastq file as an example. The idea should generally applies to
other format such as GenBank/EMBL as you mentioned.

Lazy-loading is useful under the assumption that given a large file,
we are interested in partial information of it but not all of them.
For example a fasta file contains Arabidopsis genome, we only
interested in the sequence of chr5 from index position from 2000-3000.
Rather than parsing the whole file and storing each record in memory
as most parsers will do,  during the indexing step, lazy loading
parser will only store a few position information, such as access
positions (readily usable for seek) for all chromosomes (chr1, chr2,
chr3, chr4, chr5, ...) and may be position index information such as
the access positions for every 1000bp positions for each sequence in
the given file. After indexing, we store these information in a
dictionary like following {'chr1':{0:access_pos, 1000:access_pos,
2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos,
2000:access_pos,}, 'chr3'...}.

Compared to the usual parser which tends to parsing the whole file, we
gain two benefits: speed, less memory usage and random access. Speed
is gained because we skipped a lot during the parsing step. Go back to
my example, once we have the dictionary, we can just seek to the
access position of chr5:2000 and start reading and parsing from there.
Less memory usage is due to we only stores access positions for each
record as a dictionary in memory.

Best,

Zhigang

On Sat, Apr 27, 2013 at 4:20 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Sat, Apr 27, 2013 at 1:52 AM, Zhigang Wu <zhigang.wu at email.ucr.edu> wrote:
>> Hi Peter,
>>
>> I am interested in implementing the lazy-loading sequence parsers.
>> I know the time is pretty tight for me to write an proposal on it. But even
>> I cannot contribute under the umbrella of GSoC and assuming no body is
>> implemented, I am still interested in implementing this (I just wanna have
>> something nice on my CV and while contributing to Open source software
>> community as well). While at this moment, I don't have very clear picture on
>> how to do it. Can you point me to somewhere where I can start to get a sense
>> how this can be implemented. As far as I know, samtools (view) may have
>> similar techniques in them. Thanks.
>>
>>
>> Zhigang
>
> Hi Zhigang,
>
> It isn't too late to write up a proposal for GSoC 2013, but please
> also introduce yourself on the NESCent Phyloinformatics
> Summer of Code community on Google Plus:
> https://plus.google.com/communities/105828320619238393015
>
> The GSoC program is a great chance to spend a few months
> focussed just on one programming project - which can be
> really fun. However, the fact that you're interested in making
> contributions outside of GSoC is great.
>
> I wrote some more about the lazy-loading sequence parsers
> and indexing idea on the biopython-dev mailing list last month:
> http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html
>
> However, lazy-parsing can also be done separately from the
> indexing. This is something I was trying in my experimental
> SAM/BAM parser mentioned on this thread:
> http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010492.html
>
> The basic idea here was that the raw data for each record was
> loaded into memory as a (bytes) string, but not all of it was
> parsed into the individual fields right away. For example, the
> tags get turned into a dictionary only if the user tried to use
> the tag values. Similarly for many of the BAM fields, the binary
> string was only decoded if needed.
>
> I once tried something similar with the FASTQ parser. I wrote
> a subclass to preserve the normal SeqRecord interface, but
> only decode the ASCII encoded quality scores into a list of
> integers if needed. This worked but that attempt did not seem
> to make things any faster.
>
> An example where I think there would be clear benefits to a
> lazy parsing approach is EMBL/GenBank files where parsing
> the features could be delayed (both the complex feature
> location, and their dictionary of annotations).
>
> However, for this to be a successful GSoC project, you
> would need to have a good understanding of Python and
> how our existing parsers work to have a realistic chance
> of completing it. I should be quite a technically exciting
> project, with the hope of being able to show big speedups
> via benchmarks.
>
> Does that help? Is there a particular file format you'd be
> interested in - perhaps something you are already using
> in your projects or work?
>
> Regards,
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev