[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent

Wed May 1 14:17:14 UTC 2013

Hi Peter and all,
Thanks for the long explanation.
I got much better understand of this project though I am still confusing on
how to implement the lazy-loading parser for feature rich files (EMBL,
GenBank, GFF3).
Since the deadline is pretty close,I decided to post my premature of
proposal for this project. It would be great if you all can given me some
comments and suggestions. The proposal is available
here<https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing>.
Thank you all in advance.

Zhigang

On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu <zhigangwu.bgi at gmail.com>
> wrote:
> > Peter,
> >
> > Thanks for the detailed explanation. It's very helpful. I am not quite
> > sure about the goal of the lazy-loading parser.
> > Let me try to summarize what are the goals of lazy-loading and how
> > lazy-loading would work. Please correct me if necessary. Below I use
> > fasta/fastq file as an example. The idea should generally applies to
> > other format such as GenBank/EMBL as you mentioned.
> >
> > Lazy-loading is useful under the assumption that given a large file,
> > we are interested in partial information of it but not all of them.
> > For example a fasta file contains Arabidopsis genome, we only
> > interested in the sequence of chr5 from index position from 2000-3000.
> > Rather than parsing the whole file and storing each record in memory
> > as most parsers will do,  during the indexing step, lazy loading
> > parser will only store a few position information, such as access
> > positions (readily usable for seek) for all chromosomes (chr1, chr2,
> > chr3, chr4, chr5, ...) and may be position index information such as
> > the access positions for every 1000bp positions for each sequence in
> > the given file. After indexing, we store these information in a
> > dictionary like following {'chr1':{0:access_pos, 1000:access_pos,
> > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos,
> > 2000:access_pos,}, 'chr3'...}.
> >
> > Compared to the usual parser which tends to parsing the whole file, we
> > gain two benefits: speed, less memory usage and random access. Speed
> > is gained because we skipped a lot during the parsing step. Go back to
> > my example, once we have the dictionary, we can just seek to the
> > access position of chr5:2000 and start reading and parsing from there.
> > Less memory usage is due to we only stores access positions for each
> > record as a dictionary in memory.
> >
> >
> > Best,
> >
> > Zhigang
>
> Hi Zhigang,
>
> Yes - that's the basic idea of a disk based lazy loader. Here
> the data stays on the disk until needed, so generally this is
> very low memory but can be slow as it needs to read from
> the disk. And existing example already in Biopython is our
> BioSQL bindings which present a SeqRecord subclass which
> only retrieves values from the database on demand.
>
> Note in the case of FASTA, we might want to use the existing
> FAI index files from Heng Li's faidx tool (or another existing
> index scheme). That relies on each record using a consistent
> line wrapping length, so that seek offsets can be easily
> calculated.
>
> An alternative idea is to load the data into memory (so that the
> file is not touched again, useful for stream processing where
> you cannot seek within the input data) but it is only parsed into
> Python objects on demand. This would use a lot more memory,
> but should be faster as there is no disk seeking and reading
> (other than the one initial read). For FASTA this wouldn't help
> much but it might work for EMBL/GenBank.
>
> Something to beware of with any lazy loading / lazy parsing is
> what happens if the user tries to edit the record? Do you want
> to allow this (it makes the code more complex) or not (simpler
> and still very useful).
>
> In terms of usage examples, for things like raw NGS data this
> is (currently) made up of lots and lots of short sequences (under
> 1000bp). Lazy loading here is unlikely to be very helpful - unless
> perhaps you can make the FASTQ parser faster this way?
> (Once the reads are assembled or mapped to a reference,
> random access to lookup reads by their mapped location is
> very very important, thus the BAI indexing of BAM files).
>
> In terms of this project, I was thinking about a SeqRecord
> style interface extending Bio.SeqIO (but you can suggest
> something different for your project).
>
> What I saw as the main use case here is large datasets like
> whole chromosomes in FASTA format or richly annotated
> formats like EMBL, GenBank or GFF3. Right now if I am
> doing something with (for example) the annotated human
> chromosomes, loading these as GenBank files is quite
> slow (it takes a far amount of memory too, but that isn't
> my main worry). A lazy loading approach should let me
> 'load' the GenBank files almost instantly, and delay
> reading specific features or sequence from the disk
> until needed.
>
> For example, I might have a list of genes for which I wish
> to extract the annotation or sequence for - and there is no
> need to load all the other features or the rest of the genome.
>
> (Note we can already do this by loading GenBank files
> into a BioSQL database, and access them that way)
>
> Regards,
>
> Peter
>