[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent

Sat Apr 27 20:40:52 UTC 2013

On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu <zhigangwu.bgi at gmail.com> wrote:
> Peter,
>
> Thanks for the detailed explanation. It's very helpful. I am not quite
> sure about the goal of the lazy-loading parser.
> Let me try to summarize what the goals of lazy-loading are and how
> lazy-loading would work. Please correct me if necessary. Below I use
> a fasta/fastq file as an example. The idea should generally apply to
> other formats such as GenBank/EMBL, as you mentioned.
>
> Lazy-loading is useful under the assumption that, given a large file,
> we are interested in only part of the information in it, not all of it.
> For example, if a fasta file contains the Arabidopsis genome, we may only
> be interested in the sequence of chr5 from index positions 2000-3000.
> Rather than parsing the whole file and storing each record in memory
> as most parsers do, during the indexing step a lazy loading
> parser will only store a little position information, such as access
> positions (readily usable for seek) for all chromosomes (chr1, chr2,
> chr3, chr4, chr5, ...) and maybe position index information such as
> the access positions for every 1000bp within each sequence in
> the given file. After indexing, we store this information in a
> dictionary like the following: {'chr1':{0:access_pos, 1000:access_pos,
> 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos,
> 2000:access_pos, ...}, 'chr3'...}.
>
> Compared to the usual parser, which tends to parse the whole file, we
> gain three benefits: speed, less memory usage and random access. Speed
> is gained because we skip a lot during the parsing step. Going back to
> my example, once we have the dictionary, we can just seek to the
> access position of chr5:2000 and start reading and parsing from there.
> Less memory is used because we only store the access positions for each
> record as a dictionary in memory.
>
>
> Best,
>
> Zhigang

Hi Zhigang,

Yes - that's the basic idea of a disk based lazy loader. Here
the data stays on the disk until needed, so generally this is
very low memory but can be slow as it needs to read from
the disk. An existing example already in Biopython is our
BioSQL bindings which present a SeqRecord subclass which
only retrieves values from the database on demand.
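
As a rough sketch of the disk based idea, here is a plain Python
version that records one byte offset per FASTA record and only
reads a record's sequence when asked. This is not Biopython code;
the function names build_offset_index and fetch_sequence are made
up for illustration, and it assumes an uncompressed file that can
be opened for seeking.

```python
def build_offset_index(path):
    """Scan a FASTA file once, recording the byte offset of each '>' line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                fields = line[1:].split()
                name = fields[0].decode() if fields else ""
                index[name] = offset
    return index


def fetch_sequence(path, index, name):
    """Seek to one record and read only its sequence lines."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the '>' header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip())
    return b"".join(chunks).decode()
```

Only the dictionary of offsets lives in memory; every call to
fetch_sequence goes back to the disk, which is where the low
memory but potentially slow trade-off comes from.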

Note in the case of FASTA, we might want to use the existing
FAI index files from Heng Li's faidx tool (or another existing
index scheme). That relies on each record using a consistent
line wrapping length, so that seek offsets can be easily
calculated.
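
To illustrate why consistent line wrapping matters, here is a
sketch of the offset arithmetic. It assumes the per-record values
a faidx style index provides (the byte offset of the first base,
the number of bases per line, and the number of bytes per line
including the line ending). The function names are hypothetical
and positions are 0-based.

```python
def fai_byte_offset(seq_offset, line_bases, line_width, position):
    """Byte offset in the file of the 0-based base `position` of a record."""
    full_lines, remainder = divmod(position, line_bases)
    return seq_offset + full_lines * line_width + remainder


def fetch_region(path, seq_offset, line_bases, line_width, start, end):
    """Return bases [start, end) of one record (0-based, half-open)."""
    if end <= start:
        return ""
    first = fai_byte_offset(seq_offset, line_bases, line_width, start)
    last = fai_byte_offset(seq_offset, line_bases, line_width, end - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    # The slice may span line endings, so strip them out.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With this, a request like chr5:2000-3000 needs one seek and one
short read, no matter how large the file is.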

An alternative idea is to load the data into memory (so that the
file is not touched again, useful for stream processing where
you cannot seek within the input data) but it is only parsed into
Python objects on demand. This would use a lot more memory,
but should be faster as there is no disk seeking and reading
(other than the one initial read). For FASTA this wouldn't help
much but it might work for EMBL/GenBank.
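
A minimal sketch of this in-memory variant follows. The raw text
of each record is read once from a (possibly non-seekable) handle
and kept as a string, and is only parsed on first attribute
access. FASTA is used purely because it keeps the example short;
the pay-off would be in formats where parsing is expensive. The
class and function names, and the parse_count counter, are
hypothetical and for illustration only.

```python
class LazyTextRecord:
    """Holds the raw text of one record; parses it only on first access."""

    def __init__(self, raw_text):
        self._raw = raw_text
        self._parsed = None
        self.parse_count = 0  # only here to show when parsing happens

    def _parse(self):
        if self._parsed is None:
            self.parse_count += 1
            header, _, body = self._raw.partition("\n")
            self._parsed = {
                "id": header[1:].split()[0],
                "seq": "".join(body.split()),
            }
        return self._parsed

    @property
    def id(self):
        return self._parse()["id"]

    @property
    def seq(self):
        return self._parse()["seq"]


def load_lazy_records(handle):
    """Read the whole stream once, splitting it into unparsed records."""
    records = []
    chunk = []
    for line in handle:
        if line.startswith(">") and chunk:
            records.append(LazyTextRecord("".join(chunk)))
            chunk = []
        chunk.append(line)
    if chunk:
        records.append(LazyTextRecord("".join(chunk)))
    return records
```

After load_lazy_records returns, the input is never touched again,
and records which are never accessed are never parsed.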

Something to beware of with any lazy loading / lazy parsing is
what happens if the user tries to edit the record. Do you want
to allow this (it makes the code more complex) or not (simpler
and still very useful)?
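
The simpler read-only option could look something like the sketch
below, where the sequence is exposed through a property with no
setter, so any attempt to assign to it raises AttributeError. The
class name and the loader argument (any callable returning the
sequence as a string) are assumptions for illustration, not an
existing Biopython interface.

```python
class ReadOnlyLazyRecord:
    """Read-only record whose sequence is fetched on first access."""

    def __init__(self, loader):
        self._loader = loader  # callable returning the sequence string
        self._seq = None

    @property
    def seq(self):
        # No setter is defined, so `record.seq = ...` raises AttributeError.
        if self._seq is None:
            self._seq = self._loader()
        return self._seq
```

Allowing edits would mean deciding what happens to the cached
value and to the underlying file, which is where the extra
complexity comes from.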

In terms of usage examples, for things like raw NGS data this
is (currently) made up of lots and lots of short sequences (under
1000bp). Lazy loading here is unlikely to be very helpful - unless
perhaps you can make the FASTQ parser faster this way?
(Once the reads are assembled or mapped to a reference,
random access to look up reads by their mapped location is
very very important, thus the BAI indexing of BAM files).

In terms of this project, I was thinking about a SeqRecord
style interface extending Bio.SeqIO (but you can suggest
something different for your project).

What I saw as the main use case here is large datasets like
whole chromosomes in FASTA format or richly annotated
formats like EMBL, GenBank or GFF3. Right now if I am
doing something with (for example) the annotated human
chromosomes, loading these as GenBank files is quite
slow (it takes a fair amount of memory too, but that isn't
my main worry). A lazy loading approach should let me
'load' the GenBank files almost instantly, and delay
reading specific features or sequence from the disk
until needed.

For example, I might have a list of genes for which I wish
to extract the annotation or sequence - and there is no
need to load all the other features or the rest of the genome.

(Note we can already do this by loading GenBank files
into a BioSQL database, and accessing them that way.)

Regards,

Peter