[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent

Markus Piotrowski Markus.Piotrowski at ruhr-uni-bochum.de
Fri May 3 06:32:43 UTC 2013


Hi Zhigang,

Sequence read files from Next Generation Sequencing methods are several 
GB in size. I don't know whether they are regularly stored in GFF files, though.

Best,

Markus

On 2013-05-02 23:18, Zhigang Wu wrote:
> Hi Chris and All,
>
> In your comments on my proposal, you mentioned that some GFF files
> may be several GB in size.
> After seeing that comment, I wanted to get a rough idea of how
> large the GFF files people typically work with are.
> I mainly work on plants and am not very familiar with animals.
> Below is a list of organisms which, to my knowledge from reading
> papers, most people work with.
>
> organism (genome size)    size of GFF   (URLs point to FTP *folders*,
> not huge files, so feel free to click them)
>
> Arabidopsis (~120MB)      44MB
>   ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/
> rice (~450MB)             77MB
>   ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/
> corn (3GB)                87MB
>   http://ftp.maizesequence.org/release-5b/filtered-set/
> D. melanogaster           450MB
>   ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/
> C. elegans                (site going down)
>   http://wiki.wormbase.org/index.php/Downloads#GFF2
> H. sapiens (3GB)          170MB
>   http://galaxy.raetschlab.org/library_common/browse_library?show_deleted=False&cntrller=library&use_panels=False&id=2f94e8ae9edff68a
>
> My point is that caching GFF files in memory isn't as bad as we
> thought. Any comments or suggestions are welcome.
>
> Best,
>
>
> Zhigang
>
>
>
>
> On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell <chris.mit7 at gmail.com> 
> wrote:
>
>> Hi Zhigang,
>>
>> I left some comments on your proposal.  As I said there, I think
>> you need to find and look at a variety of GFF/GTF files to see
>> where your implementation breaks down.  Also, for parsing, I would
>> focus on optimizing how quickly the user can access attributes;
>> they're the bits people care most about (where is gene X, what is
>> the FPKM of isoform Y?, etc.).
>>
>> Chris
>>
>>
>> On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu
>> <zhigangwu.bgi at gmail.com> wrote:
>>
>>> Hi Peter and all,
>>> Thanks for the long explanation.
>>> I now have a much better understanding of this project, though I
>>> am still confused about how to implement the lazy-loading parser
>>> for feature-rich files (EMBL, GenBank, GFF3).
>>> Since the deadline is pretty close, I decided to post my premature
>>> draft of the proposal for this project. It would be great if you
>>> all could give me some comments and suggestions. The proposal is
>>> available here:
>>> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing
>>>
>>> Thank you all in advance.
>>>
>>>
>>> Zhigang
>>>
>>>
>>>
>>> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock
>>> <p.j.a.cock at googlemail.com> wrote:
>>>
>>> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu 
>>> <zhigangwu.bgi at gmail.com>
>>> > wrote:
>>> > > Peter,
>>> > >
>>> > > Thanks for the detailed explanation. It's very helpful. I am
>>> > > not quite sure about the goal of the lazy-loading parser.
>>> > > Let me try to summarize what the goals of lazy loading are and
>>> > > how lazy loading would work. Please correct me if necessary.
>>> > > Below I use a fasta/fastq file as an example. The idea should
>>> > > generally apply to other formats such as GenBank/EMBL, as you
>>> > > mentioned.
>>> > >
>>> > > Lazy loading is useful under the assumption that, given a
>>> > > large file, we are interested in part of its information but
>>> > > not all of it. For example, given a fasta file containing the
>>> > > Arabidopsis genome, we may only be interested in the sequence
>>> > > of chr5 from positions 2000-3000.
>>> > > Rather than parsing the whole file and storing each record in
>>> > > memory, as most parsers do, during the indexing step a lazy
>>> > > loading parser will store only a little positional
>>> > > information, such as the access positions (readily usable for
>>> > > seek) of all chromosomes (chr1, chr2, chr3, chr4, chr5, ...),
>>> > > and maybe positional index information such as the access
>>> > > positions of every 1000bp for each sequence in the given file.
>>> > > After indexing, we store this information in a dictionary like
>>> > > the following: {'chr1':{0:access_pos, 1000:access_pos,
>>> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos,
>>> > > 2000:access_pos, ...}, 'chr3':...}.
>>> > >
>>> > > Compared with the usual parser, which tends to parse the
>>> > > whole file, we gain three benefits: speed, lower memory usage
>>> > > and random access. Speed is gained because we skip a lot
>>> > > during the parsing step. Going back to my example, once we
>>> > > have the dictionary, we can just seek to the access position
>>> > > of chr5:2000 and start reading and parsing from there. Lower
>>> > > memory usage comes from storing only the access positions of
>>> > > each record, as a dictionary in memory.
>>> > >
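>>> > > To make this concrete, here is a minimal sketch of the idea
>>> > > in plain Python (all the names here are mine, purely for
>>> > > illustration):
>>> > >
>>> > > from bisect import bisect_right
>>> > >
>>> > > def build_fasta_index(path, step=1000):
>>> > >     """One scan: {record_id: [(seq_pos, file_offset), ...]},
>>> > >     a mark roughly every `step` bases, each at a line start."""
>>> > >     index = {}
>>> > >     with open(path, "rb") as handle:
>>> > >         name = None
>>> > >         seq_pos = 0
>>> > >         line_start = handle.tell()
>>> > >         line = handle.readline()
>>> > >         while line:
>>> > >             if line.startswith(b">"):
>>> > >                 name = line[1:].split()[0].decode()
>>> > >                 seq_pos = 0
>>> > >                 index[name] = [(0, handle.tell())]  # 1st seq line
>>> > >             elif name:
>>> > >                 if seq_pos - index[name][-1][0] >= step:
>>> > >                     index[name].append((seq_pos, line_start))
>>> > >                 seq_pos += len(line.strip())
>>> > >             line_start = handle.tell()
>>> > >             line = handle.readline()
>>> > >     return index
>>> > >
>>> > > def fetch(path, index, name, start, end):
>>> > >     """Bases [start:end) of record `name` (0-based)."""
>>> > >     marks = index[name]
>>> > >     # Jump to the last access position at or before `start`.
>>> > >     i = bisect_right([pos for pos, _ in marks], start) - 1
>>> > >     seq_pos, offset = marks[i]
>>> > >     parts = []
>>> > >     with open(path, "rb") as handle:
>>> > >         handle.seek(offset)
>>> > >         for line in iter(handle.readline, b""):
>>> > >             if line.startswith(b">"):
>>> > >                 break  # ran into the next record
>>> > >             piece = line.strip()
>>> > >             if seq_pos + len(piece) > start:
>>> > >                 parts.append(piece[max(0, start - seq_pos):end - seq_pos])
>>> > >             seq_pos += len(piece)
>>> > >             if seq_pos >= end:
>>> > >                 break
>>> > >     return b"".join(parts).decode()
>>> > >
>>> > > So fetch(path, index, 'chr5', 2000, 3000) seeks close to
>>> > > chr5:2000 and reads just a few lines, never touching the rest
>>> > > of the file.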
>>> > >
>>> > > Best,
>>> > >
>>> > > Zhigang
>>> >
>>> > Hi Zhigang,
>>> >
>>> > Yes - that's the basic idea of a disk based lazy loader. Here
>>> > the data stays on the disk until needed, so generally this uses
>>> > very little memory but can be slow as it needs to read from
>>> > the disk. An existing example already in Biopython is our
>>> > BioSQL bindings, which present a SeqRecord subclass that
>>> > only retrieves values from the database on demand.
>>> >
>>> > Note in the case of FASTA, we might want to use the existing
>>> > FAI index files from Heng Li's faidx tool (or another existing
>>> > index scheme). That relies on each record using a consistent
>>> > line wrapping length, so that seek offsets can be easily
>>> > calculated.
>>> >
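>>> > For instance, each line of a .fai file stores the record name,
>>> > its length, the file offset of its first base, the bases per
>>> > line, and the bytes per line (including the newline), so the
>>> > offset of any base is simple arithmetic; a sketch:
>>> >
>>> > def fai_offset(first_base_offset, linebases, linewidth, pos):
>>> >     """File offset of the 0-based sequence position `pos`."""
>>> >     full_lines, remainder = divmod(pos, linebases)
>>> >     return first_base_offset + full_lines * linewidth + remainder
>>> >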
>>> > An alternative idea is to load the data into memory (so that the
>>> > file is not touched again, useful for stream processing where
>>> > you cannot seek within the input data) but it is only parsed into
>>> > Python objects on demand. This would use a lot more memory,
>>> > but should be faster as there is no disk seeking and reading
>>> > (other than the one initial read). For FASTA this wouldn't help
>>> > much but it might work for EMBL/GenBank.
>>> >
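>>> > Roughly sketched (a hypothetical class, not an existing
>>> > Biopython API), that variant might look like:
>>> >
>>> > class LazyRecord:
>>> >     """Holds one record's raw text; parses on first access."""
>>> >     def __init__(self, raw_text):
>>> >         self._raw = raw_text    # already read into memory
>>> >         self._features = None   # not parsed yet
>>> >
>>> >     @property
>>> >     def features(self):
>>> >         if self._features is None:
>>> >             self._features = self._parse(self._raw)
>>> >         return self._features
>>> >
>>> >     def _parse(self, raw):
>>> >         # Format-specific parsing (e.g. the GenBank feature
>>> >         # table) would go here; it runs once, on demand.
>>> >         return []
>>> >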
>>> > Something to beware of with any lazy loading / lazy parsing is
>>> > what happens if the user tries to edit the record. Do you want
>>> > to allow this (it makes the code more complex) or not (simpler,
>>> > and still very useful)?
>>> >
>>> > In terms of usage examples, for things like raw NGS data this
>>> > is (currently) made up of lots and lots of short sequences (under
>>> > 1000bp). Lazy loading here is unlikely to be very helpful - 
>>> unless
>>> > perhaps you can make the FASTQ parser faster this way?
>>> > (Once the reads are assembled or mapped to a reference,
>>> > random access to look up reads by their mapped location is
>>> > very, very important, hence the BAI indexing of BAM files.)
>>> >
>>> > In terms of this project, I was thinking about a SeqRecord
>>> > style interface extending Bio.SeqIO (but you can suggest
>>> > something different for your project).
>>> >
>>> > What I saw as the main use case here is large datasets like
>>> > whole chromosomes in FASTA format or richly annotated
>>> > formats like EMBL, GenBank or GFF3. Right now if I am
>>> > doing something with (for example) the annotated human
>>> > chromosomes, loading these as GenBank files is quite
>>> > slow (it takes a fair amount of memory too, but that isn't
>>> > my main worry). A lazy loading approach should let me
>>> > 'load' the GenBank files almost instantly, and delay
>>> > reading specific features or sequence from the disk
>>> > until needed.
>>> >
>>> > For example, I might have a list of genes for which I wish
>>> > to extract the annotation or sequence for - and there is no
>>> > need to load all the other features or the rest of the genome.
>>> >
>>> > (Note we can already do this by loading GenBank files
>>> > into a BioSQL database and accessing them that way.)
>>> >
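>>> > In other words, something like this purely hypothetical usage
>>> > (no such lazy option exists in Bio.SeqIO today):
>>> >
>>> > from Bio import SeqIO
>>> >
>>> > # Hypothetical flag: read the header instantly, defer the rest.
>>> > record = SeqIO.read("chr17.gb", "genbank", lazy=True)
>>> > wanted = [f for f in record.features  # fetched from disk on demand
>>> >           if f.type == "CDS"
>>> >           and "BRCA1" in f.qualifiers.get("gene", [])]
>>> >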
>>> > Regards,
>>> >
>>> > Peter
>>> >
>>
>>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


