[Biopython-dev] GSoC 2014: another aspirant here

Fri Mar 14 05:30:13 UTC 2014

Hi Evan,

Focusing on the SeqIO parsers is ok. That's where having lazy parsers
would help most (and you've got a handful of formats there already).
Remember that you'll also need to account for time to write tests,
possibly benchmark or profile the code (lazy parsers should improve
performance after all), and write documentation, outside of writing
the code itself. You'll also want to be clear about this in your
proposed timeline, since that will be your main guide during the
coding period.

Looking forward to reading your proposal :),
Bow

On Thu, Mar 13, 2014 at 8:04 PM, Evan Parker <eparker at ucdavis.edu> wrote:
> Thank you Bow,
>
> I'll need to digest this a bit, but you have given me a good direction. My
> inclination for the proposal is to focus on sequential file formats used to
> transmit 'databases' of sequences (like fasta, embl, uniprot-xml, swiss, and
> others) and to mostly ignore formats used to convey alignment (i.e., anything
> covered exclusively by parsers in AlignIO). If this is a poor direction
> please tell me so that I can add to my preparation.
>
> -Evan
>
> Evan Parker
> Ph.D. Candidate
> Dept. of Chemistry - Lebrilla Lab
> University of California, Davis
>
>
> On Thu, Mar 13, 2014 at 2:04 AM, Wibowo Arindrarto <w.arindrarto at gmail.com>
> wrote:
>>
>> Hi Evan,
>>
>> Thank you for your interest in the project :). It's good to know
>> you're already quite familiar with SeqIO as well.
>>
>> My replies are below.
>>
>> > 1) Should the lazy loading be done primarily in the context of records
>> > returned from the SeqIO.index() dict-like object, or should the lazy
>> > loading be available to the generator made by SeqIO.parse()? The project
>> > idea in the wiki mentions adding lazy-loading to SeqIO.parse() but it
>> > seems to me that the best implementation of lazy loading in these two
>> > SeqIO functions would be significantly different. My initial impression
>> > of the project would be for SeqIO.parse() to stage a file segment and
>> > selectively generate information when called while SeqIO.index() would
>> > use a more detailed map created at instantiation to pull information
>> > selectively.
>>
>> We don't necessarily have to be restricted to SeqIO.index() objects
>> here. You'll notice of course that SeqIO.index() indexes whole
>> records, with no finer granularity down to possible subsequences. What
>> we're looking for is compatibility with our existing SeqIO parsers.
>> The lazy parser may well be a new object implemented alongside SeqIO,
>> but the parsing logic itself (the one whose invocation is delayed by
>> the lazy parser) should rely on existing parsers.
>>
>> > 2) Is slower instantiation an acceptable trade-off for memory
>> > efficiency? In the current implementation of SeqIO.index(), sequence
>> > files are read twice, once to annotate beginning points of entries and
>> > a second time to load the SeqRecord requested by __getitem__(). A
>> > lazy-loading parser could amplify this issue if it works by indexing
>> > locations other than the start of the record. The alternative approach
>> > of passing the complete textual sequence record and selectively parsing
>> > would be easier to implement (and would include dual compatibility with
>> > parse and index) but it seems that it would be slower when called and
>> > potentially less memory efficient.
>>
>> I think this will depend on what you want to store in the indices and
>> how you store them, which will most likely differ per sequence file
>> format. Coming up with this, we expect, is an important part of the
>> project implementation. Doing a first pass for indexing is acceptable.
>> Instantiation of the object using the index doesn't necessarily have
>> to be slow. Retrieval of the actual (sub)sequence will be slower since
>> we will touch the disk and do the actual parsing by then. But this can
>> also be improved, perhaps by caching the result so subsequent
>> retrieval is faster. One important point (and the use case that we
>> envision for this project) is that subsequences in large sequence
>> files (genome assemblies, for example) can be retrieved quite quickly.
>>
>> Take a look at some existing indexing implementations, such as
>> faidx[1] for FASTA files and BAM indexing[2]. Looking at the tabix[3]
>> tool may also help. The faidx indexing, for example, relies on every
>> sequence line within a FASTA record having the same length (except
>> possibly the last), which means a subsequence can be located by
>> arithmetic on the record's file offset and its line lengths.
>>
>> Hope this gives you some useful hints. Good luck with your proposal :).
>>
>> Cheers,
>> Bow
>>
>> [1] http://samtools.sourceforge.net/samtools.shtml
>> [2] http://samtools.github.io/hts-specs/SAMv1.pdf
>> [3] http://bioinformatics.oxfordjournals.org/content/27/5/718
>
>
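
The faidx-style lookup mentioned in the quoted reply can be sketched in
a few lines of standard-library Python. This is only a rough
illustration, not Biopython or samtools code: the names FaiEntry,
build_index and fetch are hypothetical, the four stored fields mirror
the length/offset/linebases/linewidth columns of a .fai file, and the
input is assumed to be a well-formed FASTA file whose records are
wrapped at a uniform width.

```python
import io
from typing import BinaryIO, Dict, NamedTuple


class FaiEntry(NamedTuple):
    """Hypothetical per-record index entry, modelled on the .fai columns."""
    length: int     # number of bases in the record
    offset: int     # byte offset of the first base
    linebases: int  # bases per full sequence line
    linewidth: int  # bytes per full sequence line, newline included


def build_index(handle: BinaryIO) -> Dict[str, FaiEntry]:
    """First pass over the file: note where each record's sequence starts."""
    index: Dict[str, FaiEntry] = {}
    name = None
    length = offset = linebases = linewidth = 0
    pos = 0  # byte position of the line currently being read
    for raw in handle:
        if raw.startswith(b">"):
            if name is not None:
                index[name] = FaiEntry(length, offset, linebases, linewidth)
            # Assumes a non-empty identifier follows the ">".
            name = raw[1:].split()[0].decode("ascii")
            length = linebases = linewidth = 0
            offset = pos + len(raw)
        elif name is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if bases and linebases == 0:
                # The first sequence line defines the wrapping of the record.
                linebases, linewidth = bases, len(raw)
            length += bases
        pos += len(raw)
    if name is not None:
        index[name] = FaiEntry(length, offset, linebases, linewidth)
    return index


def fetch(handle: BinaryIO, entry: FaiEntry, start: int, end: int) -> str:
    """Return bases [start, end) (0-based), reading only the bytes needed."""
    if not 0 <= start <= end <= entry.length:
        raise ValueError("coordinates fall outside the record")
    if start == end:
        return ""

    def byte_pos(i: int) -> int:
        # Whole lines before base i, plus the column of base i in its line.
        return (entry.offset
                + (i // entry.linebases) * entry.linewidth
                + i % entry.linebases)

    first = byte_pos(start)
    last = byte_pos(end - 1) + 1
    handle.seek(first)
    chunk = handle.read(last - first)
    # Drop the line breaks that fall inside the requested span.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Tiny in-memory example; a real file would be opened in binary mode.
data = b">chr1 test\nACGTACGT\nACGTACGT\nACG\n>chr2\nTTTT\nGG\n"
fasta = io.BytesIO(data)
index = build_index(fasta)
print(fetch(fasta, index["chr1"], 6, 10))
```

Only build_index touches the whole file; fetch seeks straight to the
requested bases, which is the property that makes subsequence
retrieval from large assemblies cheap. A lazy parser along these lines
would presumably still hand the retrieved text to the existing SeqIO
parsing logic rather than decode it itself.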