[Biopython-dev] Gsoc 2014: another aspirant here

Wibowo Arindrarto w.arindrarto at gmail.com
Thu Mar 13 09:04:16 UTC 2014


Hi Evan,

Thank you for your interest in the project :). It's good to know
you're already quite familiar with SeqIO as well.

My replies are below.

> 1) Should the lazy loading be done primarily in the context of records
> returned from the SeqIO.index() dict-like object, or should the lazy
> loading be available to the generator made by SeqIO.parse()? The project
> idea in the wiki mentions adding lazy-loading to SeqIO.parse() but it seems
> to me that the best implementation of lazy loading in these two SeqIO
> functions would be significantly different. My initial impression is that
> SeqIO.parse() would stage a file segment and selectively generate
> information when called, while SeqIO.index() would use a more detailed
> map created at instantiation to pull information selectively.

We don't necessarily have to restrict ourselves to SeqIO.index()
objects here. As you've noticed, SeqIO.index() indexes complete
records only, with no finer granularity down to possible
subsequences. What we're looking for is compatibility with our
existing SeqIO parsers. The lazy parser may well be a new object
implemented alongside SeqIO, but the parsing logic itself (whose
invocation the lazy parser delays) should rely on the existing
parsers.
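
To make that concrete, here is a very rough sketch of what such a
delayed parse could look like for a single record. The class name,
attribute names and constructor arguments are only illustrative, not
an existing Biopython interface; the point is that the real work is
still done by SeqIO.parse(), just later:

from Bio import SeqIO


class LazyRecordProxy:
    """Delays full parsing until a SeqRecord attribute is first needed."""

    def __init__(self, filename, fmt, offset):
        self._filename = filename   # path to the sequence file
        self._fmt = fmt             # e.g. "fasta" or "genbank"
        self._offset = offset       # byte offset where the record starts
        self._record = None         # the parsed SeqRecord, filled on demand

    def _parse(self):
        # Reuse the existing SeqIO parser; only its invocation is delayed.
        if self._record is None:
            with open(self._filename) as handle:
                handle.seek(self._offset)
                self._record = next(SeqIO.parse(handle, self._fmt))
        return self._record

    def __getattr__(self, name):
        # Accessing .seq, .id, .features, etc. triggers the real parse.
        return getattr(self._parse(), name)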

> 2) Is slower instantiation an acceptable trade-off for memory efficiency?
> In the current implementation of SeqIO.index(), sequence files are read
> twice, once to annotate beginning points of entries and a second time to
> load the SeqRecord requested by __getitem__(). A lazy-loading parser could
> amplify this issue if it works by indexing locations other than the start
> of the record. The alternative approach of passing the complete textual
> sequence record and selectively parsing would be easier to implement (and
> would include dual compatibility with parse and index) but it seems that it
> would be slower when called and potentially less memory efficient.

I think this will depend on what you want to store in the indices and
how you store them, which will most likely differ per sequence file
format. Working this out is, we expect, an important part of the
project. Doing a first pass over the file to build the index is
acceptable, and instantiating the object from that index doesn't
necessarily have to be slow. Retrieving the actual (sub)sequence will
be slower, since that is when we touch the disk and do the actual
parsing, but this can also be improved, perhaps by caching the result
so that subsequent retrievals are faster. One important point (and the
use case we envision for this project) is that subsequences in large
sequence files (genome assemblies, for example) can be retrieved quite
quickly.
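
As a toy illustration of the two-pass idea, restricted to FASTA (the
class name and index layout are assumptions, not an existing
interface): the first pass only records byte offsets, retrieval seeks
to the stored offset and reuses SeqIO.parse(), and a small cache makes
repeated retrievals faster:

from Bio import SeqIO


class OffsetIndex:
    """First pass stores record offsets; records are parsed only on access."""

    def __init__(self, filename, fmt="fasta"):
        self._filename = filename
        self._fmt = fmt
        self._offsets = {}   # record id -> byte offset of the record start
        self._cache = {}     # record id -> SeqRecord, kept after first access
        self._build_index()

    def _build_index(self):
        # Pass 1: scan the file once, remembering where each record begins.
        with open(self._filename) as handle:
            offset = handle.tell()
            line = handle.readline()
            while line:
                if line.startswith(">"):
                    record_id = line[1:].split(None, 1)[0]
                    self._offsets[record_id] = offset
                offset = handle.tell()
                line = handle.readline()

    def __getitem__(self, record_id):
        # Pass 2 (on demand): seek to the stored offset and parse one record.
        if record_id not in self._cache:
            with open(self._filename) as handle:
                handle.seek(self._offsets[record_id])
                self._cache[record_id] = next(SeqIO.parse(handle, self._fmt))
        return self._cache[record_id]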

Take a look at some existing indexing implementations, such as
faidx[1] for FASTA files and BAM indexing[2]. Looking at the tabix[3]
tool may also help. The faidx index, for example, relies on every
sequence line of a FASTA record having the same length, so a
subsequence can be located with simple arithmetic from the record's
file offset, without reading the whole record.
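
Roughly, with a fixed number of bases per line and bytes per line the
byte position of any base can be computed directly (the function and
variable names below are illustrative, not taken from the .fai
specification):

def subsequence_offset(seq_offset, start, bases_per_line, bytes_per_line):
    """Byte offset in the file of 0-based base `start` of a record.

    `seq_offset` is the offset of the record's first sequence base
    (i.e. just past the ">" header line), as a faidx-style index
    stores it.
    """
    full_lines = start // bases_per_line    # complete sequence lines to skip
    remainder = start % bases_per_line      # bases into the final line
    return seq_offset + full_lines * bytes_per_line + remainder


# Toy check: sequence starts at byte 7, with 60 bases plus a newline
# per line (61 bytes), so base 130 sits 2 full lines + 10 bases in:
# 7 + 2 * 61 + 10 == 139
assert subsequence_offset(7, 130, 60, 61) == 139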

Hope this gives you some useful hints. Good luck with your proposal :).

Cheers,
Bow

[1] http://samtools.sourceforge.net/samtools.shtml
[2] http://samtools.github.io/hts-specs/SAMv1.pdf
[3] http://bioinformatics.oxfordjournals.org/content/27/5/718


