[Biopython-dev] GSoC 2014: another aspirant here

Thu Mar 13 00:06:51 UTC 2014

Hello,

My name is Evan Parker, and I am a third-year graduate student studying
analytical chemistry at UC Davis. Coding was my hobby in undergrad and has
become a major component of my current graduate work in the context of
mass-spectral interpretation software. I use Biopython for parsing UniProt
sequence data/annotations and I would be delighted to have the opportunity
to give back, especially under the umbrella of the Google Summer of Code.

The project on implementing an indexing & lazy-loading sequence parser
looks interesting to me and, while difficult, it is something that I could
wrap my mind around. I apologize in advance for the wall of text, but if you
have the time, I'd like to ask a couple of questions about implementation
as I prepare my proposal.

1) Should the lazy loading be done primarily in the context of records
returned from the SeqIO.index() dict-like object, or should it also be
available to records yielded by the generator that SeqIO.parse() returns?
The project idea in the wiki mentions adding lazy-loading to SeqIO.parse(),
but it seems to me that the best implementation of lazy loading would be
significantly different for these two SeqIO functions. My initial
impression is that SeqIO.parse() would stage a file segment and generate
information selectively when it is requested, while SeqIO.index() would use
a more detailed map created at instantiation to pull information selectively.

2) Is slower instantiation an acceptable trade-off for memory efficiency?
In the current implementation of SeqIO.index(), sequence files are read
twice: once to record the starting position of each entry and a second time
to load the SeqRecord requested by __getitem__(). A lazy-loading parser
could make this worse if it works by indexing locations other than the
start of the record. The alternative approach of passing the complete
textual sequence record and parsing it selectively would be easier to
implement (and would be compatible with both parse and index), but it seems
that it would be slower on access and potentially less memory efficient.
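
To make the two approaches in question 2 concrete, here is a minimal sketch
in plain Python of how I currently picture them. It does not use Biopython
and is not how SeqIO is implemented; build_offset_index, read_raw_record and
LazyFastaRecord are hypothetical names made up for illustration, and the
code assumes simple, well-formed FASTA where every header has an id.

```python
import io


def build_offset_index(handle):
    """First read: map each record id to the byte offset of its '>' line."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Assumption: the id is the first whitespace-delimited token.
            record_id = line[1:].split(None, 1)[0].decode("ascii")
            index[record_id] = offset
    return index


def read_raw_record(handle, offset):
    """Second read: seek to a stored offset and return one record's raw bytes."""
    handle.seek(offset)
    lines = [handle.readline()]  # the header line
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break  # end of file or start of the next record
        lines.append(line)
    return b"".join(lines)


class LazyFastaRecord:
    """Hypothetical record: holds the raw text, parses the sequence on demand."""

    def __init__(self, raw):
        self._raw = raw
        self._seq = None
        self.parse_count = 0  # only here to show that parsing happens once

    @property
    def id(self):
        header = self._raw.split(b"\n", 1)[0]
        return header[1:].split(None, 1)[0].decode("ascii")

    @property
    def seq(self):
        if self._seq is None:
            self.parse_count += 1
            parts = self._raw.split(b"\n", 1)
            body = parts[1] if len(parts) > 1 else b""
            # Join the sequence lines, dropping newlines and other whitespace.
            self._seq = b"".join(body.split()).decode("ascii")
        return self._seq


# Made-up example data, held in memory instead of a real file.
data = io.BytesIO(b">sp1 first protein\nMKT\nAYI\n>sp2 second protein\nGGSW\n")
index = build_offset_index(data)
record = LazyFastaRecord(read_raw_record(data, index["sp2"]))
```

Here the index stores only record starts, so building it stays cheap, but
each lazy record still holds its full raw text until a field is requested.
Indexing positions inside each record would avoid holding that text, at the
cost of a slower first pass.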

Any thoughts and comments are appreciated,

- Evan