[Biopython-dev] GSoC draft proposal - lazy loading SeqIO parsers

Wed Mar 19 17:26:10 UTC 2014

On Wed, Mar 19, 2014 at 4:49 PM, Evan Parker <eparker at ucdavis.edu> wrote:
> Hi all,
>
> I have a rough draft of my GSoC proposal and would appreciate comments from
> anybody who might be willing to eventually mentor this project, or anybody
> who has opinions on implementation. It's about 3 pages of text + several
> figures.
>
> I'll be submitting a final draft Friday on the GSoC website pending your
> comments.
>
> Thank you,
> -Evan

Hi Evan,

That's a nice job so far, although questions about your time
availability will be raised (sadly the GSoC schedule does not
suit every student, since University term dates vary by region).
You are also a PhD student, which is normally a full-time
commitment, so you will need to clear this with your PhD
supervisors: you would be spending a large chunk of time not
working directly on your thesis project, and there can be strict
deadlines for completion.

Here's a selection of points in no particular order:

Have you looked at Bio.SeqIO.index_db(...) which works
like Bio.SeqIO.index(...) but stores the offsets etc in an
SQLite database?
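
To make the idea concrete, here is a rough standard-library sketch
of keeping record offsets in SQLite. It is not the schema or code
Biopython uses; the table layout and the build_offset_index and
fetch_raw names are made up for illustration, and it only handles
simple FASTA files.

```python
import os
import sqlite3
import tempfile


def build_offset_index(fasta_path, db_path):
    """Store the byte offset of every '>' header line in an SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets (key TEXT PRIMARY KEY, offset INTEGER)"
    )
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()  # position of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Hypothetical key choice: first word after the '>'
                key = line[1:].split()[0].decode()
                con.execute(
                    "INSERT OR REPLACE INTO offsets VALUES (?, ?)", (key, offset)
                )
    con.commit()
    return con


def fetch_raw(fasta_path, con, key):
    """Seek straight to one record and return its raw text."""
    row = con.execute(
        "SELECT offset FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]  # the header line itself
        for line in handle:
            if line.startswith(b">"):  # start of the next record
                break
            lines.append(line)
    return b"".join(lines).decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fasta = os.path.join(tmp, "demo.fasta")
        with open(fasta, "wb") as out:
            out.write(b">alpha first\nACGT\n>beta second\nGGCC\n")
        con = build_offset_index(fasta, os.path.join(tmp, "demo.idx"))
        print(fetch_raw(fasta, con, "beta"))
        con.close()
```

The point is that the index survives between runs and does not
have to fit in memory, while each lookup is still one seek.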

When pondering how to design this kind of thing myself,
I had suspected multiple SeqRecProxy classes might be
needed (one per file format potentially), although run
time selection of internal parsing methods might work too.
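
As a sketch of the first option (one proxy class per format), with
a small registry to pick the class at run time. All class and
function names here are hypothetical, and for brevity each proxy is
handed the raw record text rather than file offsets.

```python
class SeqRecProxyBase:
    """Hypothetical base class: keeps raw text, parses only on access."""

    def __init__(self, raw):
        self._raw = raw

    @property
    def name(self):
        # Parsing is deferred until the attribute is actually requested
        return self._parse_name(self._raw)

    def _parse_name(self, raw):
        raise NotImplementedError


class FastaProxy(SeqRecProxyBase):
    def _parse_name(self, raw):
        # First word after '>' on the header line
        return raw.split("\n", 1)[0][1:].split()[0]


class GenBankProxy(SeqRecProxyBase):
    def _parse_name(self, raw):
        # Second whitespace-separated token on the LOCUS line
        return raw.split("\n", 1)[0].split()[1]


# Run-time selection: map a format name to its proxy class
_PROXY_BY_FORMAT = {"fasta": FastaProxy, "genbank": GenBankProxy}


def make_proxy(fmt, raw):
    try:
        return _PROXY_BY_FORMAT[fmt](raw)
    except KeyError:
        raise ValueError("no lazy proxy for format %r" % fmt)


if __name__ == "__main__":
    print(make_proxy("fasta", ">seq1 example\nACGT\n").name)
```

The alternative would be a single class holding a table of
format-specific parsing functions; the trade-off is mostly about
where the per-format code lives.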

I would also ask why not have the slicing of a SeqRecProxy
return another SeqRecProxy? This means creating a new
proxy object with different offset values - but would be fast.
Only when the seq/annotation/etc is accessed would the
proxy have to go to the disk drive. This becomes more
interesting when accessing the features in the slice of
interest (e.g. if the full record was for a whole chromosome
and only region [1000:2000] was of interest).
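
A minimal sketch of what I mean, assuming (unrealistically) that
the file holds nothing but the sequence on one line, so that slice
positions map directly to byte offsets. Real formats would need
line-wrap arithmetic, and LazySeqProxy and its disk_reads counter
are invented for the example.

```python
import os
import tempfile


class LazySeqProxy:
    """Hypothetical proxy over a region of a file holding one raw sequence."""

    def __init__(self, path, start, end):
        self._path = path
        self._start = start  # absolute offset of the first base
        self._end = end  # absolute offset one past the last base
        self.disk_reads = 0  # bookkeeping for the demonstration only

    def __len__(self):
        return self._end - self._start

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("only slices are supported in this sketch")
        start, stop, step = key.indices(len(self))
        if step != 1:
            raise ValueError("only step 1 is supported in this sketch")
        stop = max(stop, start)  # an empty slice gives an empty proxy
        # No I/O here: just a new proxy with narrower offsets
        return LazySeqProxy(self._path, self._start + start, self._start + stop)

    @property
    def seq(self):
        # The only place the file is touched
        self.disk_reads += 1
        with open(self._path, "rb") as handle:
            handle.seek(self._start)
            return handle.read(len(self)).decode("ascii")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "chrom.txt")
        with open(path, "wb") as out:
            out.write(b"ACGT" * 1000)
        whole = LazySeqProxy(path, 0, 4000)
        window = whole[1000:2000]  # cheap, nothing read yet
        print(len(window), whole.disk_reads, window.disk_reads)
        print(window.seq[:8])  # first disk access happens here
```

Slicing a slice just narrows the offsets again, so the cost stays
constant however large the parent record is.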

This idea about windows onto the data is key to how
the SAM/BAM file format is used (coordinate sorting
with an index). Are you familiar with that, or tabix?

Another open question is what to do with file handles,
specifically when to close them (e.g. via garbage collection,
context managers, etc.). The lazy parsing approach may
result in ResourceWarnings as a side effect; see for
example this blog post:
http://emptysqua.re/blog/against-resourcewarnings-in-python-3/
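
One possible shape for the context manager option, as a sketch
only: a single object owns the handle that all lazy records read
through, and closing is explicit. LazyFile and read_at are
hypothetical names, not an existing Biopython API.

```python
import os
import tempfile


class LazyFile:
    """Hypothetical owner of the one handle shared by lazy records."""

    def __init__(self, path):
        self._handle = open(path, "rb")

    @property
    def closed(self):
        return self._handle.closed

    def read_at(self, offset, size):
        if self._handle.closed:
            raise ValueError("I/O operation on closed LazyFile")
        self._handle.seek(offset)
        return self._handle.read(size)

    def close(self):
        self._handle.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Close deterministically rather than waiting for garbage collection
        self.close()
        return False  # do not swallow exceptions


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        with open(path, "wb") as out:
            out.write(b"ACGTACGT")
        with LazyFile(path) as lazy:
            print(lazy.read_at(2, 4))
        print(lazy.closed)
```

Without the with-block (or an explicit close), the handle stays
open until the object is collected, which is when Python 3 would
emit the ResourceWarning about an unclosed file.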

I appreciate you are unlikely to have ready answers to
all of that - I've probably given you a whole load more
background reading. I hope some of the other Biopython
developers (or GSoC mentors on other OBF projects -
you could post this to the OBF GSoC mailing list too)
will have further feedback.

Regards,

Peter