[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent

Thu May 2 09:52:19 UTC 2013

On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu <zhigangwu.bgi at gmail.com> wrote:
> Hi Peter and all,
> Thanks for the long explanation.
> I got much better understand of this project though I am still confusing on
> how to implement the lazy-loading parser for feature rich files (EMBL,
> GenBank, GFF3).
> Since the deadline is pretty close,I decided to post my premature of
> proposal for this project. It would be great if you all can given me some
> comments and suggestions. The proposal is available here.
> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing
> Thank you all in advance.
>
> Zhigang

Hi Zhigang,

I've posted a few comments there, but it would be a good idea
to put the draft on Google Melange soon. I see you've posted
the Google Doc on the NESCent Google+ as well, good.

Looking at the current draft, you don't yet have a timeline. This
is vital - and it should include writing tests (as you write code -
not all at the end) and documentation (which can come after
the code).

In the community bonding period you could write that you
plan to set up your development environment including
multiple versions of Python (at least Python 2.6, Python 3,
Jython 2.7, and PyPy 2.0 to cover the main variants).

For instance, it would make sense to start with learning about
faidx and how its indexing works, and trying to reproduce it in
Python code, and then wrapping that in a SeqRecord style
API. Include writing and evaluating some benchmarks too -
you may need to learn how to profile Python code for this,
since speed is one of the reasons for wanting lazy loading
(lower memory usage is the other main driver).
That could be the first few weeks perhaps?
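
To give a rough idea of what reproducing the faidx approach in
Python might involve, here is a minimal sketch. It is not the
samtools implementation, and the function names build_fai_index
and fetch are made up for illustration. It records the same
per-record values that a .fai file holds (sequence length, byte
offset of the first base, bases per line, bytes per line) and
uses them to seek straight to a sub-sequence. Like faidx, it
assumes every line within a record has the same length except
the last, and it assumes each header has a non-empty name.

```python
import io


def build_fai_index(handle):
    """Scan a binary FASTA handle once.

    Returns {name: (length, offset, linebases, linewidth)}, the
    same per-record values a faidx .fai file stores.
    """
    index = {}
    name = None
    length = offset = linebases = linewidth = 0
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            if name is not None:
                index[name] = (length, offset, linebases, linewidth)
            # Record name is the first word after ">" (assumed non-empty).
            name = line[1:].split(None, 1)[0].decode("ascii")
            length = linebases = linewidth = 0
            offset = handle.tell()  # byte position of the first base
        elif name is not None:
            bases = len(line.rstrip(b"\r\n"))
            if linewidth == 0 and bases:
                # First sequence line fixes the layout for this record.
                linebases, linewidth = bases, len(line)
            length += bases
    if name is not None:
        index[name] = (length, offset, linebases, linewidth)
    return index


def fetch(handle, index, name, start, end):
    """Return bases [start:end) (0-based) by seeking, not parsing."""
    length, offset, linebases, linewidth = index[name]
    start = max(0, start)
    end = min(end, length)
    if start >= end:
        return ""
    # Convert base coordinates to byte coordinates, allowing for
    # the newline bytes at the end of each full line.
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + (end // linebases) * linewidth + end % linebases
    handle.seek(first)
    raw = handle.read(last - first)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Small in-memory example; a real file would be opened in "rb" mode.
demo = io.BytesIO(b">seq1 desc\nACGT\nACGT\nAC\n>seq2\nGGGCC\nTT\n")
demo_index = build_fai_index(demo)
print(demo_index)
print(fetch(demo, demo_index, "seq1", 2, 9))
```

A lazy SeqRecord style object could then hold just the handle
and the index entry, calling something like fetch only when a
slice of the sequence is requested. That would also give a
simple baseline to benchmark against full parsing.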

Regards,

Peter