[Biopython-dev] Project ideas for GSoC (or other student projects)

Thu Mar 21 17:36:24 UTC 2013

On Thu, Mar 21, 2013 at 4:29 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Mar 21, 2013 at 4:11 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> Right now we need to put this list of ideas on the wiki page (ready
>> for combining into the OBF page which will be shown to Google
>> to make our case for taking part in the GSoC 2013 program).
>> http://biopython.org/wiki/Google_Summer_of_Code
>>
>> If any of you as a potential mentor want to put up an outline
>> proposal, even better.
>>
>
> I've been wondering about potential GSoC projects which I'd
> be interested in mentoring (or co-mentoring), and thus far I've
> only got one outline idea.
>
> I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
> functionality (which does whole record parsing on demand)
> and extending this with lazy-loading or lazy-parsing (which
> has precedent in our BioSQL wrappers). For example, with
> whole genome FASTA files you may never need to load the
> entire sequence, but using an index system like tabix (or
> even actually using a tabix index) Biopython could provide
> a lazy-loading Seq object which extracts only the sequence
> region of interest on demand.
>
> The same idea applies to richer file formats too, like EMBL
> and GenBank. ...
>
> Likewise, this makes sense for GTF/GFF/GFF3 ...

P.S. An example use case: http://www.biostars.org/p/64363/
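
As a very rough sketch of the lazy-loading idea quoted above (this
is not existing Biopython code; the names build_fasta_index and
LazySequence are made up for illustration, and it assumes every
sequence line within a record has the same width except possibly
the last, as faidx-style indexes require), something along these
lines would read only the requested region from disk:

```python
def build_fasta_index(path):
    """Scan a FASTA file once, recording the layout of each record.

    Hypothetical helper: stores the byte offset of the first base,
    the total sequence length, bases per line and bytes per line.
    """
    index = {}
    with open(path, "rb") as handle:
        entry = None
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                entry = {"offset": handle.tell(), "length": 0,
                         "bases": None, "width": None}
                index[name] = entry
            elif entry is not None:
                stripped = line.rstrip(b"\r\n")
                if not stripped:
                    continue  # ignore blank lines
                if entry["bases"] is None:
                    entry["bases"] = len(stripped)  # bases per line
                    entry["width"] = len(line)      # bytes per line
                entry["length"] += len(stripped)
    return index


class LazySequence:
    """Hypothetical sequence-like object that reads slices on demand."""

    def __init__(self, path, entry):
        self._path = path
        self._entry = entry

    def __len__(self):
        return self._entry["length"]

    def __getitem__(self, region):
        if not isinstance(region, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = region.indices(len(self))
        if step != 1:
            raise ValueError("this sketch only supports a step of 1")
        if start >= stop:
            return ""
        bases, width = self._entry["bases"], self._entry["width"]
        offset = self._entry["offset"]
        # Translate base coordinates into byte positions in the file.
        first = offset + (start // bases) * width + start % bases
        last = offset + ((stop - 1) // bases) * width + (stop - 1) % bases + 1
        with open(self._path, "rb") as handle:
            handle.seek(first)
            chunk = handle.read(last - first)
        # Drop the line breaks that fall inside the requested region.
        return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With an index built once via build_fasta_index("genome.fasta")
(placeholder filename), LazySequence("genome.fasta", index[name])[a:b]
touches only the bytes covering that region.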

Part of this work could include enhancements to how SeqRecord
handles its SeqFeature objects - offering more than just the
current simple list - for example lookup by ID, dbxref, or
position. That would be useful right away, even with the current
in-memory parsers.

An old but still relevant example use case:
http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
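
To make the feature lookup idea concrete, here is a minimal sketch
(the FeatureIndex class is hypothetical, not part of Biopython, and
the choice of "locus_tag" and "gene" as ID qualifiers is an
assumption). It is duck-typed: it only needs each feature to have a
qualifiers dict of lists and a location with start and end, which
is the shape of the entries in a SeqRecord's features list. The
stand-in objects below are just for demonstration.

```python
from collections import defaultdict
from types import SimpleNamespace


class FeatureIndex:
    """Hypothetical lookup wrapper around a plain list of features."""

    def __init__(self, features, id_keys=("locus_tag", "gene")):
        self.features = list(features)
        self._by_id = defaultdict(list)
        self._by_dbxref = defaultdict(list)
        for feature in self.features:
            # Index by any qualifier treated as an identifier.
            for key in id_keys:
                for value in feature.qualifiers.get(key, []):
                    self._by_id[value].append(feature)
            # Index by database cross-reference strings.
            for xref in feature.qualifiers.get("db_xref", []):
                self._by_dbxref[xref].append(feature)

    def by_id(self, identifier):
        return list(self._by_id.get(identifier, []))

    def by_dbxref(self, xref):
        return list(self._by_dbxref.get(xref, []))

    def overlapping(self, start, end):
        """Features overlapping the half-open interval [start, end).

        A linear scan keeps the sketch short; an interval tree or
        sorted list would scale better on large genomes.
        """
        return [f for f in self.features
                if f.location.start < end and start < f.location.end]


def make_feature(start, end, **qualifiers):
    """Stand-in for a parsed feature, for demonstration only."""
    return SimpleNamespace(location=SimpleNamespace(start=start, end=end),
                           qualifiers=qualifiers)


features = [
    make_feature(100, 250, locus_tag=["TAG_0001"], db_xref=["GeneID:1"]),
    make_feature(400, 500, locus_tag=["TAG_0002"], gene=["abcD"]),
]
index = FeatureIndex(features)
```

Here index.by_id("abcD"), index.by_dbxref("GeneID:1") and
index.overlapping(200, 300) each return a list of matching
features, which is empty when nothing matches.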

Regards,

Peter