[BioRuby] GSOC

Loris Cro l.cro at campus.unimib.it
Fri May 23 13:34:24 UTC 2014


Hi Eric,

I did look at GEMINI.
There are some differences in both the goals of the projects and the
underlying database that make the projects not very similar in
terms of data model. More specifically:

* GEMINI is only for human data and thus it makes a lot of
assumptions about what data goes in and what users want to
know.

* GEMINI uses sqlite as a database engine.

In our case (maybe Francesco can correct me and/or expand on that)
the goal is to make no assumptions about the contents of the VCF
files and, more generally, to make the system as easy to use as
possible (so users should not have to deal with SQL, for example).

Regarding my miniparser, it's just a test to see how fast a parser could
get (almost 400k records/second using pypy :) ). Mainly I wanted a
practical understanding of the VCF specification.
So I haven't really made any improvements over the existing parsers, I
just tried to take shortcuts whenever possible.
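
To give an idea of what "taking shortcuts" can look like, here is a minimal
sketch. It is not the actual miniparser: the function names and the returned
dict layout are made up for illustration, and it assumes well-formed,
tab-separated data lines with all 8 fixed columns. It skips header parsing,
type conversion of INFO values and per-sample parsing entirely.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    # Shortcut: values stay strings, flags become True, no header lookup.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": qual,        # left as a string ('.' means missing)
        "FILTER": flt,
        "INFO": info_dict,
        "FORMAT": fields[8] if len(fields) > 8 else None,
        "SAMPLES": fields[9:],  # shortcut: sample columns are not parsed
    }


def parse_vcf(lines):
    """Yield one record per data line, skipping '#' header lines."""
    for line in lines:
        if line.startswith("#"):
            continue
        yield parse_vcf_line(line)
```

Something like `for rec in parse_vcf(open("file.vcf")): ...` would then walk
a file record by record without making any assumption about which INFO keys
are present.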


2014-05-22 1:40 GMT+02:00 Eric Talevich <eric.talevich at gmail.com>:

> On Mon, May 12, 2014 at 9:12 AM, Loris Cro <l.cro at campus.unimib.it> wrote:
>
>> I'm trying to write a list of all the problems that must be addressed:
>>
>>
>> https://github.com/kappaloris/GSoC-2014-OBF/blob/master/problems-features.md
>>
>> For now I believe I should try to fill the first section as much as
>> possible and I wouldn't mind some input in that regard.
>>
>> I stubbed a possible data model that would preserve all the information
>> present in the VCF files, considering also the possibility of having
>> multiple reference genomes inside a single collection.
>>
>> https://gist.github.com/kappaloris/462082314dc2e940ba4e
>>
>> How to merge the results of queries is still TBD, tho.
>>
>>
> Did you look at GEMINI's data model yet?
> https://github.com/arq5x/gemini/blob/master/gemini/database.py
>
> This system is in active use so it should be able to cover a fair number
> of real-world edge cases.
>
> Also, since you're coding in Python here, have you considered using PyVCF
> or the unmerged Biopython one that was written for a previous GSoC?
> https://github.com/lennax/biopython/tree/variant2/Bio/Variant
>
> If you make improvements in either of those to improve robustness,
> upstream would probably appreciate your patches.
>


