[Open-bio-l] Fwd: GSOC (VCF DBMS)

Tue Apr 29 22:04:33 UTC 2014

Hi Loris,

You might have a look at another project for querying (moderately) large
numbers of VCF files, GEMINI:
http://gemini.readthedocs.org/en/latest/
http://dx.plos.org/10.1371/journal.pcbi.1003153
https://github.com/arq5x/gemini

As I understand it, your project aims to take this general concept and make
it scalable to even larger numbers of variants and samples by using more
sophisticated database techniques. Is that right?

If the project is very similar to GEMINI in concept, it may even be best to
focus on implementing this functionality for BioRuby instead of creating a
competing project in Python. Or code the databasing side of it in your
language of choice as a separate "service" (using Python, Go, Node.js,
whatever), and also produce a Ruby API for it. Remember that your mentors
will be most effective in helping you with Ruby code (or another language
they know well), and it doesn't hurt to learn another language for GSoC if
you're at least able to do it competently.

Cheers,
Eric

On Tue, Apr 29, 2014 at 11:06 AM, Loris Cro <l.cro at campus.unimib.it> wrote:

> Hi, we're discussing the implementation of my proposal on
> the bioruby ml.
>
> There is some news regarding the problem of computing
> privates that might be of special interest.
>
> Feel free to join the conversation.
>
> ---------- Forwarded message ----------
> From: Loris Cro <l.cro at campus.unimib.it>
> Date: 2014-04-29 19:46 GMT+02:00
> Subject: Re: GSOC
> To: Pjotr Prins <pjotr.public14 at thebird.nl>
> Cc: bioruby at lists.open-bio.org
>
>
> Hi all! Let me start the discussion with some info about what I've
> done, what I'm planning to do next and what questions I need help with.
>
>
> As far as I understand, the project idea published on the OBF
> wiki was primarily to answer the problem of computing privates.
> There are other features that you are interested in, but this
> specific problem was the biggest pain point. I say "was" because
> in fact computing privates is not that hard in the end (no JOINs or
> heavy denormalization required anymore) as you can see by
> reading:
>
>       https://gist.github.com/kappaloris/11356517
>
> In fact now it seems to me that this tool would be best implemented
> as a library (with support for all the features mentioned in the gist) to
> be used in conjunction with tabix. (If anyone wants to help me write it
> in Python, I reserved the name PrivatePy on PyPI :3. I don't want to
> commit this early to extra work so, if you like the idea, please offer
> some help, I still have a DBMS to think about :) If you want to write
> it in Ruby I can still help, ofc, no cool name tho).
>
> I prefer Python because it seems to me that Python is the language
> with the most educational value since the "private" concept is not
> private to biology alone and also it's the language I know best (and
> accordingly I can help most effectively with).
>
> Nevertheless, as I stated in my proposal, this script can also
> be implemented as a processing step during the import to the DBMS.
> Unless you're working with really huge amounts of data, you shouldn't
> expect the DBMS to be faster than a command-line utility, tho.
>
> Now, what I want to understand is how exactly VCF files constitute
> a bottleneck:
>
> 1. Regarding performance: are there other computationally heavy
>     operations (like privates once were :D )? Mixing filtering and other
>     "by row" rules doesn't really count as 'heavy', I'm talking about ugly
>     cross-referencing business.
>
> 2. Regarding current cases: what operations are really easy but made
>     tedious by lack of proper interfaces / inconsistent formats / ... that
>     this system should be expected to offer? An example would be the
>     possibility of doing a "walk-together" import of multiple VCF files.
>     This would also be extremely beneficial for making private-indexing
>     faster.
>
> 3. Regarding new cases: what new features should be considered a
>     must-have? For example, 1-click scalability? If you don't already
>     have a specific idea, don't worry, as the other details fall into
>     place I will offer some ideas depending on what the most plausible
>     solutions might offer.
>
> 4. Regarding ???: are there any other aspects that I'm missing?
>
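The "walk-together" idea in point 2 can be sketched as a k-way merge over
position-sorted VCF streams. This is only a minimal illustration under
stated assumptions, not the method described in the gist: the function
names (parse_vcf_lines, walk_together, privates) are hypothetical, a
"private" is taken to mean a (CHROM, POS, REF, ALT) record seen in exactly
one input, each input is assumed to be sorted by chromosome name (as a
string) and then by position, and multi-allelic ALT fields are compared as
whole strings.

```python
import heapq
from itertools import groupby


def parse_vcf_lines(lines, sample):
    """Yield ((chrom, pos, ref, alt), sample) for each data line of one VCF."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield (chrom, int(pos), ref, alt), sample


def walk_together(sources):
    """Lazily merge several sorted VCF streams into one sorted stream.

    `sources` maps a sample name to an iterable of VCF text lines.
    Assumes every input is already sorted by (chrom as string, pos).
    """
    streams = [parse_vcf_lines(lines, name) for name, lines in sources.items()]
    return heapq.merge(*streams)


def privates(sources):
    """Yield (variant_key, sample) for variants present in exactly one input."""
    merged = walk_together(sources)
    for key, group in groupby(merged, key=lambda rec: rec[0]):
        samples = {sample for _, sample in group}
        if len(samples) == 1:
            yield key, samples.pop()


# Tiny made-up inputs, for illustration only.
sample_a = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "1\t200\t.\tC\tT",
]
sample_b = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "1\t300\t.\tG\tA",
]

if __name__ == "__main__":
    for key, sample in privates({"a": sample_a, "b": sample_b}):
        print(sample, key)
```

Because the merge is lazy, only one pending record per input is held in
memory at a time, which is why walking the files together could help
private-indexing. Real VCF files order chromosomes by their contig header,
so a production version would need a matching sort key.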
>
> Please note that [1] is what I understand best. [2] especially is not
> easy for me: I don't work in a sequencing center so if you want to
> point something out please add a little context and don't be afraid
> to paste some example code of what you are doing now and how
> you think it should be done if you had this system already available.
>
> As of now, other than doing some more exploring of my solution for
> computing privates, this is the first hurdle to jump. Talking with
> Francesco, it seemed that privates were the biggest computational
> problem, meaning that, unless someone points to something that
> I'm missing, the focus should be more on ease of use and less on
> "raw" performance (since every DBMS has its own kinks).
>
> Next week I will start writing the blog and publish some other
> information about what I'm doing to make the development easier
> to follow.
>
>
>
> Thank you all for your time and please don't bother wasting time
> on courtesy/etiquette: if you have any objections share them as
> soon as possible, I don't mind criticism and "form" is just a prefix
> of formalism.
>
>
>
>
> 2014-04-27 8:02 GMT+02:00 Pjotr Prins <pjotr.public14 at thebird.nl>:
>
> > Hi Lori,
> >
> > Congrats from BioRuby for being accepted as a GSoC student this year!
> >
> > To all others: Lori is going to work on a scalable NoSQL VCF container
> > for BioRuby with Francesco as a primary mentor. I am pretty excited
> > about this project - VCF parsing is quite a bottleneck in many
> > sequencing centers (including ours) and a NoSQL solution may just be
> > the right idea.
> >
> > From here on we will discuss this project on this ML.
> >
> > Lori, I am on IRC this morning, if you'd like to chat.
> >
> > Pj.
> >
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>