[GSoC] BioRuby project proposal & VCF search API inquiry

Loris Cro l.cro at campus.unimib.it
Sun Mar 16 17:44:17 UTC 2014


Hello, I'm Loris Cro, a 3rd-year CS student at UNIMIB (http://unimib.it).

I'm writing both to get some more information about the official project
idea (and to suggest a possible solution while doing so) and to discuss a
different proposal.

In the interest of time I offer only a short introduction of myself.

- I have an adequate level of the required soft and technical skills (as
  described in the student guide); hopefully this email will attest to that.
- I have minor experience with the main programming paradigms and
  abstraction levels (C, Java, Ruby, LISP, Prolog), with Python being
  my programming language of choice.
- I have a good amount of practical experience with NoSQL data modelling,
  mainly from building an online RSS aggregator on Google App Engine
  (which offers a NoSQL database built on BigTable).
- Regarding my bioinformatics background, I'm only ~4.2% "bio", so I'll need
  some help understanding the actual workflows/use cases, and the semantics
  of some, if not most, data formats will not be instantly clear to me.

Now, regarding the proposals, I'll start with the new one:

Would you be interested in a GFF3 or VCF parser that runs on the GPU or,
alternatively, in a binary format for GFF3?

About the "official" idea:
What is the exact level of speed/scalability/flexibility you need?

I'll assume, from what I understood by reading the rationale, that:
- you are talking about an arbitrarily large dataset (so no Redis);
- users should be able to do complex searches, but they are not necessarily
  expected to build queries by hand. In other words, we aim to offer an easy
  (although extensible) interface that covers the most common use cases and
  lets the user filter out enough data to do the rest of the job in memory.
  For example, a mutations_diff(sample_list_A, sample_list_B) method
  (accessible via REST, of course); a small sketch follows after this list.
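
Just to make the idea concrete, here is a minimal sketch of how such an
endpoint could be exposed (Sinatra is only an example choice; the route
name, the parameters, and the variants_for helper are all hypothetical):

    # Hypothetical REST endpoint for mutations_diff (sketch only).
    require 'sinatra'
    require 'json'

    # Placeholder: would query the database and return the set of variants
    # (e.g. [chrom, pos, ref, alt] tuples) observed in the given samples.
    def variants_for(sample_list)
      []
    end

    get '/mutations_diff' do
      samples_a = params['samples_a'].to_s.split(',')
      samples_b = params['samples_b'].to_s.split(',')
      diff = variants_for(samples_a) - variants_for(samples_b)
      content_type :json
      diff.to_json
    end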

Given those assumptions, I think the solution can be broken down as follows:

[1] A component that reads VCF files and loads them into the database (a
    rough parsing sketch follows after this list).

[2] The database itself.

[3] A component that exposes the operations that the database doesn't offer.
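
To make [1] a bit more concrete, here is a rough sketch of the parsing step
(fixed VCF columns only, with INFO/FORMAT/genotype handling left out; the
file name is just a placeholder):

    # Rough sketch of the importer's parsing step (fixed VCF columns only).
    def parse_vcf_line(line)
      return nil if line.start_with?('#')  # skip meta-info and header lines
      chrom, pos, id, ref, alt, qual, filter, info = line.chomp.split("\t")
      {
        chrom: chrom, pos: pos.to_i, id: id,
        ref: ref, alt: alt.split(','),       # ALT may list several alleles
        qual: qual == '.' ? nil : qual.to_f,
        filter: filter, info: info
      }
    end

    File.foreach('example.vcf') do |line|
      record = parse_vcf_line(line)
      next unless record
      # here the record would be inserted into the database
    end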


The [3] component is the one you propose to build in JRuby or Scala.
Why would you want to build [3] entirely by hand?
If the amount of data is in the ballpark of 1000 × a few gigabytes, you
will inevitably end up with a MapReduce-style solution (please do correct
me if I'm wrong).
The items get partitioned into N groups, local filtering/aggregation is
applied, then possibly more partitioning, and finally the partial results
are joined, or something along those lines.
How about a different approach? We find a "MapReduce engine" and implement
our operations as "recipes" for that engine.

Some DBs that offer such functionality: MongoDB, CouchDB, ArangoDB.
If you want a larger-scale approach, Apache Accumulo might be a candidate.

Please check out the ArangoDB site: it offers support for both JOINs and
custom MapReduce-style scripts (written in either JS or Ruby!), plus other
amenities.

In this case we could focus on finding a good data representation, building
a fast importer, and writing the scripts for the most common operations.
ArangoDB even offers some nice scaffolding tools for the RESTful API, so
extensions would be dead easy to build (if the data representation is
decent).
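
For instance, a document-per-variant representation (the field names and
the sample layout are only a first guess on my part) might look like:

    # Hypothetical document-per-variant representation (values taken from
    # the example record in the VCF spec).
    variant_doc = {
      chrom:   '20',
      pos:     14370,
      ref:     'G',
      alt:     ['A'],
      qual:    29.0,
      filter:  'PASS',
      info:    { 'DP' => 14, 'AF' => 0.5 },
      samples: { 'NA00001' => { 'GT' => '0|0', 'DP' => 1 } }
    }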

That said, while I have more experience working with the latter kind of
project, I think I would be a decent fit for the former too. In fact, I
plan to have my proposal count as the internship required for my degree,
so I would also get some help from within my university.

Let me know; the deadline for proposals is fast approaching :)
Loris Cro


