[BioRuby] GSoC 2014 queries and inputs

Fri Mar 14 19:17:45 UTC 2014

Hi,
My name is Ujjwal. I'm a 21-year-old student from India and I'm interested
in contributing to BioRuby this year. I have a few queries regarding the
listed project idea.

   1. Can you give me some more use cases for this tool, and some specific
   functional requirements we'd like to see? What we need to mine determines
   the data structure of our persistence layer and therefore which database
   engine to use.
   2. When you say a RESTful API, we want to deploy this on a server with a
   backing database, together with a Ruby gem that communicates with the
   API, right? And I presume we also want people to be able to compare our
   hosted VCF files with their local VCF files?
   3. Although this is a *BioRuby* project, the server doesn't necessarily
   need to be written in Ruby, I presume? As mentioned, Scala or JRuby could
   be used. I would suggest we have a look at Go too.
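
To make the comparison in point 2 concrete, here is a rough sketch of what
comparing a hosted VCF file with a local one could look like in plain Ruby.
It is only an illustration under my own assumptions: the helper names
(variant_keys, compare_vcf) are hypothetical, variants are matched purely on
CHROM, POS, REF and ALT, and the files are assumed small enough to hold in
memory, which will not be true for real data.

```ruby
require "set"

# Hypothetical helper: reduce each VCF data line to [chrom, pos, ref, alt].
# Assumes tab-separated columns in the standard order
# CHROM, POS, ID, REF, ALT, ...
def variant_keys(vcf_text)
  keys = Set.new
  vcf_text.each_line do |line|
    line = line.chomp
    next if line.empty? || line.start_with?("#") # skip meta and header lines
    chrom, pos, _id, ref, alt = line.split("\t")
    next if alt.nil?                             # skip malformed short lines
    # A multi-allelic record (ALT "T,G") becomes one key per alternate allele.
    alt.split(",").each { |allele| keys << [chrom, pos.to_i, ref, allele] }
  end
  keys
end

# Hypothetical helper: set-based comparison of two VCF texts.
def compare_vcf(hosted_text, local_text)
  hosted = variant_keys(hosted_text)
  local  = variant_keys(local_text)
  {
    shared:      hosted & local,
    only_hosted: hosted - local,
    only_local:  local - hosted
  }
end

# Tiny made-up example records, not real data.
hosted_vcf = [
  "##fileformat=VCFv4.1",
  %w[1 100 . A G . . .].join("\t"),
  %w[1 200 . C T,G . . .].join("\t")
].join("\n")

local_vcf = [
  %w[1 100 . A G . . .].join("\t"),
  %w[2 300 . G A . . .].join("\t")
].join("\n")

result = compare_vcf(hosted_vcf, local_vcf)
result.each { |name, set| puts "#{name}: #{set.size}" }
```

A real implementation would have to stream the files and probably normalise
variant representation first, which is part of why the choice of persistence
layer matters.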

To give you some background about me: I was a GSoC intern last year for Ruby
on Rails, where I implemented a RESTful collection routing API. I am an
intermediate Ruby programmer. I have also been interested in synthetic
biology for about a year now and have some lab experience, so I understand
the basics of biology and specifically genetic engineering. I am a computer
science undergrad and have taken a course on data engineering. I also have
experience working with REST APIs and am building one right now for my
startup.

I have been thinking about the database, and I think Neo4j would be a great
fit. It's not heavy like Oracle and does not need installation. It's portable
and can be started and stopped easily on the machine. It has a low memory
footprint and support for SPARQL too, although its native query language,
Cypher, will do the trick for us right now. We can also run embedded
instances using JRuby, which are super fast. I'm the maintainer of the most
popular Neo4j Ruby bindings and am also in the process of rewriting the next
version of neo4j-core. Neo4j will allow us to make all sorts of queries and
do data mining at an incredible speed while being incredibly portable and
light. All logic can then reside within the gem itself, and we do not need
any backend. It should be fast enough since we'll be dealing directly with
Java objects made available through JRuby. I have a fair idea of how fast
this is, and it's really fast, although working with such huge files will
bring different challenges. We don't need a separate database server for the
embedded version. All we need is the jars, which fortunately are available
as a gem, so all we have to do is include them as dependencies and our
database is ready! I don't think it would be this easy with any other DB
while giving us the same speed, power and capabilities!

I've started working on the proposal and will upload it in a couple of days
for your feedback. This is going to be incredibly fun :)

BTW, what is the user base of BioRuby like? What does it lack compared with
other bio libraries such as Biopython?

How much biology do I need to understand for this project, or will I learn
as we go along?

-- 
Thanks
Ujjwal