[BioRuby] GSoC 2014 queries and inputs

Francesco Strozzi francesco.strozzi at gmail.com
Sat Mar 15 08:35:30 UTC 2014


Hi Ujjwal,
If you can, please also join the gsoc at lists.open-bio.org mailing list, as
there are other discussions going on there. To answer your questions:

1) The typical use-case I can think of is the one where you want to put
millions of SNPs from the VCF files into a database engine, to perform
queries on these data like "give me all the SNPs which have this particular
allele in these 50 samples" or "give me all the SNPs that have been
annotated for having a particular effect on the genes where they are
present" or again "give me all the SNPs that have a particular allele in
these 50 samples but have another allele in the remaining samples" etc.
Running this sort of query while keeping the data in the raw VCF file
format can be a pain; we need to make it easier :-)

2) Yes, that is the idea, so that we could have a single bundle with
everything. That said, the aim is also to provide a system that people can
host on their own. Think of it as a fast system that bioinformaticians can
deploy in one go and where they can put all the VCF data they need to start
mining the information. Maybe a bit ambitious, but it will surely be fun to
implement :-)

3) Absolutely. My suggestion would be to do the hard part in Scala, or
whatever JVM-based language you like, and then provide access to this
through JRuby by making a simple Ruby gem at the end of the project. (Go is
not JVM-based, so it would need a different bridge than JRuby.)
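
The kind of query described in point 1 can be sketched with a small
in-memory database. This is only an illustration under loud assumptions:
SQLite stands in for whatever engine is eventually chosen, the table
layout and the function names (build_db, snps_with_allele,
snps_private_to) are hypothetical, the data are made up, and each sample
is reduced to a single allele per SNP rather than a full genotype.

```python
import sqlite3

# Hypothetical toy data: (chrom, pos, sample, allele). A real loader would
# derive these values from the genotype columns of the VCF files.
CALLS = [
    ("1", 1000, "s1", "A"), ("1", 1000, "s2", "A"), ("1", 1000, "s3", "G"),
    ("1", 2000, "s1", "T"), ("1", 2000, "s2", "C"), ("1", 2000, "s3", "C"),
    ("2", 500, "s1", "A"), ("2", 500, "s2", "A"), ("2", 500, "s3", "A"),
]


def build_db(calls):
    """Load the allele calls into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls (chrom TEXT, pos INTEGER, sample TEXT, allele TEXT)"
    )
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", calls)
    return conn


def snps_with_allele(conn, allele, samples):
    """Return SNPs where every listed sample carries `allele`."""
    marks = ",".join("?" * len(samples))  # one placeholder per sample
    sql = f"""
        SELECT chrom, pos FROM calls
        WHERE allele = ? AND sample IN ({marks})
        GROUP BY chrom, pos
        HAVING COUNT(DISTINCT sample) = ?
        ORDER BY chrom, pos
    """
    return conn.execute(sql, [allele, *samples, len(samples)]).fetchall()


def snps_private_to(conn, allele, samples):
    """Return SNPs where the listed samples carry `allele` and no other does."""
    shared = snps_with_allele(conn, allele, samples)
    marks = ",".join("?" * len(samples))
    sql = f"""
        SELECT DISTINCT chrom, pos FROM calls
        WHERE allele = ? AND sample NOT IN ({marks})
    """
    elsewhere = set(conn.execute(sql, [allele, *samples]).fetchall())
    return [snp for snp in shared if snp not in elsewhere]


if __name__ == "__main__":
    conn = build_db(CALLS)
    print(snps_with_allele(conn, "A", ["s1", "s2"]))
    print(snps_private_to(conn, "A", ["s1", "s2"]))
```

The first function covers "all the SNPs which have this allele in these
samples"; the second covers "this allele in these samples but another
allele in the remaining samples". The annotation query would need an
extra table for effects, which is left out here.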

I don't know if you are interested in semantic web approaches, but the OBF
as a whole is starting to do some serious work on this, so one possibility
could also be to use a database engine that supports SPARQL as a query
language (Oracle NoSQL database can do this) so that we can put the VCF
data in it and then perform queries via SPARQL. This will of course make it
rather difficult to bundle everything (server + frontend) together but
could turn out to be a valuable approach in the long term. Of course, having
a database backend is not required at all and one can always leave the data
in the original files and provide a higher-level interface to access them.
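
For the last option, leaving the data in the VCF files and putting a
higher-level interface on top, a minimal sketch could look like the one
below. It assumes the standard tab-separated VCF column order (CHROM, POS,
ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) and
a GT entry in FORMAT. The function name iter_records, the record layout
and the example data are hypothetical, not an existing BioRuby interface.

```python
import io

# Tiny made-up VCF fragment, for illustration only.
VCF_TEXT = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n"
    "1\t1000\trs1\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\n"
    "1\t2000\trs2\tT\tC\t60\tPASS\t.\tGT\t1/1\t0/0\n"
)


def iter_records(handle):
    """Yield one dict per VCF data line, with genotypes decoded to alleles."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        # Allele index 0 is REF; 1, 2, ... are the comma-separated ALT alleles.
        alleles = [fields[3]] + fields[4].split(",")
        gt_index = fields[8].split(":").index("GT")
        genotypes = {}
        for sample, column in zip(samples, fields[9:]):
            gt = column.split(":")[gt_index]
            calls = gt.replace("|", "/").split("/")  # phased or unphased
            genotypes[sample] = [
                None if call == "." else alleles[int(call)] for call in calls
            ]
        yield {
            "chrom": fields[0],
            "pos": int(fields[1]),
            "id": fields[2],
            "genotypes": genotypes,
        }


if __name__ == "__main__":
    # Example: IDs of SNPs where at least one sample carries a "G" allele.
    for record in iter_records(io.StringIO(VCF_TEXT)):
        if any("G" in calls for calls in record["genotypes"].values()):
            print(record["id"])
```

Because the records are produced lazily, one line at a time, the same loop
works on a large file opened with open() without loading it into memory.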

All the best.
Francesco


On Fri, Mar 14, 2014 at 8:17 PM, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:

> Hi,
> My name is Ujjwal. I'm a 21-year-old student from India and interested in
> contributing to BioRuby this year. I have some questions regarding the
> project idea listed.
>
>    1. Can you give me some more use cases for this tool? Some specific
>    functional requirements we'd like to see? What we need to mine
>    determines the data structure of our persistence layer and therefore
>    which database engine to use.
>    2. When you say a RESTful API, we want to deploy this on a server with a
>    backing database together with a Ruby gem that communicates with the
>    API, right? And I presume we also want people to be able to compare
>    our hosted VCF files with their local VCF files?
>    3. Although this is a *BioRuby* project, the server doesn't necessarily
>    need to be written in Ruby, I presume? As is mentioned, Scala or JRuby
>    could be used. I would suggest we have a look at Go too.
>
> To give you some background about me: I was a GSoC intern last year for
> Ruby on Rails, where I implemented a RESTful collection routing API. I am an
> intermediate ruby programmer. I have also been interested in synthetic
> biology for about a year now and have some lab experience too so I
> understand the basics of biology and specifically genetic engineering. I am
> a computer science undergrad and have taken a course on data engineering
> too. I also have experience working with REST apis and am building one
> right now for my startup.
>
> I have been wondering about the database. I think Neo4j will be a great fit.
> It's not heavy like Oracle and does not need installation. It's portable
> and can be started and stopped easily on the machine. It has a low memory
> footprint and support for SPARQL too, although its native query language,
> Cypher, will do the trick for us right now. We can run embedded instances
> too using JRuby which are super fast. I'm the maintainer of the most
> popular Neo4j ruby bindings and also in the process of rewriting the next
> version of neo4j-core. It will allow us to make all sorts of queries and do
> data mining at an incredible speed while being incredibly portable and
> light. All logic can then reside within the gem itself and we do not need
> any backend. It should be fast enough since we'll be directly dealing with
> Java objects made available through JRuby. I have a fair idea of how fast
> this is, and it's really fast, although working with such huge files will
> bring different challenges. We don't need a database for the embedded version.
> All we need is jars which fortunately are available as a gem so all we have
> to do is include them as dependencies and our database is ready! I don't
> think it will be this easy for any other db while giving us the same speed,
> power and capabilities!
>
> I've started working on the proposal and will upload it in a couple of days
> for your feedback. This is going to be incredibly fun :)
>
> BTW what is the user base of BioRuby like? What does it lack compared to
> other bio libraries like Biopython?
>
> How much biology do I need to understand for this project or will I learn
> as we go along?
>
> --
> Thanks
> Ujjwal
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>



-- 

Francesco Strozzi


