[BioRuby] GSoC 2014 queries and inputs

Sat Mar 15 15:36:38 UTC 2014

+1 for RDF.

On Sat, Mar 15, 2014 at 09:35:30AM +0100, Francesco Strozzi wrote:
> Hi Ujjwal,
> If you can, please also join the gsoc at lists.open-bio.org mailing list, as
> there are other discussions going on. To answer your questions:
> 
> 1) The typical use-case I can think of is the one where you want to put
> millions of SNPs from the VCF files into a database engine, to perform
> queries on these data like "give me all the SNPs which have this particular
> allele in these 50 samples" or "give me all the SNPs that have been
> annotated for having a particular effect on the genes where they are
> present" or again "give me all the SNPs that have a particular allele in
> these 50 samples but have another allele in the remaining samples" etc.
> Running this sort of query while keeping the data in the raw VCF file
> format can be a pain; we need to make it easier :-)
> 
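A minimal in-memory sketch of the first and third example queries above, in plain Ruby. Everything here is an assumption made for illustration: the `Snp` struct, the helper names and the tiny parser are hypothetical and are not BioRuby's API, and the parser assumes tab-separated data lines that carry a GT field. A real system would push these filters down into whatever storage engine is chosen.

```ruby
# Hypothetical helpers for illustration only; not part of BioRuby.
VCF_FIXED_COLUMNS = 9 # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

Snp = Struct.new(:chrom, :pos, :ref, :alts, :genotypes)

# Return the sample names from the "#CHROM ..." header line.
def parse_sample_names(header_line)
  header_line.chomp.split("\t").drop(VCF_FIXED_COLUMNS)
end

# Parse one data line. genotypes maps sample name => array of allele strings
# (nil for a missing call). Assumes the FORMAT column contains GT.
def parse_snp(line, sample_names)
  fields = line.chomp.split("\t")
  chrom, pos, _id, ref, alt = fields[0, 5]
  alleles = [ref] + alt.split(",")
  gt_index = fields[8].split(":").index("GT")
  genotypes = {}
  sample_names.zip(fields.drop(VCF_FIXED_COLUMNS)).each do |name, column|
    calls = column.split(":")[gt_index].split(%r{[/|]})
    genotypes[name] = calls.map { |c| c == "." ? nil : alleles[c.to_i] }
  end
  Snp.new(chrom, pos.to_i, ref, alleles.drop(1), genotypes)
end

# "Give me all the SNPs which have this allele in these samples."
def snps_with_allele(snps, allele, samples)
  snps.select do |snp|
    samples.all? { |s| snp.genotypes.fetch(s).include?(allele) }
  end
end

# "... but the remaining samples do not carry that allele."
def snps_private_to(snps, allele, samples)
  snps_with_allele(snps, allele, samples).select do |snp|
    (snp.genotypes.keys - samples).none? { |s| snp.genotypes[s].include?(allele) }
  end
end

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
lines = [
  "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\t0/0",
  "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t1/1\t0/1\t0/1"
]
names = parse_sample_names(header)
snps = lines.map { |l| parse_snp(l, names) }

p snps_private_to(snps, "G", %w[S1 S2]).map(&:pos) # prints [100]
```

With millions of SNPs a linear scan like this is exactly what becomes painful, which is the case for an indexed backend.
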
> 2) Yes, that is the idea, so that we could have a single bundle with
> everything. The aim is also to provide a system that people can host on
> their own. Think of it as a fast system that bioinformaticians can deploy
> in one step and where they can put all the VCF data they need to start
> mining the information. Maybe a bit ambitious, but it will surely be fun
> to implement :-)
> 
> 3) Absolutely, my suggestion would be to do the hard part in Scala or
> whatever other JVM-based language you like (Go does not target the JVM)
> and then provide access to this through JRuby by making a simple Ruby gem
> at the end of the project.
> 
> I don't know if you are interested in semantic web approaches, but the OBF
> as a whole is starting to do some serious work on this, so one possibility
> could also be to use a database engine that supports SPARQL as a query
> language (Oracle NoSQL database can do this) so that we can put the VCF
> data in it and then perform queries via SPARQL. This will of course make it
> rather difficult to bundle everything (server + frontend) together but
> could turn out to be a valuable approach in the long term. Of course having
> a database backend is not required at all and one can always leave the data
> in the original files and provide a higher-level interface to access them.
> 
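To make the SPARQL idea above a little more concrete, here is a rough sketch of turning one variant into N-Triples lines that a triple store could load, plus an example query string. The `http://example.org/vcf/` namespace, the predicate names and `variant_to_ntriples` are all made up for illustration; they are not an established VCF ontology or an existing library call, and the sketch assumes chromosome and allele strings need no IRI escaping.

```ruby
# Hypothetical namespace and predicates, chosen only for this sketch.
VCF_NS = "http://example.org/vcf/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Serialise one variant (a Hash with :chrom, :pos, :ref, :alt) as an array
# of N-Triples lines, one triple per line.
def variant_to_ntriples(variant)
  id = [variant[:chrom], variant[:pos], variant[:ref], variant[:alt]].join("-")
  subject = "<#{VCF_NS}variant/#{id}>"
  [
    "#{subject} <#{VCF_NS}chromosome> \"#{variant[:chrom]}\" .",
    "#{subject} <#{VCF_NS}position> \"#{variant[:pos]}\"^^<#{XSD_INTEGER}> .",
    "#{subject} <#{VCF_NS}referenceAllele> \"#{variant[:ref]}\" .",
    "#{subject} <#{VCF_NS}alternateAllele> \"#{variant[:alt]}\" ."
  ]
end

# A SPARQL query over the vocabulary above: all variants whose alternate
# allele is G, with their positions. It is only a string here; sending it to
# an endpoint is left out.
ALT_G_QUERY = <<~SPARQL
  PREFIX vcf: <#{VCF_NS}>
  SELECT ?variant ?pos WHERE {
    ?variant vcf:alternateAllele "G" ;
             vcf:position ?pos .
  }
SPARQL

puts variant_to_ntriples({ chrom: "1", pos: 100, ref: "A", alt: "G" })
```

Per-sample genotypes would need their own resources and predicates, which is where an agreed vocabulary would matter most.
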
> All the best.
> Francesco
> 
> 
> On Fri, Mar 14, 2014 at 8:17 PM, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:
> 
> > Hi,
> > My name is Ujjwal. I'm a 21-year-old student from India and interested in
> > contributing to BioRuby this year. I have some queries regarding the
> > project idea listed.
> >
> >    1. Can you give me some more use cases for this tool, and some specific
> >    functional requirements we'd like to see? What we need to mine determines
> >    the data structure of our persistence layer and therefore which database
> >    engine to use.
> >    2. When you say a RESTful API, we want to deploy this on a server with a
> >    backing database, together with a Ruby gem that communicates with the API,
> >    right? And I presume we also want people to be able to compare our hosted
> >    VCF files with their local VCF files?
> >    3. Although this is a *BioRuby* project, the server doesn't necessarily
> >    need to be written in Ruby, I presume? As is mentioned, Scala or JRuby
> >    could be used. I would suggest we have a look at Go too.
> >
> > To give you some background about me: I was a GSoC intern last year for
> > Ruby on Rails, where I implemented a RESTful collection routing API. I am an
> > intermediate Ruby programmer. I have also been interested in synthetic
> > biology for about a year now and have some lab experience too, so I
> > understand the basics of biology and specifically genetic engineering. I am
> > a computer science undergrad and have taken a course on data engineering
> > too. I also have experience working with REST APIs and am building one
> > right now for my startup.
> >
> > I have been wondering about the database. I think Neo4j would be a great
> > fit. It's not heavy like Oracle and does not need installation. It's portable
> > and can be started and stopped easily on the machine. It has a low memory
> > footprint and support for SPARQL too, although its native query language,
> > Cypher, will do the trick for us right now. We can run embedded instances
> > too using JRuby, which are super fast. I'm the maintainer of the most
> > popular Neo4j Ruby bindings and also in the process of rewriting the next
> > version of neo4j-core. It will allow us to make all sorts of queries and do
> > data mining at an incredible speed while being incredibly portable and
> > light. All logic can then reside within the gem itself and we do not need
> > any backend. It should be fast enough since we'll be dealing directly with
> > Java objects made available through JRuby. I have a fair idea of how fast
> > this is, and it's really fast, although working with such huge files will
> > bring different challenges. We don't need a separate database server for the
> > embedded version. All we need are the jars, which fortunately are available
> > as a gem, so all we have to do is include them as dependencies and our
> > database is ready! I don't think it will be this easy for any other db while
> > giving us the same speed, power and capabilities!
> >
> > I've started working on the proposal and will upload it in a couple of days
> > for your feedback. This is going to be incredibly fun :)
> >
> > BTW, what is the user base of BioRuby like? What does it lack compared with
> > other bio libraries like Biopython?
> >
> > How much biology do I need to understand for this project or will I learn
> > as we go along?
> >
> > --
> > Thanks
> > Ujjwal
> > _______________________________________________
> > BioRuby Project - http://www.bioruby.org/
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby
> >
> 
> 
> 
> -- 
> 
> Francesco Strozzi