[GSoC] GSoC 2014 queries and inputs

Ujjwal Thaakar ujjwalthaakar at gmail.com
Tue Mar 18 16:00:59 UTC 2014


I think you're right. It makes more sense to focus first on the functional
requirements of the project and then move on to implementing a cross-platform
API by writing a custom parser. It's tempting to write bindings to an existing
C library, but the costs involved need to be evaluated first.
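
To make the custom-parser option concrete, here is a rough sketch of what a
plain-Ruby parser for a single VCF data line might look like. It only assumes
the eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER,
INFO). The names VcfRecord, parse_info and parse_vcf_line are hypothetical and
not part of BioRuby, and header metadata, genotype columns and validation are
left out entirely.

```ruby
# Illustrative sketch only: VcfRecord, parse_info and parse_vcf_line are
# hypothetical names, not an existing BioRuby API.
VcfRecord = Struct.new(:chrom, :pos, :id, :ref, :alt, :qual, :filter, :info)

# Turn "NS=3;DP=14;DB" into {"NS"=>"3", "DP"=>"14", "DB"=>true}.
def parse_info(field)
  return {} if field.nil? || field == "."
  field.split(";").each_with_object({}) do |entry, info|
    key, value = entry.split("=", 2)
    # Flag entries such as "DB" carry no value; store them as true.
    info[key] = value.nil? ? true : value
  end
end

# Parse one data line; returns nil for header/comment and blank lines.
def parse_vcf_line(line)
  return nil if line.start_with?("#") || line.strip.empty?
  # Limit of 9 keeps FORMAT and sample columns together in a ninth element,
  # which this sketch ignores.
  fields = line.chomp.split("\t", 9)
  raise ArgumentError, "expected at least 8 columns" if fields.size < 8
  chrom, pos, id, ref, alt, qual, filter, info = fields
  VcfRecord.new(
    chrom,
    Integer(pos),
    id == "." ? nil : id,
    ref,
    alt.split(","),                  # multiple ALT alleles are comma-separated
    qual == "." ? nil : Float(qual), # "." means missing
    filter,
    parse_info(info)
  )
end

record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB")
puts record.pos
puts record.info["DP"]
```

Since this is plain Ruby with no native extensions, it should in principle
run unchanged on CRuby, JRuby and Rubinius; the real effort would go into
headers, genotype fields and malformed input.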


On 18 March 2014 17:38, Fields, Christopher J <cjfields at illinois.edu> wrote:

> HTSlib also has VCF support. One advantage there might be that additional
> format support could be added at some point. Not sure how the community at
> large views it, though...
>
> Chris
>
> > On Mar 18, 2014, at 4:31 AM, "Francesco Strozzi" <francesco.strozzi at gmail.com> wrote:
> >
> > Hi Ujjwal,
> > consider that BioRuby itself is only 100% compatible with CRuby and almost
> > fully compatible with JRuby (there are a few libraries which do not work).
> > The idea here should be to provide a higher-level interface to manage and
> > query VCF data, so my advice is to try not to spend too much time on
> > parsing issues and instead reuse existing code and libraries. I think we
> > can live with a JRuby-only implementation, since you also proposed to use
> > Neo4j, and the possibility of packing everything in a jar may sound
> > tempting in the end :).
> > But if you would like to implement something that can work across multiple
> > Ruby implementations, I think there are two ways:
> > 1) You can write a simple parser in plain Ruby. VCF files are just TSV
> > files, so it's pretty straightforward. But implementing a solid parser
> > that can handle every aspect of the information stored in VCF files will
> > still require some time and testing.
> > 2) You can look at existing C libraries and write a binding using the Ruby
> > FFI. This extension will be usable by both CRuby and JRuby. If this sounds
> > interesting, I suggest looking into VCFLIB
> > (https://github.com/ekg/vcflib).
> >
> > In the end these options may sound like GSoC projects on their own, so if
> > you would like to follow one or the other, I suggest you try to balance
> > this work with the rest of the things to do on the project, to build a
> > solid work plan.
> >
> > All the best.
> > Francesco
> >
> >
> > On Mon, Mar 17, 2014 at 9:37 PM, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:
> >
> >> If it's fine to have a JRuby-only implementation, then we can definitely
> >> write a thin wrapper over Picard.
> >>
> >>
> >>> On 18 March 2014 01:56, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:
> >>>
> >>> When we say BioRuby, I think it should work with Ruby: CRuby, JRuby,
> >>> Rubinius, etc. I'm not sure it's a good idea to constrain people to JRuby!
> >>>
> >>>
> >>> On 18 March 2014 01:48, Francesco Strozzi <francesco.strozzi at gmail.com> wrote:
> >>>
> >>>> I don't think it's necessary. If you would like to use JRuby, there is
> >>>> the Picard API ( http://picard.sourceforge.net ) which you can reuse
> >>>> right away. It's fast and well tested.
> >>>>
> >>>> All the best.
> >>>> Francesco
> >>>> On 17 Mar 2014 20:38, "Ujjwal Thaakar" <ujjwalthaakar at gmail.com> wrote:
> >>>>
> >>>>> Would we have to write a new VCF parser in Ruby?
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On 15 March 2014 17:33, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>> My name is Ujjwal. I'm a 21-year-old student from India, interested in
> >>>>>> contributing to BioRuby this year. I have a few queries regarding the
> >>>>>> project idea listed.
> >>>>>>
> >>>>>>   1. Can you give me some more use cases for this tool, and some
> >>>>>>   specific functional requirements we'd like to see? What we need to
> >>>>>>   mine determines the data structure of our persistence layer and
> >>>>>>   therefore which database engine to use.
> >>>>>>   2. When you say a RESTful API, we want to deploy this on a server
> >>>>>>   with a backing database, together with a Ruby gem that communicates
> >>>>>>   with the API, right? And I presume we also want people to be able to
> >>>>>>   compare our hosted VCF files with their local VCF files?
> >>>>>>   3. Although this is a *BioRuby* project, the server doesn't
> >>>>>>   necessarily need to be written in Ruby, I presume? As is mentioned,
> >>>>>>   Scala or JRuby could be used. I would suggest we have a look at Go
> >>>>>>   too.
> >>>>>>
> >>>>>> To give you some background about me: I was a GSoC intern last year
> >>>>>> for Ruby on Rails, where I implemented a RESTful collection routing
> >>>>>> API. I am an intermediate Ruby programmer. I have also been interested
> >>>>>> in synthetic biology for about a year now and have some lab experience
> >>>>>> too, so I understand the basics of biology and specifically genetic
> >>>>>> engineering. I am a computer science undergrad and have taken a course
> >>>>>> on data engineering too. I also have experience working with REST APIs
> >>>>>> and am building one right now for my startup.
> >>>>>>
> >>>>>> I have been wondering about the database. I think Neo4j will be a great
> >>>>>> fit. It's not heavy like Oracle and does not need installation. It's
> >>>>>> portable and can be started and stopped easily on the machine. It has a
> >>>>>> low memory footprint and support for SPARQL too, although its native
> >>>>>> query language, Cypher, will do the trick for us right now. We can run
> >>>>>> embedded instances too using JRuby, which are super fast. I'm the
> >>>>>> maintainer of the most popular Neo4j Ruby bindings and am also in the
> >>>>>> process of rewriting the next version of neo4j-core. It will allow us
> >>>>>> to make all sorts of queries and do data mining at an incredible speed
> >>>>>> while being incredibly portable and light. All logic can then reside
> >>>>>> within the gem itself and we do not need any backend. It should be fast
> >>>>>> enough since we'll be dealing directly with Java objects made available
> >>>>>> through JRuby. I have a fair idea of how fast this is, and it's really
> >>>>>> fast, although working with such huge files will have different
> >>>>>> challenges. We don't need a database installation for the embedded
> >>>>>> version. All we need is jars, which fortunately are available as a gem,
> >>>>>> so all we have to do is include them as dependencies and our database
> >>>>>> is ready! I don't think it will be this easy for any other DB while
> >>>>>> giving us the same speed, power and capabilities!
> >>>>>>
> >>>>>> I've started working on the proposal and will upload it in a couple of
> >>>>>> days for your feedback. This is going to be incredibly fun :)
> >>>>>>
> >>>>>> BTW, what is the user base of BioRuby like? What does it lack compared
> >>>>>> with other bio libraries like Biopython?
> >>>>>>
> >>>>>> How much biology do I need to understand for this project, or will I
> >>>>>> learn as we go along?
> >>>>>>
> >>>>>> --
> >>>>>> Thanks
> >>>>>> Ujjwal
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Thanks
> >>>>> Ujjwal
> >>>>> _______________________________________________
> >>>>> GSoC mailing list
> >>>>> GSoC at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/gsoc
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Thanks
> >>> Ujjwal
> >>>
> >>
> >>
> >>
> >> --
> >> Thanks
> >> Ujjwal
> >>
> >
> >
> >
> > --
> >
> > Francesco Strozzi
>



-- 
Thanks
Ujjwal


