[GSoC] GSoC 2014 queries and inputs

Ujjwal Thaakar ujjwalthaakar at gmail.com
Tue Mar 18 19:44:01 UTC 2014


What's the difference between SAM and VCF?


On 18 March 2014 21:30, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:

> I think you're right. It makes more sense to focus first on the functional
> requirements of the project and then move on to implementing a
> cross-platform API by writing a custom parser. It's tempting to write
> bindings to an existing C library, but the costs involved need to be
> evaluated.
>
>
> On 18 March 2014 17:38, Fields, Christopher J <cjfields at illinois.edu> wrote:
>
>> HTSlib also has VCF support.  One advantage there might be that
>> additional format support could be added at some point.  Not sure how the
>> community at large views it though...
>>
>> Chris
>>
>> Sent from my iPad
>>
>> > On Mar 18, 2014, at 4:31 AM, "Francesco Strozzi" <francesco.strozzi at gmail.com> wrote:
>> >
>> > Hi Ujjwal,
>> > consider that BioRuby itself is only 100% compatible with CRuby and almost
>> > fully compatible with JRuby (there are a few libraries which do not work).
>> > The idea here should be to provide a higher-level interface to manage and
>> > query VCF data, so my advice is to try not to spend too much time on
>> > parsing issues and instead reuse existing code and libraries. I think we
>> > can live with a JRuby-only implementation, since you also proposed to use
>> > Neo4j, and the possibility of packing everything in a jar may sound
>> > tempting in the end :).
>> > But if you would like to implement something that can work across multiple
>> > Ruby implementations, I think there are two ways:
>> > 1) You can write a simple parser in plain Ruby. VCF files are just TSV
>> > files, so it's pretty straightforward, but implementing a solid parser
>> > that can handle every aspect of the information stored in VCF files will
>> > still require some time and testing.
>> > 2) You can look at existing C libraries and write a binding using the Ruby
>> > FFI. This extension will be usable by both CRuby and JRuby. If this sounds
>> > interesting, I suggest looking into VCFLIB (https://github.com/ekg/vcflib).
>> >
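
As a rough illustration of option 1 above, here is a minimal sketch of a
plain-Ruby parser for a single VCF data line. It assumes tab-separated fields
and the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER,
INFO). The method names and the layout of the returned Hash are hypothetical,
and the sketch ignores genotype columns, header metadata and validation, which
is where most of the effort in a solid parser would likely go.

```ruby
# Sketch only: names and Hash layout are illustrative assumptions,
# not part of BioRuby or any existing gem.
VCF_FIXED_COLUMNS = %i[chrom pos id ref alt qual filter info].freeze

# Turn an INFO field such as "NS=3;DP=14;DB" into a Hash.
def parse_vcf_info(field)
  return {} if field.nil? || field == "."

  field.split(";").each_with_object({}) do |entry, info|
    key, value = entry.split("=", 2)
    # Flags such as "DB" carry no value; record them as true.
    info[key] = value.nil? ? true : value
  end
end

# Parse one data line; header/comment lines and blank lines return nil.
def parse_vcf_line(line)
  return nil if line.start_with?("#") || line.strip.empty?

  fields = line.chomp.split("\t")
  record = VCF_FIXED_COLUMNS.zip(fields.first(8)).to_h
  record[:pos]  = Integer(record[:pos])      # 1-based position
  record[:alt]  = record[:alt].split(",")    # ALT may list several alleles
  record[:info] = parse_vcf_info(record[:info])
  record
end

line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB"
p parse_vcf_line(line)
```

Streaming a file through this one line at a time (for example with
File.foreach) would keep memory use flat on large inputs.
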
>> > In the end, these options may sound like GSoC projects on their own, so if
>> > you would like to follow one or the other, I suggest you try to balance
>> > this work with the rest of the things to do on the project, to build a
>> > solid work plan.
>> >
>> > All the best.
>> > Francesco
>> >
>> >
>> > On Mon, Mar 17, 2014 at 9:37 PM, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:
>> >
>> >> If it's fine to have a JRuby-only implementation, then we can definitely
>> >> write a thin wrapper over Picard.
>> >>
>> >>
>> >>> On 18 March 2014 01:56, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:
>> >>>
>> >>> When we say BioRuby I think it should work with Ruby - CRuby, JRuby,
>> >>> Rubinius etc. I'm not sure it's a good idea to constrain people to JRuby!
>> >>>
>> >>>
>> >>> On 18 March 2014 01:48, Francesco Strozzi <francesco.strozzi at gmail.com> wrote:
>> >>>
>> >>>> I don't think it's necessary. If you would like to use JRuby, there is
>> >>>> the Picard API ( http://picard.sourceforge.net ) which you can reuse
>> >>>> right away. It's fast and well tested.
>> >>>>
>> >>>> All the best.
>> >>>> Francesco
>> >>>> On 17 Mar 2014 20:38, "Ujjwal Thaakar" <ujjwalthaakar at gmail.com> wrote:
>> >>>>
>> >>>>> Would we have to write a new VCF parser in Ruby?
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> On 15 March 2014 17:33, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>> My name is Ujjwal. I'm a 21-year-old student from India and am
>> >>>>>> interested in contributing to BioRuby this year. I have some questions
>> >>>>>> about the project idea listed.
>> >>>>>>
>> >>>>>>   1. Can you give me some more use cases for this tool, and some
>> >>>>>>   specific functional requirements we'd like to see? What we need to
>> >>>>>>   mine determines the data structure of our persistence layer and
>> >>>>>>   therefore which database engine to use.
>> >>>>>>   2. When you say a RESTful API, we want to deploy this on a server with
>> >>>>>>   a backing database, together with a Ruby gem that communicates with
>> >>>>>>   the API, right? And I presume we also want people to be able to
>> >>>>>>   compare our hosted VCF files with their local VCF files.
>> >>>>>>   3. Although this is a *BioRuby* project, the server doesn't
>> >>>>>>   necessarily need to be written in Ruby, I presume? As mentioned,
>> >>>>>>   Scala or JRuby could be used. I would suggest we have a look at Go
>> >>>>>>   too.
>> >>>>>>
>> >>>>>> To give you some background about me: I was a GSoC intern last year for
>> >>>>>> Ruby on Rails, where I implemented a RESTful collection routing API. I
>> >>>>>> am an intermediate Ruby programmer. I have also been interested in
>> >>>>>> synthetic biology for about a year now and have some lab experience
>> >>>>>> too, so I understand the basics of biology and specifically genetic
>> >>>>>> engineering. I am a computer science undergrad and have taken a course
>> >>>>>> on data engineering too. I also have experience working with REST APIs
>> >>>>>> and am building one right now for my startup.
>> >>>>>>
>> >>>>>> I have been thinking about the database. I think Neo4j will be a great
>> >>>>>> fit. It's not heavy like Oracle and does not need installation. It's
>> >>>>>> portable and can be started and stopped easily on the machine. It has a
>> >>>>>> low memory footprint and support for SPARQL too, although its native
>> >>>>>> query language, Cypher, will do the trick for us right now. We can run
>> >>>>>> embedded instances too using JRuby, which are super fast. I'm the
>> >>>>>> maintainer of the most popular Neo4j Ruby bindings and am also in the
>> >>>>>> process of rewriting the next version of neo4j-core. It will allow us
>> >>>>>> to make all sorts of queries and do data mining at an incredible speed
>> >>>>>> while being incredibly portable and light. All logic can then reside
>> >>>>>> within the gem itself and we do not need any backend. It should be fast
>> >>>>>> enough since we'll be dealing directly with Java objects made available
>> >>>>>> through JRuby. I have a fair idea of how fast this is, and it's really
>> >>>>>> fast, although working with such huge files will have different
>> >>>>>> challenges. We don't need a database for the embedded version. All we
>> >>>>>> need is jars, which fortunately are available as a gem, so all we have
>> >>>>>> to do is include them as dependencies and our database is ready! I
>> >>>>>> don't think it will be this easy for any other db while giving us the
>> >>>>>> same speed, power and capabilities!
>> >>>>>>
>> >>>>>> I've started working on the proposal and will upload it in a couple of
>> >>>>>> days for your feedback. This is going to be incredibly fun :)
>> >>>>>>
>> >>>>>> BTW what is the user base of BioRuby like? What does it lack compared
>> >>>>>> with other bio libraries like Biopython?
>> >>>>>>
>> >>>>>> How much biology do I need to understand for this project, or will I
>> >>>>>> learn as we go along?
>> >>>>>>
>> >>>>>> --
>> >>>>>> Thanks
>> >>>>>> Ujjwal
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Thanks
>> >>>>> Ujjwal
>> >>>>> _______________________________________________
>> >>>>> GSoC mailing list
>> >>>>> GSoC at lists.open-bio.org
>> >>>>> http://lists.open-bio.org/mailman/listinfo/gsoc
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> Thanks
>> >>> Ujjwal
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Thanks
>> >> Ujjwal
>> >>
>> >
>> >
>> >
>> > --
>> >
>> > Francesco Strozzi
>>
>
>
>
> --
> Thanks
> Ujjwal
>



-- 
Thanks
Ujjwal


