[GSoC] GSoC 2014 BioRuby

Francesco Strozzi francesco.strozzi at gmail.com
Mon Mar 17 09:14:08 UTC 2014


Hi Razvan,
have a look at the org.broadinstitute.variant.vcf package and the
org.broadinstitute.variant.variantcontext.VariantContext class within the
Picard API. These are used to read from a VCF file; to write a VCF you
also need the org.broadinstitute.variant.variantcontext.writer package.
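
To make "retrieving SNPs from a VCF" concrete, here is a minimal sketch in plain Ruby. It does not use the Picard classes at all; it only illustrates the idea that a record counts as a SNP when REF and every ALT allele are a single base. The names Snp, BASES, snps_from_vcf and the sample data are hypothetical, and a real wrapper would delegate parsing to Picard instead.

```ruby
# Minimal sketch in plain Ruby (no Picard): list the SNP records in VCF text.
# Snp, BASES and snps_from_vcf are hypothetical names used for illustration.
require "stringio"

Snp = Struct.new(:chrom, :pos, :id, :ref, :alts)

BASES = %w[A C G T].freeze

def snps_from_vcf(io)
  io.each_line.filter_map do |line|
    next if line.start_with?("#")      # skip meta-information and header lines
    fields = line.chomp.split("\t")
    next if fields.size < 5            # not a complete data line
    chrom, pos, id, ref, alt = fields
    alts = alt.split(",")
    # Keep the record only if REF and every ALT allele is a single base.
    next unless BASES.include?(ref.upcase) &&
                alts.all? { |a| BASES.include?(a.upcase) }
    Snp.new(chrom, pos.to_i, id, ref, alts)
  end
end

# Made-up example data: one SNP, one deletion, one multi-allelic SNP.
VCF_TEXT = <<~VCF
  ##fileformat=VCFv4.1
  #CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
  1\t100\trs1\tA\tG\t50\tPASS\t.
  1\t200\t.\tAT\tA\t50\tPASS\t.
  2\t300\trs3\tC\tG,T\t50\tPASS\t.
VCF

snps = snps_from_vcf(StringIO.new(VCF_TEXT))
snps.each { |s| puts "#{s.chrom}:#{s.pos} #{s.ref}>#{s.alts.join(',')}" }
```

The deletion at position 200 is skipped because its REF allele is two bases long. The sketch needs Ruby 2.7 or later for filter_map.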

Hope this helps a bit. The docs are not very helpful in pointing out
what each library does, so you will need to dig a bit on Google as well :-)

All the best.
Francesco



On Sat, Mar 15, 2014 at 11:16 PM, Razvan Florea
<razvan.florea91 at gmail.com> wrote:

> Hi Francesco,
>
> I am trying to write the Picard wrapper you recommended.
> I created a repository on GitHub at [1]. Right now the repository contains
> a simple JRuby script that uses a Picard class to convert between
> "vcf" and "bcf" files.
>
> I couldn't find classes for retrieving SNPs from VCF files. Could you
> please point me to some information about that?
>
> [1] https://github.com/razvanflorea/picard-jruby-wrapper
>
> Best,
> Razvan
>
>
> 2014-03-15 10:17 GMT+01:00 Francesco Strozzi <francesco.strozzi at gmail.com>:
>
>> Hi Razvan,
>>
>> 1) I think having a client would be nice, of course, but I would not
>> consider it critical. Building a client around a REST API is pretty
>> straightforward in any language.
>>
>> 2) Yes, of course. Look also at the Picard (http://picard.sourceforge.net/)
>> library. This is the low-level API to access VCF and other files, and GATK
>> relies heavily on it to fetch the data out of raw files.
>>
>> 3) If you have some code on GitHub or another repo that you would like to
>> show us, that's fine. Otherwise you could spend a bit of time writing a
>> simple JRuby wrapper for Picard that accesses a VCF file and retrieves a
>> list of SNPs. This could be a pet project to start wrapping your head
>> around these libraries while also spending some time with JRuby.
>>
>> All the best.
>> Francesco
>>
>>
>>
>>
>> On Fri, Mar 14, 2014 at 6:50 PM, Razvan Florea <razvan.florea91 at gmail.com> wrote:
>>
>>> Hello Francesco,
>>>
>>> 1. The queries will be made through HTTP requests (basically GET and
>>> POST). But does the project also include writing a client for the web
>>> service?
>>> 2. I think using the GATK framework is absolutely necessary because, even
>>> if we choose to use a database engine, the VCF files have to be migrated
>>> to the database, which I think can be done with this framework. Am I right?
>>> 3. Meanwhile, do you think I can contribute somehow to show my skills
>>> and my willingness to work on this project this summer?
>>>
>>> Best,
>>> Razvan
>>>
>>>
>>> 2014-03-14 14:43 GMT+01:00 Francesco Strozzi <francesco.strozzi at gmail.com>:
>>>
>>>> Hi Razvan,
>>>> the general idea is to have an interface which lets you run
>>>> queries on top of the data stored in VCF files.
>>>> For example, in a typical scenario one could ask to retrieve all the
>>>> variations which are exclusively present in 20 samples out of a dataset
>>>> of 100 samples.
>>>> An API could then expose a method which takes a list of sample names
>>>> plus other conditions and returns, for instance, a JSON document with all
>>>> the variations fulfilling the query.
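
The scenario above can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions: "exclusively present" is read here as "carried by every requested sample and by no other sample", the data sits in a hypothetical in-memory hash rather than in VCF files or a database, and the names CARRIERS and exclusive_variants are made up for the example.

```ruby
require "json"

# Hypothetical in-memory layout: variant key => names of the samples carrying it.
CARRIERS = {
  "1:100:A>G" => %w[s1 s2],
  "1:250:C>T" => %w[s1 s2 s3],
  "2:300:G>A" => %w[s2],
  "3:400:T>C" => %w[s1 s2]
}.freeze

# Return the variants carried by every sample in `samples` and by no other
# sample (one possible reading of "exclusively present").
def exclusive_variants(carriers, samples)
  wanted = samples.uniq.sort
  carriers.select { |_variant, names| names.uniq.sort == wanted }.keys
end

result = exclusive_variants(CARRIERS, %w[s2 s1])

# Serialize the answer the way a REST endpoint might return it.
json = JSON.generate({ "samples" => %w[s1 s2], "variants" => result })
puts json
```

A real service would also have to accept the "other conditions" mentioned above, and would read the carrier information from the VCF files or from a database instead of a constant.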
>>>>
>>>> Whether or not a database engine is used may depend on how you
>>>> would like to implement the whole thing. One can also imagine not storing
>>>> anything in a database and just accessing the data from the VCF files while
>>>> providing a higher-level interface. In this case I'd suggest that you and
>>>> other students interested in the topic also explore the GATK framework (
>>>> https://github.com/broadgsa/gatk,
>>>> http://www.broadinstitute.org/gatk/guide/topic?name=developer-zone)
>>>> since it exposes a number of modules called walkers that should make
>>>> life easier when accessing and traversing VCF files.
>>>>
>>>> JRuby sounds about right, as you'll have the typical Ruby flexibility
>>>> to quickly prototype new things while having the ability to include Java
>>>> code (GATK is written in Java and Scala BTW).
>>>>
>>>> Cheers
>>>> Francesco
>>>>
>>>>
>>>> On Thu, Mar 13, 2014 at 9:58 AM, Razvan Florea <
>>>> razvan.florea91 at gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> My name is Razvan Florea and I am studying Computing Science at the
>>>>> University of Groningen, Netherlands.
>>>>> I am writing to express my interest in the BioRuby GSoC project: "An
>>>>> ultra-fast scalable RESTful API to query large numbers of genomic
>>>>> variations". Currently I am doing my bachelor thesis project, which is
>>>>> also about developing a RESTful API.
>>>>>
>>>>> As Francesco recommended, I took a look at the links in the
>>>>> proposal text and at the proposal itself, and so far I understand that
>>>>> the basic idea of the project is to replace the manipulation of
>>>>> information from VCF files with manipulation of information from a
>>>>> database which will reside on a web service. Am I right?
>>>>> If so, what do you expect the API to be capable of? Is retrieving
>>>>> JSON documents with information enough, or is there more to it?
>>>>>
>>>>> Also, would Rails on JRuby be a good choice of technology for
>>>>> developing the web service?
>>>>>
>>>>> Please give me any information you think could be helpful.
>>>>>
>>>>> Thank you,
>>>>> Razvan
>>>>> _______________________________________________
>>>>> GSoC mailing list
>>>>> GSoC at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/gsoc
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Francesco Strozzi
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Francesco Strozzi
>>
>
>


-- 

Francesco Strozzi


