[BioRuby] GSOC

Pjotr Prins pjotr.public14 at thebird.nl
Mon May 12 10:00:28 UTC 2014


Not completely accurate. Variant callers do take liberties. I have
just had a VarScan2 result which was rejected by Cartagenia. Obviously
the latter is not flexible in what it accepts ;). Turns out it does
not allow the variant field to contain multiple nucleotides.

My main point is that, if you want your software to be generally
useful, you cannot predict what liberties programmers take. That is
in fact the secret of the success of 'flexible' formats in
bioinformatics - think VCF, FASTA, GFF3, SAM etc. The trick is to have
minimal guidelines on what you expect - but don't become rigorous or,
if you can't resist being rigorous, make it so that you can switch it
off. With bio-vcf I have added a --ignore-errors option for that very
reason.
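
To make the "rigorous, but you can switch it off" idea concrete, here is a
minimal sketch in Python. It is not how bio-vcf implements --ignore-errors;
the names parse_record, parse_info, Variant, VCFFormatError and the strict
flag are made up for illustration, and the single-nucleotide rule is only a
stand-in for the kind of check a strict consumer might apply.

```python
from dataclasses import dataclass, field


class VCFFormatError(ValueError):
    """Raised when a record breaks one of the (hypothetical) strict rules."""


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: list
    info: dict = field(default_factory=dict)


def parse_info(text):
    """Split an INFO column into a dict; keys without '=' are kept as flags."""
    info = {}
    if text in ("", "."):
        return info
    for item in text.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_record(line, strict=False):
    """Parse one tab-separated VCF data line.

    Only the minimum is always enforced (at least CHROM..ALT present).
    The extra rule below is an assumption for illustration and runs
    only when strict=True.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 5:
        raise VCFFormatError("expected at least 5 tab-separated columns")
    chrom, pos, _id, ref, alt = cols[:5]
    alts = alt.split(",")
    info = parse_info(cols[7]) if len(cols) > 7 else {}
    if strict:
        # Hypothetical strict rule: every ALT allele is a single nucleotide.
        for allele in alts:
            if len(allele) != 1 or allele.upper() not in "ACGTN":
                raise VCFFormatError(f"multi-nucleotide ALT allele: {allele!r}")
    return Variant(chrom, int(pos), ref, alts, info)
```

With strict left off, a record whose ALT column is "GT,C" parses fine and
unknown INFO keys are simply carried along; with strict=True the same line
raises VCFFormatError, which is roughly what a rigid consumer would do.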

Pj.

On Mon, May 12, 2014 at 12:01:09PM +0200, Francesco Strozzi wrote:
> Hi Loris,
> *if* the VCF file is generated following general rules and guidelines, what may
> change is the presence / absence of keys in the INFO and GENOTYPE fields.
> Normally variation callers provide information on the INFO field composition in
> the VCF header. We will provide you with VCF example files generated with the
> latest versions of the most used calling software, i.e. Samtools (v2.0-rc7),
> FreeBayes (v0.9.14) and GATK (v3.1) so you can have a look at differences.
> 
> Francesco
> 
> 
> On Mon, May 12, 2014 at 11:55 AM, Loris Cro <l.cro at campus.unimib.it> wrote:
> 
>     Pjotr pointed out in another discussion that VCF files have
>     some differences depending on what program generated them.
> 
>     How can I find out more about these differences? Or is it only a
>     matter of custom keys in the INFO and/or FORMAT field structure?
>     Basically I'm wondering if the VCF file specification is enough to
>     understand these differences.
> 
>     Also, since the objective is to work with multiple files, some fields
>     seem to lose meaning in that context. Is there any convention
>     regarding that matter?
> 
> 
>     2014-04-29 19:46 GMT+02:00 Loris Cro <l.cro at campus.unimib.it>:
> 
>     > Hi all! Let me start the discussion with some info about what I've
>     > done, what I'm planning to do next and what questions I need help with.
>     >
>     >
>     > As far as I understand, the project idea published on the OBF
>     > wiki was primarily to answer the problem of computing privates.
>     > There are other features that you are interested in, but this
>     > specific problem was the biggest pain point. I say "was" because
>     > in fact computing privates is not that hard in the end (no JOINs or
>     > heavy denormalization required anymore) as you can see by
>     > reading:
>     >
>     >       https://gist.github.com/kappaloris/11356517
>     >
>     > In fact now it seems to me that this tool would be best implemented
>     > as a library (with support for all the features mentioned in the gist) to
>     > be used in conjunction with tabix. (If anyone wants to help me write it
>     > in python, I reserved the name PrivatePy on pypi :3. I don't want to
>     > commit this early to extra work so, if you like the idea, please offer
>     > some help, I still have a DBMS to think about :) If you want to write
>     > it in Ruby I can still help, ofc, no cool name tho).
>     >
>     > I prefer python because it seems to me that python is the language
>     > with the most educational value since the "private" concept is not
>     > private to biology alone and also it's the language I know best (and
>     > accordingly I can help most effectively with).
>     >
>     > Nevertheless, as I stated in my proposal, this script can also
>     > be implemented as a processing step during the import to the DBMS.
>     > Unless you're working with really huge amounts of data, you shouldn't
>     > expect the DBMS to be faster than a command-line utility, tho.
>     >
>     > Now, what I want to understand is how exactly VCF files constitute
>     > a bottleneck:
>     >
>     > 1. Regarding performance: are there other computationally heavy
>     >     operations (like privates once were :D )? Mixing filtering and other
>     >     "by row" rules doesn't really count as 'heavy', I'm talking about ugly
>     >     cross-referencing business.
>     >
>     > 2. Regarding current cases: what operations are really easy but made
>     >     tedious by lack of proper interfaces / inconsistent formats / ... that
>     >     this system should be expected to offer? An example would be the
>     >     possibility of doing a "walk-together" import of multiple VCF files.
>     >     This would also be extremely beneficial for making private-indexing
>     >     faster.
>     >
>     > 3. Regarding new cases: what new features should be considered a
>     >     must-have? For example, 1-click scalability? If you don't already
>     >     have a specific idea, don't worry: as the other details fall into
>     >     place I will offer some ideas depending on what the most plausible
>     >     solutions might offer.
>     >
>     > 4. Regarding ???: are there any other aspects that I'm missing?
>     >
>     >
>     > Please note that [1] is what I understand best. [2] especially is not
>     > easy for me: I don't work in a sequencing center so if you want to
>     > point something out please add a little context and don't be afraid
>     > to paste some example code of what you are doing now and how
>     > you think it should be done if you had this system already available.
>     >
>     > As of now, other than exploring my solution for computing
>     > privates some more, this is the first hurdle to jump. Talking with
>     > Francesco, it seemed that privates were the biggest computational
>     > problem, meaning that, unless someone points to something that
>     > I'm missing, the focus should be more on ease of use and less on
>     > "raw" performance (since every DBMS has its own kinks).
>     >
>     > Next week I will start writing the blog and publish some more
>     > information about what I'm doing, to make it easier to follow the
>     > development.
>     >
>     >
>     >
>     > Thank you all for your time and please don't bother wasting time
>     > on courtesy/etiquette: if you have any objections share them as
>     > soon as possible, I don't mind criticism and "form" is just a prefix
>     > of formalism.
>     >
>     >
>     >
>     >
>     > 2014-04-27 8:02 GMT+02:00 Pjotr Prins <pjotr.public14 at thebird.nl>:
>     >
>     >> Hi Lori,
>     >>
>     >> Congrats from BioRuby for being accepted as a GSoC student this year!
>     >>
>     >> To all others: Lori is going to work on a scalable NoSQL VCF container
>     >> for BioRuby with Francesco as a primary mentor. I am pretty excited
>     >> about this project - VCF parsing is quite a bottleneck in many
>     >> sequencing centers (including ours) and a NoSQL solution may just be
>     >> the right idea.
>     >>
>     >> From here on we will discuss this project on this ML.
>     >>
>     >> Lori, I am on IRC this morning, if you like to chat.
>     >>
>     >> Pj.
>     >>
>     >
>     >
>     _______________________________________________
>     BioRuby Project - http://www.bioruby.org/
>     BioRuby mailing list
>     BioRuby at lists.open-bio.org
>     http://lists.open-bio.org/mailman/listinfo/bioruby
> 
> 
> 
> 
> --
> 
> Francesco Strozzi


