[GSoC] GSoC 2014 queries and inputs

Fields, Christopher J cjfields at illinois.edu
Wed Mar 19 14:26:43 UTC 2014


On Mar 19, 2014, at 8:28 AM, Artem Tarasov <lomereiter at gmail.com> wrote:

On Tue, Mar 18, 2014 at 11:44 PM, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:
What's the difference between SAM and VCF?

SAM: mapping software aligns reads against the reference genome (and its reverse complement) and writes to a SAM/BAM file information about the best alignment of each read (which strand it aligned to, how it differs from the reference, and so on).

VCF: here it is not reads but positions on the reference genome that are considered, and each record contains information about whether there's variability at that position. VCF files are produced from SAM files by considering the reads overlapping each position: if a statistically significant number of reads have a base different from the reference (or an insertion/deletion), this is probably a true mutation, which might have biological significance as well.
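
To make the reads-to-positions step concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: the reads are ungapped (no insertions/deletions), a fixed depth and fraction cutoff stands in for a real statistical test, and the function name, parameters and toy data are all hypothetical. Real variant callers are far more careful than this.

```python
from collections import Counter, defaultdict


def call_variants(reference, reads, min_depth=3, min_fraction=0.5):
    """Toy pileup-based caller (hypothetical helper, not a real tool).

    reads: list of (start, sequence) pairs, where start is the 0-based
    position on the reference and the alignment has no gaps.
    Returns (1-based position, ref base, alt base, depth, alt count) tuples.
    """
    # SAM-like view: one entry per read. Turn it into a per-position pileup.
    pileup = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos < len(reference):
                pileup[pos][base] += 1

    # VCF-like view: one entry per variable position on the reference.
    variants = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        ref_base = reference[pos]
        alt_counts = {b: n for b, n in counts.items() if b != ref_base}
        if not alt_counts or depth < min_depth:
            continue
        alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
        # Simple cutoff used here in place of a proper significance test.
        if alt_n / depth >= min_fraction:
            variants.append((pos + 1, ref_base, alt_base, depth, alt_n))
    return variants


reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTA"), (4, "TCGTAC")]
print(call_variants(reference, reads))
```

With this toy data, three of the four reads covering position 5 carry a T where the reference has an A, so that position is the only one reported.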

For JRuby, I'd recommend using Picard. No need to reinvent the wheel. Plus, you might also want to support VCF's binary counterpart, the BCF format.


--
Artem

Yep, if you're planning on going through the JVM then Picard is nice and supports VCF (and BCF, it seems).  No CRAM support, but there is this:

   https://github.com/enasequence/cramtools

(see the section on Picard integration)

chris



