[GSoC] GSoC 2014 queries and inputs

Artem Tarasov lomereiter at gmail.com
Wed Mar 19 13:28:55 UTC 2014


On Tue, Mar 18, 2014 at 11:44 PM, Ujjwal Thaakar <ujjwalthaakar at gmail.com> wrote:

> What's the difference between SAM and VCF?


SAM: mapping software aligns reads against the reference genome (and its
reverse complement) and writes information about the best alignment of
each read to a SAM/BAM file (which strand it aligned to, how it differs
from the reference, and so on).
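
As a rough illustration of what one SAM record carries, here is a minimal
Python sketch that splits a single alignment line into the mandatory
columns defined by the SAM format and derives the strand from the FLAG
field (bit 0x10 is set when the read aligned to the reverse strand). The
function name parse_sam_line and the example record are made up for
illustration; a real project would use an existing library instead.

```python
# Minimal sketch: split one SAM alignment line into its mandatory fields.
# The example record at the bottom is invented for illustration.

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]

REVERSE_STRAND = 0x10  # SAM flag bit: read aligned to the reverse strand


def parse_sam_line(line):
    """Return a dict with the 11 mandatory SAM fields, strand, and raw tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    # These columns are integers in the SAM format.
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["strand"] = "-" if record["flag"] & REVERSE_STRAND else "+"
    record["tags"] = cols[11:]  # optional TAG:TYPE:VALUE fields, left unparsed
    return record


example = ("read1\t16\tchr1\t100\t60\t5M1I4M\t*\t0\t0\t"
           "ACGTACGTAC\tIIIIIIIIII\tNM:i:1")
rec = parse_sam_line(example)
print(rec["rname"], rec["pos"], rec["strand"], rec["cigar"])
```

The CIGAR string (5M1I4M here) is where the differences from the
reference show up: five aligned bases, one inserted base, four more
aligned bases.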

VCF: here the records describe positions on the reference genome rather
than reads, and each record says whether there is variability at that
position. VCF files are produced from SAM files by considering the reads
overlapping each position: if a statistically significant number of reads
have a base different from the reference (or an insertion/deletion), this
is probably a true mutation, which might have biological significance as
well.
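
The idea of looking at all reads overlapping a position can be sketched
in a few lines of Python. This is a toy, not how real variant callers
work: they use proper statistical models and base/mapping qualities,
whereas this sketch only applies a fixed depth and fraction cutoff. The
function name call_position and both thresholds are assumptions chosen
for the example.

```python
from collections import Counter


def call_position(ref_base, read_bases, min_depth=8, min_fraction=0.2):
    """Toy caller for one reference position.

    read_bases holds the base each overlapping read shows at this position.
    Returns the most common non-reference base if enough reads support it,
    otherwise None. The thresholds are arbitrary illustrative values.
    """
    depth = len(read_bases)
    if depth < min_depth:
        return None  # too few reads to say anything
    alt_counts = Counter(b for b in read_bases if b != ref_base)
    if not alt_counts:
        return None  # every read agrees with the reference
    alt, alt_count = alt_counts.most_common(1)[0]
    if alt_count / depth >= min_fraction:
        return alt
    return None  # mismatches look like sequencing noise


# 10 reads over a position whose reference base is A; half of them show G.
print(call_position("A", list("AAAAGGGGGA")))  # G
# A single mismatching read out of 10 is treated as noise.
print(call_position("A", list("AAAAAAAAAT")))  # None
```

A position that passes such a test is what ends up as one record in the
VCF file.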

For JRuby, I'd recommend using Picard, which is written in Java. No need
to reinvent the wheel. Plus, you might also want to support VCF's binary
counterpart, the BCF format.


--
Artem


