[BioRuby] GSOC

Mon May 12 19:48:44 UTC 2014

> I think it is better to stick to storing data in a row wise fashion (by
> variant, SNP, record).
>
> Queries are typically row based. Speedy parallel processing will be
> possible when all rows are independent of each other.

That is what I meant in the example. The collection containing the
actual records, MYPIGS in this case, is composed of one object per record.
Each object is keyed by 'CHROM-POS' and contains the information you find
in a row of a VCF file, with some fields multiplexed when the data comes
from multiple VCF files.
If new samples are added to a particular collection at a later point in
time, the existing object is updated whenever the new data's coordinates
coincide with it (meaning that there will always be only one object per
chromosome-position).
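
To make the layout concrete, here is a minimal sketch in Python of the
"one object per CHROM-POS" idea. It uses a plain dict as a stand-in for the
MYPIGS collection; the function name, the "per_file" field and the sample
values are made up for illustration and are not the actual schema.

```python
# Hypothetical in-memory stand-in for the MYPIGS collection:
# one object per chromosome-position, keyed by 'CHROM-POS'.
def upsert_record(collection, chrom, pos, vcf_file, fields):
    """Insert a new object, or update the existing one at the same coordinates."""
    key = f"{chrom}-{pos}"
    record = collection.setdefault(
        key, {"CHROM": chrom, "POS": pos, "per_file": {}}
    )
    # Fields that can differ between VCF files (QUAL, DP, ...) are
    # "multiplexed": stored once per source file (assumed layout).
    record["per_file"][vcf_file] = dict(fields)
    return record


mypigs = {}
upsert_record(mypigs, "1", 12345, "batch1.vcf", {"REF": "A", "ALT": "G", "QUAL": 50})
# A later import hitting the same coordinates updates the same object.
upsert_record(mypigs, "1", 12345, "batch2.vcf", {"REF": "A", "ALT": "G", "QUAL": 30})
upsert_record(mypigs, "2", 777, "batch2.vcf", {"REF": "C", "ALT": "T", "QUAL": 99})

print(len(mypigs))                             # number of distinct positions
print(sorted(mypigs["1-12345"]["per_file"]))   # files contributing to 1-12345
```

Since every object depends only on its own key, rows stay independent of
each other and can still be processed in parallel.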

VCF FILES and SAMPLES are separate tables/collections used to store
metadata.
Either you misread my example or I completely misunderstood your objection.

Regarding the REF field, I really have no idea what your use case would be.
I've been asked to support displaying the results of queries as a VCF file
(somewhat mimicking vcf-merge), which seems an agreeable feature. To do that
I must keep all the data of each file to make sure the merged result makes
sense (for example, different files might have different quality or depth
for the same record, and the merged result should be a user-defined
composition of each value).
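
As a rough illustration of what I mean by "user-defined composition", here
is a small Python sketch. The record layout, the "per_file" field and the
merge_field helper are hypothetical, and this is not how vcf-merge itself
works; it only shows why the per-file values have to be kept around.

```python
# Hypothetical record holding the per-file values for one CHROM-POS.
record = {
    "CHROM": "1",
    "POS": 12345,
    "per_file": {
        "batch1.vcf": {"QUAL": 50, "DP": 12},
        "batch2.vcf": {"QUAL": 30, "DP": 18},
    },
}


def merge_field(record, field, combine):
    """Combine one field across all source files with a user-supplied function."""
    values = [
        fields[field]
        for fields in record["per_file"].values()
        if field in fields
    ]
    # No file carried this field: nothing to merge.
    return combine(values) if values else None


# The user decides the policy per field, e.g. worst quality and total depth.
merged_qual = merge_field(record, "QUAL", min)
merged_dp = merge_field(record, "DP", sum)
print(merged_qual, merged_dp)
```

If only one already-merged value were stored, a different policy (say, max
quality instead of min) could not be applied later.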

Uhmm... while I was writing this reply I think I got why you don't care
about the partial information that the REF field yields: unless you know all
the reference genomes in their entirety, you're going to miss all rows where
each sample has the same genotype as its reference (basically "missing"
records), potentially resulting in a lot of false negatives.

If that's the case then you're totally right.