[GSoC] GSoC python variant update 5
Lenna Peterson
arklenna at gmail.com
Mon Jun 18 04:21:42 UTC 2012
Latest post: http://arklenna.tumblr.com/post/25343434817/
James raised some
[concerns](http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009688.html)
about the difficulty of representing the VCF "metaformat" in SQL. I've
taken these into consideration and am forging ahead. So far, some
types of data fit more neatly into SQL than into a VCF row.
I have redesigned my SQL schema with a two-pronged approach to tackle
the flexibility of VCF:
1. For the site, alt, and genotype tables, there are columns for the
reserved info/format keywords in the VCF spec (so far only for
records that are not structural variants, i.e. non-SV).
2. For new info and format keywords (both in the header and in the
body), I am storing the values in a "narrow table." This table stores
a foreign key to the key's row and the key-value pair. The narrow
table is also good for storing reserved keys that are lists (but not
per-allele or per-genotype).
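To make the two prongs concrete, here is a minimal sketch using Python's
built-in sqlite3 module. It is not the actual schema: the table names
(site, site_info), the column names, and the insert_site helper are all
hypothetical, only two reserved INFO keys (DP and NS) get their own
columns, and the narrow table is assumed to point back at the site row.
List values are assumed to be stored as one narrow-table row per element.

```python
import sqlite3

# Hypothetical mapping from reserved INFO keys to dedicated columns.
RESERVED_SITE_COLUMNS = {"DP": "dp", "NS": "ns"}

# Illustrative schema only: one "wide" table with columns for reserved
# keys, and one narrow table holding (foreign key, key, value) rows.
SCHEMA = """
CREATE TABLE site (
    id    INTEGER PRIMARY KEY,
    chrom TEXT NOT NULL,
    pos   INTEGER NOT NULL,
    ref   TEXT NOT NULL,
    dp    INTEGER,
    ns    INTEGER
);
CREATE TABLE site_info (
    site_id    INTEGER NOT NULL REFERENCES site(id),
    info_key   TEXT NOT NULL,
    info_value TEXT NOT NULL
);
"""


def insert_site(conn, chrom, pos, ref, info):
    """Insert one site; reserved keys go to columns, the rest to site_info."""
    reserved = {col: info.get(key) for key, col in RESERVED_SITE_COLUMNS.items()}
    cur = conn.execute(
        "INSERT INTO site (chrom, pos, ref, dp, ns) VALUES (?, ?, ?, ?, ?)",
        (chrom, pos, ref, reserved["dp"], reserved["ns"]),
    )
    site_id = cur.lastrowid

    rows = []
    for key, value in info.items():
        if key in RESERVED_SITE_COLUMNS:
            continue
        # A list value becomes one narrow-table row per element.
        values = value if isinstance(value, list) else [value]
        rows.extend((site_id, key, str(v)) for v in values)
    conn.executemany(
        "INSERT INTO site_info (site_id, info_key, info_value) VALUES (?, ?, ?)",
        rows,
    )
    return site_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
site_id = insert_site(
    conn, "20", 14370, "G", {"DP": 14, "NS": 3, "MYKEY": "foo", "XL": [1, 2]}
)
```

The trade-off in this sketch is that every value in the narrow table is
stored as text, so the type declared in the VCF header would have to be
consulted when reading the values back.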
Note: for simplicity, this diagram lists only the foreign keys (FKs).
(SQL diagram)
Interestingly, despite the increase in the number of tables and thus
insert statements, the current script is considerably faster than the
previous version. Evidently JSON serialization is slow.
There are a few things I haven't figured out:
1. Can an info field be per-genotype? The spec implies that wouldn't
make sense, but doesn't forbid it.
2. Is there a safe way to find out if a VCF 4.0 field is per-allele or
per-genotype?
3. Will my SQL representation be able to handle SV?
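A partial handle on question 2, offered as a sketch rather than an
answer: VCF 4.1 header lines can declare Number=A (one value per
alternate allele) or Number=G (one value per possible genotype),
whereas VCF 4.0 headers only carry an integer count or "." for unknown.
The helper below reads the Number attribute from an INFO or FORMAT
header line and classifies it. The function names and the category
labels are hypothetical, and a 4.0 field will only ever come back as
"fixed" or "unknown", so this does not resolve the 4.0 case.

```python
import re

# Matches the Number attribute inside an INFO/FORMAT header line.
NUMBER_RE = re.compile(r"Number=([^,>]+)")


def classify_number(number):
    """Map a Number attribute value to a (hypothetical) category label."""
    if number == "A":
        return "per-allele"      # VCF 4.1: one value per alternate allele
    if number == "G":
        return "per-genotype"    # VCF 4.1: one value per possible genotype
    if number.isdigit():
        return "fixed"           # fixed count, the only explicit form in 4.0
    return "unknown"             # "." or anything unrecognized


def header_cardinality(line):
    """Classify a header line such as '##INFO=<ID=AF,Number=A,...>'."""
    match = NUMBER_RE.search(line)
    if match is None:
        return "unknown"
    return classify_number(match.group(1))


af = header_cardinality('##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">')
gl = header_cardinality('##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihoods">')
old = header_cardinality('##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">')
```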
=======
I'll be out of town for the next week but I will have plenty of time for Python.