[GSoC] [Biopython-dev] GSoC python variant update 4

James Casbon casbon at gmail.com
Wed Jun 6 09:39:12 UTC 2012


I'd be cautious about using SQL as a VCF backend.  At least the
following two problems arise:

1. VCF isn't a single format, it's a meta-format, so there isn't one
data representation but many.  You are going to need a very flexible
schema to allow variable records with complex entries like lists.  (A
sample entry is defined dynamically by the FORMAT field in each row,
right?)  Falling back to a JSON misc entry means you lose all query
ability on those data anyway.

2. If you move your data away from VCF, you cannot use tools from outside
your universe.  E.g. let's say you want to use a GATK variant annotator:
you need to do the round trip SQL->VCF->SQL.
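
To make point 1 concrete, here is a minimal Python sketch of how the
keys of a sample entry change from row to row.  The function name
parse_sample and the two example rows are made up for illustration;
this is not PyVCF or Biopython code, and it keeps every value as a
string rather than applying the types declared in the header.

```python
def parse_sample(format_field, sample_field):
    """Pair the keys in a row's FORMAT column with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    entry = {}
    for key, value in zip(keys, values):
        # Comma-separated values become lists; everything stays a string.
        entry[key] = value.split(",") if "," in value else value
    return entry


# Two hypothetical rows from the same file with different FORMAT columns.
row_a = parse_sample("GT:DP", "0/1:14")
row_b = parse_sample("GT:AD:PL", "1/1:0,9:255,27,0")

print(sorted(row_a))  # keys present in the first row
print(sorted(row_b))  # a different set of keys in the second row
```

A fixed SQL schema would need a column (or a child table) for every key
that could ever show up, including the list-valued ones.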

I speak as someone who has already developed this approach and largely
abandoned it because of the problems above.

You are right that SQL would be a better solution for data indexing and
access (no serialization issues, multiple tuned indexes), but be aware
that you may spend a lot of time and not have a lot to show for it.  I
would really like it if biology used existing binary formats (HDF5
anyone?), but we don't.  A more practical use of time right now would be
BCF support.
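
For what it's worth, the trade-off between tuned indexes and a JSON
misc entry can be sketched with the standard library sqlite3 module.
The table layout, column names and rows below are assumptions for
illustration only, and the misc column is treated as opaque text (no
database-side JSON functions).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variant "
    "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, misc TEXT)"
)
# The fixed columns can carry a tuned index.
conn.execute("CREATE INDEX variant_pos ON variant (chrom, pos)")

# Hypothetical records; the per-sample data goes into the JSON misc column.
rows = [
    ("1", 1000, "A", "G", json.dumps({"GT": "0/1", "DP": "14"})),
    ("1", 2000, "C", "T", json.dumps({"GT": "1/1", "AD": ["0", "9"]})),
    ("2", 1500, "G", "A", json.dumps({"GT": "0/1", "DP": "3"})),
]
conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?)", rows)

# A region query stays in SQL and can use the index.
in_region = conn.execute(
    "SELECT chrom, pos FROM variant "
    "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
    ("1", 500, 2500),
).fetchall()

# A filter on depth has to fetch every row and decode the JSON in Python.
deep = []
query = "SELECT chrom, pos, misc FROM variant ORDER BY chrom, pos"
for chrom, pos, misc in conn.execute(query):
    entry = json.loads(misc)
    if int(entry.get("DP", 0)) >= 10:
        deep.append((chrom, pos))

print(in_region)
print(deep)
```

The region query is where SQL pays off; the depth filter is a full scan
with the real work done outside the database.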


