[GSoC] [Biopython-dev] GSoC python variant update 5

Brad Chapman chapmanb at 50mail.com
Tue Jun 19 00:28:11 UTC 2012


Lenna;
Thanks for the update. I've been following the commits on GitHub and it
looks like you're getting some traction with the SQL representation. I
do worry about it for some of the same reasons as James, but I'm happy to
have you take a look if it helps with your understanding of VCF.

I think it might also be worth thinking of some use cases that are not
well covered by the current PyVCF parser and seeing if your
representation tackles them better. One that is currently tough is
slicing a VCF file by sample. Row-based slicing is well supported but
column-based slicing is not as easy.

Say I had a 50-sample file: how well does it allow pulling out the
genotypes and records from a single sample and re-writing them as VCF?
Can you code up this type of workflow with your current representation?
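
As a baseline to compare against, here is a minimal sketch of the column
slice done directly on tab-delimited VCF text. It uses neither PyVCF nor
the SQL representation, and the function name slice_vcf_by_sample is
made up for illustration.

```python
def slice_vcf_by_sample(lines, sample):
    """Yield VCF lines keeping the fixed columns plus one sample column.

    `lines` is any iterable of VCF text lines; `sample` is a sample name
    that must appear in the #CHROM header line.
    """
    sample_idx = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines pass through unchanged.
            yield line
        elif line.startswith("#CHROM"):
            header = line.split("\t")
            # Columns 0-8 are CHROM..FORMAT; sample columns start at 9.
            sample_idx = header.index(sample, 9)
            yield "\t".join(header[:9] + [sample])
        elif line:
            if sample_idx is None:
                raise ValueError("data line found before #CHROM header")
            fields = line.split("\t")
            yield "\t".join(fields[:9] + [fields[sample_idx]])
```

This keeps every record, including ones where the chosen sample is
homozygous reference, and it does not recompute INFO values that were
derived across all samples. A real workflow would need to decide how to
handle both.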

For your specific questions:

> 1. Can an info field be per-genotype? The spec implies that wouldn't
> make sense, but doesn't forbid it.

The INFO key/values are per-variant. There are also arbitrary
per-genotype key/values allowed, specified in the FORMAT field.
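
To make the distinction concrete, here is a rough sketch that parses the
two kinds of key/values from one record. The helper names parse_info and
parse_genotypes are made up for illustration, and values are left as
strings rather than typed from the header.

```python
def parse_info(info):
    """Per-variant key/values from the INFO column; flags map to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # An entry without "=" is a flag such as DB.
        result[key] = value if sep else True
    return result


def parse_genotypes(format_col, sample_cols, sample_names):
    """Per-genotype key/values: FORMAT names the keys for every sample."""
    keys = format_col.split(":")
    return {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(sample_names, sample_cols)
    }
```

The INFO dictionary exists once per record, while the FORMAT keys are
repeated for each sample column.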

> 2. Is there a safe way to find out if a VCF 4.0 field is per-allele or
> per-genotype?

This should be the INFO/FORMAT distinction I described above.

> 3. Will my SQL representation be able to handle SV?

VCF encodes structural variation information into the INFO metadata, so
as long as you support the ALT fields specified for structural variants
it should fit. The longer term question is if you want to support more
explicit linking between distant breakends, which would require special
support. I think that's probably more of an end-of-the-summer goal,
however, since most people aren't yet doing tons of VCF structural
variation work.
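
As a starting point, a sketch like the following could tell the
structural variant ALT forms apart from plain sequence alleles. It
assumes the VCF 4.1 notation of symbolic alleles such as <DEL> and
bracketed breakend mates such as G]17:198982]. The name classify_alt is
made up, and linking mates together is not attempted.

```python
import re

# A breakend ALT embeds its mate position between matching brackets,
# for example "G]17:198982]" or "]13:123456]T".
BREAKEND = re.compile(r"[\[\]]([^:\[\]]+):(\d+)[\[\]]")


def classify_alt(alt):
    """Return (kind, detail) for a single ALT allele string."""
    if alt.startswith("<") and alt.endswith(">"):
        # Symbolic allele such as <DEL>, <INS> or <DUP:TANDEM>.
        return ("symbolic", alt[1:-1])
    match = BREAKEND.search(alt)
    if match:
        # Detail is the mate's (chromosome, position).
        return ("breakend", (match.group(1), int(match.group(2))))
    return ("sequence", alt)
```

Storing the mate chromosome and position in their own columns would be
one way to prepare for linking distant breakends later.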

Brad


