[Biopython-dev] GSoC python variant update 2

Mon May 21 01:35:08 UTC 2012

Lenna and Reece;

>> The reason I suggested a new representation class is so data from all
>> parsers can be stored in the same way.
>
> Lenna makes a very sound point. A Variant class should be able to represent
> all variant types, and therefore represent *only* the salient features of a
> generalized variant. It should not be specific to a particular format.

I'm in agreement with you. My thought process is along the lines of:
you'll help get to a general representation by exploring the
deficiencies of the more specific ones. I think it's hard to invent a
fully general scheme from outside.

> For instance, _Record expects a CHROM, but this immediately eliminates its
> use for transcript-based variants (NM or ENST). QUAL, FILTER, INFO, and
> FORMAT are not intrinsic properties of a variant. Don't get me wrong --
> it's exactly right for a *VCF* variant. However, _Record was never intended
> to be the variant abstraction that I think we should be aiming for at this
> time. Being VCF-specific isn't bad, but let's make sure the name accurately
> reflects the level of abstraction.

Also agreed, although you can fit a wide variety of things into this
general scheme. Ignoring all of the specific naming it's:

- the reference name (chromosome or space or contig or whatever you want to call it)
- position
- identifier
- ref/alt seqs (or pre/post)
- key-value pairs associated with the variant
- genotypes associated with the variant (also with key-value pairs)
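
As a rough illustration of that list, here is a minimal sketch of a format-neutral container. The name GenericVariant, its field names, and the sample values are all made up for this example; none of them is an existing Biopython or PyVCF API.

```python
from collections import namedtuple

# Hypothetical container: one field per item in the list above.
GenericVariant = namedtuple(
    "GenericVariant",
    ["ref_name", "position", "identifier", "ref", "alts", "info", "genotypes"],
)

# Illustrative values only, not real data.
variant = GenericVariant(
    ref_name="chr1",                       # chromosome, contig or transcript
    position=12345,
    identifier="example_id",
    ref="A",
    alts=["G"],
    info={"QUAL": 50, "FILTER": "PASS"},   # key-value pairs for the variant
    genotypes={"sample1": {"GT": "0/1"}},  # per-sample key-value pairs
)

print(variant.ref_name, variant.position, variant.info["QUAL"])
```

Nothing here is VCF-specific: QUAL and FILTER are just entries in the key-value pairs, and ref_name could equally hold a transcript accession.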

The real difference between this and your bio-hgvs-perl example is what
you expose as top level from the key-value pairs. VCF exposes QUAL and
FILTER (and I guess identifier too) while you had different choices that
were more right for your particular problem.

This is all brainstorming, rather than a specific suggestion. If I have
to think up something specific, I guess the right thing to do is make it
easy to build a custom object representation that makes coding easy for
specific problem sets from the more generic key/value information.
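
One way this could look, purely as a sketch: a small factory that builds a view class exposing chosen keys as top-level attributes. The helper make_view and the class VcfView are hypothetical names invented for this example, and the dictionary values are illustrative.

```python
def make_view(name, promoted):
    """Build a class that exposes selected info keys as attributes.

    Hypothetical helper for illustration, not an existing API.
    """
    promoted = frozenset(promoted)

    def __init__(self, info):
        self.info = dict(info)  # keep the full generic key/value data

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        if attr in promoted:
            return self.info.get(attr)
        raise AttributeError(attr)

    return type(name, (object,), {"__init__": __init__,
                                  "__getattr__": __getattr__})


# A VCF-flavoured view promotes QUAL and FILTER; another problem set
# could promote a different set of keys from the same generic data.
VcfView = make_view("VcfView", ["QUAL", "FILTER"])
record = VcfView({"QUAL": 50, "FILTER": "PASS", "DP": 14})

print(record.QUAL, record.FILTER, record.info["DP"])
```

Keys that are not promoted stay reachable through record.info, so the generic data is never lost.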

> Has anyone ever polled to see what versions of python people are using? I
> wonder whether we should care about 2.6 even (never mind 2.5). My guess is
> that 2.5 and 2.6 are tails of the distribution (as is 3.0, but at least
> it's ascending). I would be content to focus exclusively on 2.7 and
> 3.0.

I agree, although in practice dropping 2.6 support in Biopython won't
happen for a while. Unless there are 2.7 features that we really need, it
shouldn't be too hard to support both. I only miss the multiple context
manager support for with statements, and haven't let myself get hooked
on ordered dicts or dictionary comprehensions yet.
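
For reference, a small sketch of the 2.6-compatible spellings of two of those features. The resource context manager and the key names are invented for the example.

```python
from contextlib import contextmanager

events = []


@contextmanager
def resource(name):
    # Stand-in for a file handle or similar; hypothetical helper.
    events.append("open " + name)
    try:
        yield name
    finally:
        events.append("close " + name)


# 2.7 allows: with resource("in") as src, resource("out") as dst:
# The 2.6-compatible form nests the with statements instead.
with resource("in") as src:
    with resource("out") as dst:
        events.append("copy %s -> %s" % (src, dst))

# 2.7 allows: {k: len(k) for k in keys}
# The 2.6-compatible form passes a generator expression to dict().
keys = ["ref", "alt"]
lengths = dict((k, len(k)) for k in keys)

print(events)
print(lengths)
```

Both forms also run unchanged on 2.7 and 3.x, so code written this way does not need version checks.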

Brad