[GSoC] [Biopython-dev] GSoC python variant update 8

Chris Mitchell chris.mit7 at gmail.com
Fri Jul 27 23:17:13 UTC 2012


Sorry for my brevity, but one great reason to scan a VCF file is to know
where your variants are for downstream analysis.  For instance, when
analyzing RNA-Seq data for features such as Allele Specific Expression,
having quick access to where variants are located is essential.
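
To make that concrete, here is a rough sketch of the kind of position
index I have in mind. It is plain Python and does not use PyVCF; the
function names `index_variant_positions` and `variants_in_region` are
made up for illustration, and it assumes a tab-delimited VCF with CHROM
and POS in the first two columns.

```python
import bisect
from collections import defaultdict


def index_variant_positions(handle):
    """Scan a VCF handle once and return {chrom: sorted list of 1-based POS}.

    Hypothetical helper: only CHROM and POS are read; everything else on
    the line is ignored.
    """
    positions = defaultdict(list)
    for line in handle:
        if line.startswith("#"):
            continue  # skip meta-information and the column header line
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        chrom, pos = fields[0], fields[1]
        positions[chrom].append(int(pos))
    # Sort so that lookups also work on files that are not position-sorted.
    for chrom in positions:
        positions[chrom].sort()
    return dict(positions)


def variants_in_region(index, chrom, start, end):
    """Return variant positions with start <= POS <= end (1-based, inclusive)."""
    sites = index.get(chrom, [])
    lo = bisect.bisect_left(sites, start)
    hi = bisect.bisect_right(sites, end)
    return sites[lo:hi]
```

With the index in memory, asking which variants fall inside an exon or
under a read is a binary search rather than another pass over the file.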

On Thu, Jul 26, 2012 at 6:30 PM, Lenna Peterson <arklenna at gmail.com> wrote:

> Link: http://arklenna.tumblr.com/post/28082157403/
>
> Post:
>
> I previously proposed the implementation of a method for PyVCF that
> would quickly scan the entire file and provide useful summary
> statistics. The idea is shamelessly copied from Brad's GFF parser (see
> https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this
> method is helpful because the annotations on a sequence can vary
> widely. However, I no longer think this would be useful for VCF:
>
> 1. First, the VCF headers generally contain a complete listing of
> all of the types of information contained in the file. The listing is
> technically optional, but I hope that the most commonly used variant
> callers produce accurate headers. However, if files with a mismatch
> between the headers and the actual INFO/FORMAT fields are common,
> please let me know.
>
> 2. Next, any listing of ranges of data such as POS or QUAL might as
> well be coupled with actual filtering. This would be different if a
> presentation of the distribution of quality scores were necessary
> to set an appropriate threshold. It would also depend on the relative
> speed of the range scan and the filtering (i.e. whether a
> possible second filtering pass would be unacceptably time consuming).
>
> 3. Finally, and perhaps most importantly, many files are so large that
> scanning an entire file would take too long. Setting a limit and
> displaying updated information in real time (i.e. writing to
> `sys.stdout` with '\r', https://gist.github.com/3161269 ) could
> overcome this issue.
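
Points 1 and 3 above could be combined in one capped pass. The
following is only a sketch under my own assumptions, not PyVCF code:
the name `scan_info_keys` and its parameters are hypothetical, INFO is
assumed to be the eighth tab-delimited column, and header declarations
are assumed to start with `##INFO=<ID=`. It stops after `limit`
records, rewrites a progress counter in place with '\r', and reports
INFO keys that appear in records but are not declared in the header.

```python
import sys


def scan_info_keys(handle, limit=100000, report_every=1000, out=sys.stdout):
    """Scan at most `limit` records; return (records_scanned, undeclared_keys).

    Hypothetical helper: compares INFO keys declared in the header with
    the keys actually seen in the scanned records.
    """
    prefix = "##INFO=<ID="
    declared = set()
    observed = set()
    scanned = 0
    for line in handle:
        if line.startswith(prefix):
            # '##INFO=<ID=DP,Number=1,...>' -> 'DP'
            declared.add(line[len(prefix):].split(",", 1)[0].rstrip(">\n"))
            continue
        if line.startswith("#") or not line.strip():
            continue  # other header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed record, no INFO column
        info = fields[7]
        if info != ".":
            for entry in info.split(";"):
                if entry:
                    # Flags such as 'DB' have no '=' and are kept whole.
                    observed.add(entry.split("=", 1)[0])
        scanned += 1
        if scanned % report_every == 0:
            # '\r' returns to the start of the line so the count updates in place.
            out.write("\r%d records scanned" % scanned)
            out.flush()
        if scanned >= limit:
            break
    if scanned >= report_every:
        out.write("\n")  # finish the progress line
    return scanned, observed - declared
```

Because the scan is capped, an empty result only says that the first
`limit` records agree with the header, not that the whole file does.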
>
> If any VCF users can think of a great reason to scan a VCF file before
> filtering it, please get in touch.
>
> -------
>
> I added the method `as_SeqFeature()` to my basic variant class, but
> it's still incomplete. Some of this is in flux due to forthcoming
> changes to FeatureLocation.
>
> I'm currently working on expanding the coordinate mapper Reece posted
> to the dev list a couple of years ago (see
> http://biopython.org/pipermail/biopython/2010-June/006598.html ).
> Expect an update on that very soon.
>
> Best,
>
> Lenna
