[GSoC] GSoC python variant update 8
Lenna Peterson
arklenna at gmail.com
Thu Jul 26 22:30:35 UTC 2012
Link: http://arklenna.tumblr.com/post/28082157403/
Post:
I previously proposed the implementation of a method for PyVCF that
would quickly scan the entire file and provide useful summary
statistics. The idea is shamelessly copied from Brad's GFF parser (see
https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this
method is helpful because the annotations on a sequence can vary
widely. However, I no longer think this would be useful for VCF:
1. First, the VCF headers generally contain a complete listing of all
of the types of information contained in the file. These declarations
are technically optional, but I hope that the most commonly used
variant callers produce accurate headers. If files with a mismatch
between headers and actual INFO/FORMAT fields turn out to be common,
please let me know.
2. Next, any listing of ranges of data such as POS or QUAL might as
well be coupled with actual filtering. The exception would be if
seeing the distribution of quality scores were necessary to choose an
appropriate threshold. It would also depend on the relative speed of
the range scan and the filtering (i.e. whether a possible second
filtering pass would be unacceptably time-consuming).
3. Finally, and perhaps most importantly, many files are so large that
scanning an entire file would take too long. Setting a limit and
displaying updated information in real time (i.e. writing to
`sys.stdout` with '\r', https://gist.github.com/3161269 ) could
overcome this issue.
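To make the limited scan in point 3 concrete, here is a minimal sketch.
It assumes plain tab-separated VCF text and parses it by hand rather
than through PyVCF; the names `scan_vcf`, `limit`, and `report_every`
are hypothetical and not part of any existing API. It stops after
`limit` records, rewrites a single status line using '\r', and also
collects INFO keys that the header does not declare (the mismatch
mentioned in point 1).

```python
import io
import sys


def scan_vcf(handle, limit=100000, report_every=1000, out=sys.stdout):
    """Scan at most `limit` records and return simple summary statistics.

    Hypothetical helper: hand-parses VCF text, does not use PyVCF.
    """
    declared = set()    # INFO IDs declared in ##INFO header lines
    undeclared = set()  # INFO keys seen in records but never declared
    qual_min = qual_max = None
    count = 0
    prefix = "##INFO=<ID="
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            # The ID runs up to the first comma of the declaration.
            declared.add(line[len(prefix):].split(",", 1)[0].rstrip(">"))
            continue
        if not line or line.startswith("#"):
            continue  # other header lines and blank lines
        if count >= limit:
            break  # stop early instead of reading the whole file
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # skip malformed records
        count += 1
        if fields[5] != ".":  # QUAL column; "." means missing
            qual = float(fields[5])
            qual_min = qual if qual_min is None else min(qual_min, qual)
            qual_max = qual if qual_max is None else max(qual_max, qual)
        for item in fields[7].split(";"):  # INFO column
            key = item.split("=", 1)[0]
            if key != "." and key not in declared:
                undeclared.add(key)
        if count % report_every == 0:
            # '\r' returns to the start of the line so the status is
            # overwritten in place rather than scrolling.
            out.write("\r%d records, QUAL %s to %s" % (count, qual_min, qual_max))
            out.flush()
    out.write("\n")
    return {
        "records": count,
        "qual_min": qual_min,
        "qual_max": qual_max,
        "undeclared_info": undeclared,
    }


if __name__ == "__main__":
    # Tiny in-memory example; a real call would pass an open file handle.
    demo = io.StringIO(
        "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Depth\">\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
        "1\t100\t.\tA\tT\t30\tPASS\tDP=10;AF=0.5\n"
    )
    print(scan_vcf(demo, limit=10, report_every=1))
```

Whether this is worth doing still depends on how the cost of the scan
compares with the cost of the filtering pass itself.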
If any VCF users can think of a great reason to scan a VCF file before
filtering it, please get in touch.
-------
I added the method `as_SeqFeature()` to my basic variant class, but
it's still incomplete. Some of this is in flux due to forthcoming
changes to FeatureLocation.
I'm currently working on expanding the coordinate mapper Reece posted
to the dev list a couple of years ago (see
http://biopython.org/pipermail/biopython/2010-June/006598.html ).
Expect an update on that very soon.
Best,
Lenna