[Biopython] API support for finding polymorphisms?

Tue Oct 19 03:04:00 UTC 2010

----- Original Message -----

> Hi, Alex. Are you working from short read data? If so, what platform?
> In what format are the aligned data?

Hi Sean,

I'm actually working from yeast literature data released by Sanger:

http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html

The raw data is available via FTP in several formats, including FASTQ;
see the PDF manual for more information:

http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp_manual.pdf

The original data is a mixture of Solexa/Illumina and ABI, different
platforms for different yeast strains.

They include a Perl script (alicat.pl) that can parse some of the
alignments that they had performed already (including both sequence
alignments with errors as well as imputed sequences with errors
and missing data corrected).  I have been working with the imputed 
alignments as I didn't want to go all the way back and re-align from
scratch all the raw data.

I could probably hack the Perl script to do some of what I need (it
already has a facility to print out only polymorphic positions from
the imputed alignments), but I would like a more robust Python-based 
solution.  My first thought was to use the alicat.pl script to output
the alignments and the imputed sequences, convert them into full
sequences, and then use a Python-based solution from there to identify
and classify the individual polymorphisms.
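
A minimal sketch of that last step, in plain Python with no Biopython
calls. It assumes the imputed sequences have already been exported as
equal-length aligned strings keyed by strain name. The function names,
the strain labels, and the toy sequences are all made up for
illustration; none of this comes from alicat.pl or the SGRP data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def polymorphic_columns(seqs, ignore=("-", "N", "?")):
    """Map each variable column index (0-based) to its sorted alleles.

    `seqs` is a dict of strain name -> aligned sequence string.
    Characters in `ignore` (gaps, missing data) are not counted as alleles.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    skip = set(ignore)
    sites = {}
    for i in range(lengths.pop()):
        alleles = {s[i].upper() for s in seqs.values()} - skip
        if len(alleles) > 1:
            sites[i] = sorted(alleles)
    return sites


def classify_snp(alleles):
    """Label a set of alleles as transition, transversion or multiallelic."""
    alleles = set(alleles)
    if len(alleles) != 2:
        return "multiallelic"
    if alleles <= PURINES or alleles <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Toy alignment (hypothetical sequences, placeholder strain labels).
alignment = {
    "strain_A": "ATGGCTAAA",
    "strain_B": "ATGGCCAAA",
    "strain_C": "ATGNCTAAG",
}

for pos, alleles in polymorphic_columns(alignment).items():
    print(pos, alleles, classify_snp(alleles))
```

The N in strain_C is treated as missing data, so that column is not
reported as polymorphic.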

At the moment, I'm only interested in looking at a couple of specific
genes, so it's not a genome-wide survey (i.e. I only need to keep
one or two genes and alignments in memory at once), but I'd like the
solution to be generalizable, so that I could specify any yeast gene in
SGD and include polymorphisms in both promoters and coding regions.
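
For the promoter versus coding distinction, one possible sketch is to
label each alignment column by where it falls relative to the gene's
coding sequence. The coordinates here are hypothetical 0-based column
offsets within the alignment window (not real SGD coordinates), the
function name is invented, and the sketch assumes an ungapped,
intron-free gene where everything upstream of the CDS in the window
counts as promoter.

```python
def locate(pos, cds_start, cds_end):
    """Return (region, codon_position) for a 0-based alignment column.

    cds_start is the first coding column, cds_end is one past the last.
    codon_position is 1, 2 or 3 inside the CDS and None elsewhere.
    """
    if pos < cds_start:
        return ("promoter", None)
    if pos >= cds_end:
        return ("downstream", None)
    return ("coding", (pos - cds_start) % 3 + 1)


# Hypothetical window: 500 columns of upstream sequence, then a 300 bp CDS.
for pos in (120, 500, 505, 800):
    print(pos, locate(pos, cds_start=500, cds_end=800))
```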

Alex