[Biopython] Samtools Pileup format - NGS data

Wed Oct 20 16:44:38 UTC 2010

On Wed, Oct 20, 2010 at 12:16 PM, Adrian Johnson
<oriolebaltimore at gmail.com> wrote:

> Dear group,
>
> I am wondering about any functionality in BioPython that deals with
> annotation of SNPs identified through NGS pipelines.
>
> For instance if given a Pileup format :
>
> chr1    799195  *       */+G    115     115     33      37      *       +G
> chr1    811750  a       G       36      36      60      3       Ggg     AB?
> chr1    815761  C       A       2       33      46      3       A.a     CCC
> chr1    815777  C       T       2       33      46      3       T.t     CCC
>
>
> Now it would be very interesting to have a module that connects to
> NCBI or UCSC servers and answers the following questions:
>
> 1. Identify the mutation type at a given position on a chromosome
> (e.g. 815777 @ chr1). The mutation could be synonymous, frame-shift, etc.
>
> 2. Get gene name, accession and protein accession.
>
> 3. Get the type of amino-acid change such as Gly -> Ser
>
> 4. Check whether this SNP is observed in dbSNP, 1000 Genomes data and
> other mutation databases.
>
> 5. Get the allele frequencies from dbSNP for this SNP if found in dbSNP
>
> 6. Location of the SNP - viz.  intron, 5'UTR, 3'UTR or splice site.
>
>
> A web service from the Shendure lab is available for this type of
> question. Given MAQ or Pileup format, this website reports answers to
> all the questions above.  However, the website is slow and cannot be
> used in a pipeline.
>
> Is any BioPython user or developer working on this kind of functionality?
>
>
Hi, Adrian.  You might look at the SIFT application.  It can be downloaded
and includes precomputed results for 1, 2, 3, and the dbSNP part of 4 as
several SQLite database files.  We dump those databases out and use the text
files directly.  With BEDtools (and there are Python libraries like
bx-python with similar functionality), number 6 is also quite
straightforward (basically a single command line).  If you have other
tab-delimited text files with genomic features of interest, consider using
tabix (from the samtools site) to index the compressed, sorted files.  tabix
includes a Python wrapper that allows nearly instantaneous overlap queries
and returns rows from the text file.
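
To make the overlap idea behind number 6 concrete, here is a minimal
pure-Python sketch.  It is not BEDtools, bx-python or the tabix wrapper, just
a linear scan that shows the kind of question those tools answer much faster.
The first four column meanings (chromosome, 1-based position, reference base,
consensus base) are my reading of the consensus pileup layout, and the
FEATURES intervals and their labels are made up for illustration, not real
annotation.

```python
from collections import namedtuple

# Only the first four pileup columns are used here; their meaning
# (chrom, 1-based pos, reference base, consensus base) is an assumption
# about the consensus pileup layout.
PileupSite = namedtuple("PileupSite", "chrom pos ref consensus")


def parse_pileup_line(line):
    """Split one whitespace-delimited pileup row into a PileupSite."""
    fields = line.split()
    return PileupSite(fields[0], int(fields[1]), fields[2], fields[3])


def is_indel(site):
    """Indel rows carry '*' in the reference-base column."""
    return site.ref == "*"


def is_snp(site):
    """Simplification: any consensus that differs from the reference."""
    return not is_indel(site) and site.consensus.upper() != site.ref.upper()


# Hypothetical annotation: (chrom, start, end, label), 1-based inclusive.
# These coordinates and labels are invented for the example.
FEATURES = [
    ("chr1", 811000, 812000, "intron"),
    ("chr1", 815700, 815800, "3'UTR"),
]


def features_at(site, features):
    """Return labels of all features that contain the site's position."""
    return [
        label
        for chrom, start, end, label in features
        if chrom == site.chrom and start <= site.pos <= end
    ]


PILEUP_TEXT = """\
chr1    799195  *       */+G    115     115     33      37      *       +G
chr1    811750  a       G       36      36      60      3       Ggg     AB?
chr1    815761  C       A       2       33      46      3       A.a     CCC
chr1    815777  C       T       2       33      46      3       T.t     CCC
"""

sites = [parse_pileup_line(row) for row in PILEUP_TEXT.splitlines() if row.strip()]

for site in sites:
    kind = "indel" if is_indel(site) else ("SNP" if is_snp(site) else "ref")
    print(site.chrom, site.pos, kind, features_at(site, FEATURES))
```

For real annotation files with many intervals, a linear scan like this gets
slow, which is where an indexed lookup such as tabix pays off.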

Sean