[Bioperl-l] how to get the protein sequences from DNA sequences around novel SNPs?

Tue Nov 10 04:58:32 UTC 2009

On Nov 9, 2009, at 3:15 PM, Robert Bradbury wrote:

> On Mon, Nov 9, 2009 at 1:08 PM, Guangchun Song <gc11song at gmail.com> wrote:
>>
>> I'm a new BioPerl user.  I'm working on a project to determine the
>> status of all putative SNPs (non-synonymous vs. synonymous) and to
>> predict the translational effect of non-synonymous mutations as
>> benign or deleterious.  I'm trying to use BioPerl to get the DNA
>> sequence and translate it to protein sequence for the SNPs that are
>> in a gene's coding region.  Could someone tell me how to do it?
>>
>>
> I too would like to know if this information is available.  I've
> recently been working with the dbSNP results from NCBI, but they
> display the results in a graphical format rather than as data that one
> can play with and ask questions of, like "What is the most
> disease-causing gene in the Human Genome?" or "What are the critical
> proteins damaged by gene defects in the Human Genome?" ... "In terms of
> premature deaths, extended health care requirements, loss of quality
> of life, etc.?"
>
> The same types of questions can be applied to the dog and cat genomes,
> where there is emotional value, or the cow, horse, pig, etc. genomes,
> where there is economic value.
>
> The value of BioPerl would increase significantly if there were
> functionality that allowed easy access to "these mutations may have
> negative/positive impact" (which means you need a function that
> qualifies mutations by degree) and allowed impact to be determined
> subjectively (implying there must be some callback function to
> provide a user quality/impact rating).
>
> For example:
>   my @differences = protein_compare($mygene, $refseq_gene,
>                                     \@critical_aa, \@critical_domain,
>                                     $callback);
> where $callback could "rate" differences based on the protein, the
> position, and the "type of interest" (e.g. metal-binding amino acids,
> structure-changing amino acids, critical catalytic amino acids, etc.).
>
> A default callback would be based on some evolving definition of
> "critical" changes, for example those which result in human disease.
>
> This is a "required" capability to be able to determine things like
> the "adaptability" of a species -- those with the fewest critical
> mutation points may have better adaptability to mutation-increasing
> circumstances.
>
> Please pardon any errors in Perl syntax/usage; it's been a while since
> I've written Perl and I'd really rather be coding in C.
>
> Robert

Most of the information from the SNP database is available in various
formats (see 'Retrieval Types' at the following link):

http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html

You can access this information, as well as the full XML, using  
something like the following script.

chris

------------------------------------------------

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;
use Bio::DB::EUtilities;

# Entrez search term, taken from the command line
my $term = shift or die "Usage: $0 <search term>\n";

# Search dbSNP and keep the result set on the NCBI history server
my $eutil = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                     -db         => 'snp',
                                     -term       => $term,
                                     -usehistory => 'y',
                                     -retmax     => 100);

my $hist = $eutil->next_History || die "No history returned";

# Reuse the same object to fetch the stored results as flat-file text;
# for SNP XML, change retmode to 'xml'
$eutil->set_parameters(-eutil   => 'efetch',
                       -history => $hist,
                       -retmode => 'text',
                       -rettype => 'flt');

# dumps to STDOUT
say $eutil->get_Response->content;
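
------------------------------------------------

On the original question (synonymous vs. non-synonymous), here is a
minimal sketch of one way to classify a substitution once you have the
coding sequence in hand. It assumes you already have the in-frame CDS
(starting at the first base of the start codon), the 1-based position
of the SNP within that CDS, and the variant base. The classify_snp
routine and the toy sequence are hypothetical, made up for
illustration; only Bio::Tools::CodonTable comes from BioPerl. It does
not handle indels, introns, or strand, and it says nothing about
whether a change is benign or damaging.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Tools::CodonTable;

# Standard genetic code (NCBI translation table 1)
my $table = Bio::Tools::CodonTable->new(-id => 1);

# Hypothetical helper: classify a single-base substitution in a CDS.
#   $cds : in-frame coding DNA, starting at the first base of codon 1
#   $pos : 1-based position of the SNP within $cds
#   $alt : the variant base
# Returns (type, reference amino acid, variant amino acid, codon number).
sub classify_snp {
    my ($cds, $pos, $alt) = @_;

    my $codon_start = 3 * int(($pos - 1) / 3);   # 0-based codon offset
    die "SNP position $pos is outside a complete codon\n"
        if $pos < 1 || $codon_start + 3 > length $cds;

    # Build the reference codon and the codon carrying the variant base
    my $ref_codon = uc substr($cds, $codon_start, 3);
    my $alt_codon = $ref_codon;
    substr($alt_codon, ($pos - 1) % 3, 1) = uc $alt;

    my $ref_aa = $table->translate($ref_codon);
    my $alt_aa = $table->translate($alt_codon);

    # '*' is the stop symbol used by the codon table
    my $type = $ref_aa eq $alt_aa ? 'synonymous'
             : $alt_aa eq '*'     ? 'nonsense'
             :                      'non-synonymous';

    return ($type, $ref_aa, $alt_aa, $codon_start / 3 + 1);
}

# Toy CDS for illustration only: Met-Ala-Lys-stop
my $cds = 'ATGGCCAAGTGA';
for my $snp ([6, 'T'], [5, 'T'], [7, 'T']) {
    my ($type, $ref_aa, $alt_aa, $codon) = classify_snp($cds, @$snp);
    print "pos $snp->[0] -> $snp->[1]: $ref_aa$codon$alt_aa ($type)\n";
}
```

With the toy sequence, position 6 C->T leaves alanine unchanged
(synonymous), position 5 C->T turns alanine into valine
(non-synonymous), and position 7 A->T turns lysine into a stop codon
(nonsense). For real data you would take the CDS coordinates from the
gene's annotation and map each SNP's genomic position into them before
calling something like this.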