[Bioperl-l] Finding coding sequences corresponding to protein sequences

Mon Jan 22 18:46:13 UTC 2007

I have a set of gi numbers for protein sequences and I am trying to  
write a script that will return the corresponding nucleic acid coding  
sequences (CDS only) for each protein record.

Currently the script grabs the protein sequence SeqIO object from  
GENPEPT and scans through the primary tags looking for CDS.

Problem number 1 is that not all of the protein records have a CDS  
annotation that links to a nucleic acid sequence record (the first  
example in my set of records is gi1345813).  So, when that happens,  
is there a straightforward way to find all protein records that have  
amino acid sequences identical to the current one, aside from running  
BLAST and then parsing the sequence identifiers on the alignment of  
the first hit?

Problem number 2 is that sometimes the accession number for the  
corresponding nucleotide sequence in the CDS-tagged feature is a  
RefSeq entry, which causes my script to throw a fatal exception (the  
first example in my set of records is gi31543024).  Is there a  
straightforward way to get to a non-RefSeq record given a RefSeq  
accession number?  Also, is there some way of evaluating the call to  
GENBANK to determine if the record being requested is a RefSeq record  
and bypass the call?

I would appreciate pointers to relevant documentation or examples so  
I can learn more about the general issues that are involved, or  
specific advice on how to make this kind of a script work.

I'm using Bioperl 1.5.2 on a OS X server.

Thanks,

Warren Gallin