[Bioperl-l] Getting coding sequence starting with a protein record

Tue Apr 15 18:39:53 UTC 2014

I am having a problem finding a general method of recovering the nucleotide coding sequence for a protein sequence record.

In general, tracing the CDS annotation back to the nucleotide sequence record via the accession number of the nucleotide sequence works.

One problem arises when the underlying coding sequence is spliced from multiple nucleotide records.  Is there a general approach to automatically track down and join the sequence fragments from the different sequence entries?  An example of the problem can be seen if you start from the protein record with GI number 7715882.  It is annotated as coming from three different nucleotide records.  Is there an approach in BioPerl that will detect and download these three records and splice together the appropriate parts to get the coding sequence?
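To make the splicing step concrete, here is a minimal sketch in plain Python (not BioPerl) of what such an approach would have to do once it has the location string from the protein record's CDS /coded_by qualifier. The accessions, coordinates and sequences below are invented for illustration and are not those of GI 7715882. The function names are hypothetical, downloading the records is left out, and only simple start..end spans are handled.

```python
import re

# Hypothetical /coded_by value spanning three nucleotide records.
# Accessions and coordinates are made up for this example.
CODED_BY = "join(X00001.1:10..15,complement(X00002.1:3..5),X00003.1:1..3)"

# One segment: optional complement( wrapper, accession, start..end.
# Partial markers (< and >) are tolerated and ignored.
_SEGMENT = re.compile(r"(complement\()?([\w.]+):<?(\d+)\.\.>?(\d+)")

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_coded_by(location):
    """Return [(accession, start, end, strand), ...] in splice order."""
    loc = location.replace(" ", "")
    segments = [
        (m.group(2), int(m.group(3)), int(m.group(4)), -1 if m.group(1) else 1)
        for m in _SEGMENT.finditer(loc)
    ]
    if not segments:
        raise ValueError("no accession:start..end spans found in %r" % location)
    # complement(join(a,b)) is equivalent to join(complement(b),complement(a)).
    if loc.startswith("complement(join("):
        segments = [(acc, s, e, -strand) for acc, s, e, strand in reversed(segments)]
    return segments


def splice(segments, records):
    """Join the segments, given a dict mapping accession -> nucleotide string."""
    parts = []
    for acc, start, end, strand in segments:
        piece = records[acc][start - 1:end]  # GenBank coordinates are 1-based, inclusive
        if strand == -1:
            piece = piece.translate(_COMPLEMENT)[::-1]  # reverse complement
        parts.append(piece)
    return "".join(parts)


# Toy stand-ins for the three downloaded nucleotide records.
RECORDS = {
    "X00001.1": "CCCCCCCCCATGGCCTTT",
    "X00002.1": "GGTTCGG",
    "X00003.1": "TAAGGG",
}

if __name__ == "__main__":
    segs = parse_coded_by(CODED_BY)
    print(sorted({acc for acc, _, _, _ in segs}))  # accessions that would need fetching
    print(splice(segs, RECORDS))
```

The set of accessions returned by the parser is what would need to be downloaded; the open question is whether BioPerl already does the equivalent of this in one step.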

The other problem that I am having is the ongoing issue of protein records annotated as highly redundant sequences, with WP_XXXXXX accession numbers.  Has anyone found a way to retrieve the set of different nucleotide sequences that all encode a single WP-annotated protein sequence?

Any help would be appreciated,

Warren Gallin