[Bioperl-l] Protein Records without Sequence

Wed Jun 5 22:10:51 UTC 2013

OK, I see where this is coming from.  If I get a record without the protein sequence, I can detect that, retrieve the record again as a FASTA file, and put the sequence into the Bio::Seq object.
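A minimal sketch of that fallback, written in Python rather than BioPerl and meant only as an illustration. It assumes the NCBI E-utilities EFetch endpoint with db=protein and rettype "gp" or "fasta". The helper names (efetch_url, origin_sequence, fasta_sequence, protein_sequence) are made up for this example. In BioPerl the recovered string would then be set on the Bio::Seq object.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(uid, rettype):
    """Build an EFetch URL for one protein record ('gp' or 'fasta')."""
    params = {"db": "protein", "id": str(uid),
              "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_text(url):
    """Download a URL and return its body as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def origin_sequence(genpept_text):
    """Return the residues listed under ORIGIN, or '' if there are none."""
    residues = []
    in_origin = False
    for line in genpept_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Sequence lines mix position numbers, spaces and residues.
            residues.extend(c for c in line if c.isalpha())
    return "".join(residues).upper()


def fasta_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA string."""
    lines = [ln.strip() for ln in fasta_text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">"))


def protein_sequence(uid, fetch=fetch_text):
    """Return (genpept_text, sequence), refetching as FASTA if needed."""
    record = fetch(efetch_url(uid, "gp"))
    seq = origin_sequence(record)
    if not seq:
        # Hypothetical fallback: the flat file carried no residues.
        seq = fasta_sequence(fetch(efetch_url(uid, "fasta")))
    return record, seq
```

The fetch argument is injectable so the logic can be tried against canned text without touching the network.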

Do you have any idea how they are going to handle this one-to-many mapping?  Given that a single protein sequence may be linked to multiple different nucleotide sequences, even from different species, that means that a single protein sequence record may be tied to several different nucleotide sequence records.

However, when I look up a couple of WP_XXXXXX records, I get the protein sequence, but there is no DBSOURCE field and no "coded_by" tag in the feature table, so there is no direct way to find the underlying nucleotide sequence(s).

See WP_004062662.1  GI:490164010 as an example.
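For picking these records out of a batch, the accession format given in the release notes quoted below ("WP_ + 9 numerals + version number") can be checked with a simple pattern. This is only a sketch in Python; the function name is_wp_accession is invented for the example.

```python
import re

# "WP_" + nine digits + "." + version number, per the quoted release notes.
WP_ACCESSION = re.compile(r"WP_\d{9}\.\d+")


def is_wp_accession(accession):
    """True if the whole string looks like a versioned WP_ accession."""
    return WP_ACCESSION.fullmatch(accession) is not None
```

The example accession above, WP_004062662.1, fits this pattern.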

Is there any current way of retrieving the coding sequence starting from a protein record like this?

Warren Gallin

I am hoping that all of these fields that will now have multiple entries are going to be easily 
On 2013-06-05, at 1:58 PM, Hamish McWilliam <hamish.mcwilliam at bioinfo-user.org.uk> wrote:

> Hi Warren,
> 
> This is due to a change to the way some entries are handled in the
> latest RefSeq (protein) data. From the RefSeq release notes
> (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release59.txt):
> 
> ---
> 
> [1] ATTENTION: anticipated change in release 60 or 61
The RefSeq project is planning a significant expansion of the
> prokaryotic dataset. Specifically, NCBI's prokaryotic genome annotation
> pipeline is generating annotation for all submitted prokaryotic genomes
> representing strains, disease outbreak sequences, population sequencing,
> and diversity studies. These genomes include both complete genomes and
> WGS (draft) genomes submitted as 500 or fewer contigs. There will be a
> significant increase in the number of prokaryotic genomes and proteins
> provided in the RefSeq release.
> 
We will avoid providing redundant protein records by providing a
> single protein record when identical proteins can be annotated on more
> than one genome.
> 
> A new RefSeq protein accession prefix, WP_, will be used for these proteins.
> The accession has the format:
> 
>  WP_ + 9 numerals + version number, e.g., WP_000000001.1
> 
WP_ records will be 'protein-only' records.  When an identical protein
> is annotated on more than one bacterial genome record, the annotated
> CDS will point to the same WP_ accession.
> 
Thus, a given WP_ record may represent a protein found in more than
> one strain or in more than one bacterial 'species'.
> 
> A separate announcement with more details will be provided in the next
> few weeks.
> 
> ---
> 
> This is implemented using the GenBank 'CONTIG' field to point to the
> actual sequence entry from the various annotation entries. This means
> that the flat-file data no longer contains the sequence, but NCBI
> Entrez can construct the sequence when it is requested (e.g. when you
> ask for fasta), in a similar way to the handling of contig entries in
> GenBank and RefSeq (nucleotide).
> 
> All the best,
> 
> Hamish
> 
> On 5 June 2013 19:16, Warren Gallin <wgallin at ualberta.ca> wrote:
>> Hi,
>> 
>> I am encountering a problem with a number of protein records.
>> 
>> A HMMer search of the nr database returns a gi number and an associated sequence.
>> 
>> When I use that gi number to try to retrieve the full GENBANK record, however, there is no sequence returned with the record.
>> 
When I use the NCBI web interface with that gi number, the GENPEPT record is returned with no sequence, but when I select FASTA format the sequence is returned.
>> 
>> Examples of gi numbers for which this occurs are:
>> 
>> 23099847
>> 21224301
>> 68536697
>> 46580017
>> 77359109
>> 
>> Is this a flaw with the individual GENPEPT records?  In which case should I report it to NCBI?
>> 
Or are these some kind of "special" record that needs different parameters passed on the eutils search?
>> 
There is a workaround, I guess, where if the sequence comes back empty then a new retrieval of FASTA-formatted records can be run and the empty field in the GENPEPT record repopulated, but this seems inelegant.
>> 
>> All advice and/or commentary appreciated.
>> 
>> Warren Gallin
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 
> 
> -- 
> ----
> "Saying the internet has changed dramatically over the last five years
> is cliché – the internet is always changing dramatically" - Craig
> Labovitz, Arbor Networks.
>