[Bioperl-l] Protein Records without Sequence

Hamish McWilliam hamish.mcwilliam at bioinfo-user.org.uk
Wed Jun 5 19:58:51 UTC 2013


Hi Warren,

This is due to a change to the way some entries are handled in the
latest RefSeq (protein) data. From the RefSeq release notes
(ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release59.txt):

---

[1] ATTENTION: anticipated change in release 60 or 61
The RefSeq project is planning a significant expansion of the
prokaryotic dataset. Specifically, NCBI's prokaryotic genome
annotation pipeline is generating annotation for all submitted
prokaryotic genomes representing strains, disease outbreak sequences,
population sequencing, and diversity studies. These genomes include
both complete genomes and WGS (draft) genomes submitted as 500 or
fewer contigs. There will be a significant increase in the number of
prokaryotic genomes and proteins provided in the RefSeq release.

We will avoid providing redundant protein records by providing a
single protein record when identical proteins can be annotated on
more than one genome.

A new RefSeq protein accession prefix, WP_, will be used for these proteins.
The accession has the format:

  WP_ + 9 numerals + version number, e.g., WP_000000001.1

WP_ records will be 'protein-only' records.  When an identical protein
is annotated on more than one bacterial genome record, the annotated
CDS will point to the same WP_ accession.

Thus, a given WP_ record may represent a protein found in more than
one strain or in more than one bacterial 'species'.

A separate announcement with more details will be provided in the
next few weeks.

---

This is implemented using the GenBank 'CONTIG' field, which points
from each of the annotation entries to the entry that actually holds
the sequence. As a result the flat-file data no longer contains the
sequence itself, but NCBI Entrez can construct it on request (e.g.
when you ask for FASTA), much as it does for contig entries in
GenBank and RefSeq (nucleotide).
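
If it helps, here is a rough Python sketch of how a script could spot
such a record and fall back to FASTA. It is only an illustration under
a few assumptions: the SAMPLE record is made up (it reuses the example
accession from the release notes), the helper names are my own, and the
CONTIG parsing only reads the first line of the field. The URL follows
the usual NCBI EFetch form with rettype set to 'gp' or 'fasta'.

```python
import re
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Made-up GenPept-style record for illustration only: it has a CONTIG
# line but no ORIGIN block, like the entries described above.
SAMPLE = """\
LOCUS       XX_000000                120 aa            linear   BCT 01-JAN-2013
DEFINITION  hypothetical protein (made-up example).
CONTIG      join(WP_000000001.1:1..120)
//
"""


def efetch_url(uid, rettype):
    """Build an EFetch URL for a protein record ('gp' or 'fasta')."""
    query = urlencode(
        {"db": "protein", "id": uid, "rettype": rettype, "retmode": "text"}
    )
    return EFETCH + "?" + query


def has_sequence(flatfile):
    """Return True if an ORIGIN block with at least one sequence line exists."""
    in_origin = False
    for line in flatfile.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin and re.match(r"\s*\d+\s+[a-zA-Z]", line):
            return True
    return False


def contig_target(flatfile):
    """Return the first line of the CONTIG field, or None if there is none."""
    match = re.search(r"^CONTIG\s+(.+)$", flatfile, re.MULTILINE)
    return match.group(1).strip() if match else None


def fallback_url(uid, flatfile):
    """Return a FASTA EFetch URL when the flat file carries no sequence."""
    if has_sequence(flatfile):
        return None
    return efetch_url(uid, "fasta")
```

For one of the gi numbers in the original message,
fallback_url("23099847", record_text) would give the FASTA URL to fetch
whenever record_text has no ORIGIN block; the actual download step is
left out here.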

All the best,

Hamish

On 5 June 2013 19:16, Warren Gallin <wgallin at ualberta.ca> wrote:
> Hi,
>
> I am encountering a problem with a number of protein records.
>
> A HMMer search of the nr database returns a gi number and an associated sequence.
>
> When I use that gi number to try to retrieve the full GENBANK record, however, there is no sequence returned with the record.
>
> When I use the NCBI web interface with that gi number, the GENPEPT record returns with no sequence, but when I select FASTA format the sequence is returned.
>
> Examples of gi numbers for which this occurs are:
>
> 23099847
> 21224301
> 68536697
> 46580017
> 77359109
>
> Is this a flaw with the individual GENPEPT records?  In which case should I report it to NCBI?
>
> Or are these some kind of "special" record that needs different parameters passed to the eutils search?
>
> There is a workaround, I guess: if the sequence comes back empty, a new retrieval of FASTA-formatted records can be run and the empty field in the GENPEPT record repopulated, but this seems inelegant.
>
> All advice and/or commentary appreciated.
>
> Warren Gallin



-- 
----
"Saying the internet has changed dramatically over the last five years
is cliché – the internet is always changing dramatically" - Craig
Labovitz, Arbor Networks.



