[Bioperl-l] Protein Records without Sequence
Warren Gallin
wgallin at ualberta.ca
Wed Jun 5 22:10:51 UTC 2013
OK, I see where this is coming from. If I get a record without the protein sequence, I can detect that, retrieve the record again as a FASTA file, and put the sequence into the Bio::Seq object.
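
A minimal sketch of that fallback, written in Python rather than BioPerl purely to keep it dependency-free. The helper names (`efetch_url`, `genpept_sequence`, `fasta_sequence`, `sequence_with_fallback`) and the injected `fetch` callable are hypothetical, introduced only for illustration. The URL is the public NCBI E-utilities `efetch` endpoint; treating `rettype=gp` as the GenPept flat file and `rettype=fasta` as the FASTA view is how I understand the service, not something stated in this thread.

```python
from typing import Callable, Tuple
from urllib.parse import urlencode

# Public E-utilities efetch endpoint (assumed to be the retrieval route).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(gi: str, rettype: str) -> str:
    """Build an efetch URL for one protein id in the requested format."""
    query = urlencode(
        {"db": "protein", "id": gi, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def genpept_sequence(record: str) -> str:
    """Return the residues listed after ORIGIN, or '' if there are none."""
    residues = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Sequence lines mix position numbers, spaces and letters.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(residues).upper()


def fasta_sequence(text: str) -> str:
    """Concatenate the sequence lines of a single FASTA entry."""
    lines = [ln.strip() for ln in text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()


def sequence_with_fallback(
    gi: str, fetch: Callable[[str], str]
) -> Tuple[str, str]:
    """Fetch the GenPept record; if it has no sequence, refetch as FASTA.

    `fetch` is any callable that maps a URL to response text, so the
    network layer stays outside this sketch.
    """
    record = fetch(efetch_url(gi, "gp"))
    sequence = genpept_sequence(record)
    if not sequence:
        sequence = fasta_sequence(fetch(efetch_url(gi, "fasta")))
    return record, sequence
```

The annotation still comes from the GenPept record and only the residues come from the second request, which is the "repopulate the empty field" idea. Passing `fetch` in as a parameter means the same logic can be exercised against canned text before pointing it at NCBI.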
Do you have any idea how they are going to handle this one-to-many mapping? Given that a single protein sequence may be linked to multiple different nucleotide sequences, even from different species, that means that a single protein sequence record may be tied to several different nucleotide sequence records.
However, when I look up a couple of WP_XXXXXX records, I get the protein sequence, but there is no DBSOURCE field and no "coded_by" tag in the feature table, so there is no direct way to find the underlying nucleotide sequence(s).
See WP_004062662.1 GI:490164010 as an example.
Is there any current way of retrieving the coding sequence starting from a protein record like this?
Warren Gallin
I am hoping that all of these fields that will now have multiple entries are going to be easily
On 2013-06-05, at 1:58 PM, Hamish McWilliam <hamish.mcwilliam at bioinfo-user.org.uk> wrote:
> Hi Warren,
>
> This is due to a change to the way some entries are handled in the
> latest RefSeq (protein) data. From the RefSeq release notes
> (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release59.txt):
>
> ---
>
> [1] ATTENTION: anticipated change in release 60 or 61
> The RefSeq project is planning a significant expansion of the prokaryotic dataset.
> Specifically, NCBI's prokaryotic genome annotation pipeline is generating annotation for all
> submitted prokaryotic genomes representing strains, disease outbreak sequences, population
> sequencing, and diversity studies. These genomes include both complete genomes and WGS (draft)
> genomes submitted as 500 or fewer contigs. There will be a significant increase in the number
> of prokaryotic genomes and proteins provided in the RefSeq release.
>
> We will avoid providing redundant protein records by providing a single protein record when
> identical proteins can be annotated on more than one genome.
>
> A new RefSeq protein accession prefix, WP_, will be used for these proteins.
> The accession has the format:
>
> WP_ + 9 numerals + version number, e.g., WP_000000001.1
>
> WP_ records will be 'protein-only' records. When an identical protein is annotated on more
> than one bacterial genome record, the annotated CDS will point to the same WP_ accession.
>
> Thus, a given WP_ record may represent a protein found in more than one strain or in more
> than one bacterial 'species'.
>
> A separate announcement with more details will be provided in the next few weeks.
>
> ---
>
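
The accession format quoted from the release notes (WP_ + 9 numerals + version number) can be checked with a short pattern. This is a sketch in Python using only the standard library. The function name `is_wp_accession` is hypothetical, and the assumption that the version is a dot followed by one or more digits is taken from the single example WP_000000001.1, not from a formal specification.

```python
import re

# "WP_", exactly nine digits, a dot, then a numeric version (assumed form).
WP_ACCESSION = re.compile(r"WP_\d{9}\.\d+")


def is_wp_accession(accession: str) -> bool:
    """Return True if the string looks like a versioned WP_ accession."""
    return WP_ACCESSION.fullmatch(accession) is not None
```

A check like this makes it possible to route WP_ records to different handling before any retrieval is attempted, since they are the ones expected to map to more than one genome.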
> This is implemented using the GenBank 'CONTIG' field to point to the
> actual sequence entry from the various annotation entries. This means
> that the flat-file data no longer contains the sequence, but NCBI
> Entrez can construct the sequence when it is requested (e.g. when you
> ask for fasta), in a similar way to the handling of contig entries in
> GenBank and RefSeq (nucleotide).
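
A rough way to tell these records apart from a plain flat file, sketched in Python with only the standard library. The function names (`contig_expression`, `has_origin_residues`, `classify`) are hypothetical, and the sample record in the usage note is invented for illustration; the exact content of the CONTIG line on real protein records is an assumption here and should be checked against an actual entry.

```python
import re
from typing import Optional


def contig_expression(record: str) -> Optional[str]:
    """Return the text of the CONTIG field (joined across wrapped lines)."""
    parts = []
    collecting = False
    for line in record.splitlines():
        if line.startswith("CONTIG"):
            collecting = True
            parts.append(line[len("CONTIG"):].strip())
        elif collecting and line.startswith(" "):
            # Indented lines continue the CONTIG field.
            parts.append(line.strip())
        elif collecting:
            break
    return "".join(parts) if parts else None


def has_origin_residues(record: str) -> bool:
    """True if an ORIGIN block with at least one residue letter is present."""
    match = re.search(r"^ORIGIN.*?\n(.*?)^//", record, flags=re.M | re.S)
    return bool(match and re.search(r"[A-Za-z]", match.group(1)))


def classify(record: str) -> str:
    """Label a flat-file record as 'inline', 'contig' or 'no-sequence'."""
    if has_origin_residues(record):
        return "inline"
    if contig_expression(record) is not None:
        return "contig"
    return "no-sequence"
```

For example, an invented record such as `"LOCUS       DEMO\nCONTIG      join(WP_000000001.1:1..5)\n//\n"` is labelled `contig`, which is the case where a second request in FASTA format is needed to obtain the residues.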
>
> All the best,
>
> Hamish
>
> On 5 June 2013 19:16, Warren Gallin <wgallin at ualberta.ca> wrote:
>> Hi,
>>
>> I am encountering a problem with a number of protein records.
>>
>> A HMMer search of the nr database returns a gi number and an associated sequence.
>>
>> When I use that gi number to try to retrieve the full GENBANK record, however, there is no sequence returned with the record.
>>
>> When I use the NCBI web interface with that gi number, the GENPEPT record comes back with no sequence, but when I select FASTA format the sequence is returned.
>>
>> Examples of gi numbers for which this occurs are:
>>
>> 23099847
>> 21224301
>> 68536697
>> 46580017
>> 77359109
>>
>> Is this a flaw with the individual GENPEPT records? In which case should I report it to NCBI?
>>
>> Or are these some kind of "special" record that needs different parameters passed on the eutils search?
>>
>> There is a workaround, I guess: if the sequence comes back empty, a new retrieval of FASTA-formatted records can be run and the empty field in the GENPEPT record repopulated, but this seems inelegant.
>>
>> All advice and/or commentary appreciated.
>>
>> Warren Gallin
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
> --
> ----
> "Saying the internet has changed dramatically over the last five years
> is cliché – the internet is always changing dramatically" - Craig
> Labovitz, Arbor Networks.
>