[Biopython] Finding protein ID using Entrez.efetch

Peter biopython at maubp.freeserve.co.uk
Fri Aug 28 10:37:45 UTC 2009


> To: <biopython at biopython.org>
> Date: Fri, 28 Aug 2009 09:42:03 +0900 (KST)
> Subject: Finding protein ID using Entrez.efetch
>
> Hi all,
>
> I'm looking for a way to extract the protein ID numbers from a
> GenBank record. I got my GenBank data and saved it as an XML file
> using this command.
>
> from Bio import Entrez
> handle=Entrez.efetch(db="nuccore",id="256615878",rettype="gb")
> record=handle.read()
> save_file = open("record.xml","w")
> save_file.write(record)
> save_file.close()

That did NOT save the record in XML format. You asked NCBI
Entrez EFetch for a GenBank file (rettype="gb"), so the file
"record.xml" actually contains a plain-text GenBank record.
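
If you are ever unsure what EFetch gave you, the start of the text tells
you: a GenBank flat file begins with a LOCUS line, while XML output begins
with an XML declaration or a tag. Here is a minimal sketch of that check.
The helper name guess_entrez_format is made up for this example and is not
part of Biopython, and it only looks at the first characters.

```python
def guess_entrez_format(text):
    """Guess whether downloaded text is a GenBank flat file or XML.

    Hypothetical helper for illustration only; it inspects just the
    start of the text and does no real validation.
    """
    head = text.lstrip()
    if head.startswith("LOCUS"):
        # GenBank flat files open with a LOCUS line.
        return "genbank"
    if head.startswith("<"):
        # XML starts with a declaration (<?xml ...?>) or an opening tag.
        return "xml"
    return "unknown"
```

If you really do want XML, ask for it explicitly by adding retmode="xml"
to the EFetch call.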

> What I need is all the protein IDs (for example: EEU21068.1) or GI
> numbers (for example: 256615878) in this GenBank page for the BLAST
> search. Could you let me know how to extract this information, save
> it in some format, and use it?

If all you want is the accession, it is pointless to download
the entire record (with its features and sequence). Instead try:

>>> print(Entrez.efetch(db="nuccore", id="256615878", rettype="acc", retmode="text").read())
GG698814.1

Note that a nucleotide sequence doesn't have a protein ID of its own!
A nucleotide record for a single gene should have one associated
protein, while a genome record will have many associated proteins
(this seems to be what you want?).

If you really do want the GenBank file (e.g. for some other data),
then first save it and then parse it using Bio.SeqIO like this:

>>> from Bio import Entrez
>>> net_handle = Entrez.efetch(db="nuccore",id="256615878",rettype="gb")
>>> save_handle = open("record.gb", "w")
>>> save_handle.write(net_handle.read())
>>> save_handle.close()
>>> net_handle.close()

Then,

>>> from Bio import SeqIO
>>> record = SeqIO.read("record.gb", "gb")
>>> print(record.id)
GG698814.1

You can also look at the CDS features (proteins), and their
lists of protein ID(s) and database cross references:

>>> for feature in record.features:
...     if feature.type != "CDS":
...         continue
...     print(feature.qualifiers.get("protein_id", []),
...           feature.qualifiers.get("db_xref", []))
...
['EEU21067.1'] ['GI:256615879']
['EEU21068.1'] ['GI:256615880']
['EEU21069.1'] ['GI:256615881']
['EEU21070.1'] ['GI:256615882']
['EEU21071.1'] ['GI:256615883']
...
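
Since you also asked about saving these in some format, here is a rough
sketch that collects the protein IDs and GI numbers from the CDS features
and writes them to a tab-separated file. The function names and the output
layout are just my choices for this example, not anything Biopython
provides. It assumes each qualifier value is a list of strings (as in the
output above) and leaves the GI empty if a CDS has no "GI:" cross
reference.

```python
import csv


def cds_protein_ids(record):
    """Return (protein_id, gi) pairs for the CDS features of a record.

    Only record.features, feature.type and feature.qualifiers are used,
    so any SeqRecord-like object will do.
    """
    rows = []
    for feature in record.features:
        if feature.type != "CDS":
            continue
        # A CDS without a protein_id qualifier contributes nothing.
        for protein_id in feature.qualifiers.get("protein_id", []):
            gi = ""
            for xref in feature.qualifiers.get("db_xref", []):
                if xref.startswith("GI:"):
                    gi = xref[len("GI:"):]
                    break
            rows.append((protein_id, gi))
    return rows


def save_protein_ids(rows, filename):
    """Write the pairs to a tab-separated file, one protein per line."""
    with open(filename, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["protein_id", "gi"])
        writer.writerows(rows)
```

With the record parsed as above, you would call
save_protein_ids(cds_protein_ids(record), "protein_ids.tsv"), where the
file name is only an example. The first column can then be read back and
used as the list of queries for your BLAST search.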

However, if that is all you need, then it is a waste to download the
full GenBank file. Try using NCBI Entrez ELink instead, which looks up
the records in one database that are linked to a record in another:
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html
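
A sketch of the ELink route using Bio.Entrez might look like the
following. I have not run this against your record, so treat it as a
starting point. Entrez.elink and Entrez.read are real Biopython calls,
but the two function names and the email argument are made up for this
example. Note that ELink can return more than one kind of link, and this
collects the IDs from all of them, so you may want to filter on the
"LinkName" field.

```python
def linked_protein_ids(elink_result):
    """Collect the linked IDs from a parsed ELink result.

    elink_result is the list-of-dictionaries structure that
    Entrez.read() gives for an ELink reply.
    """
    ids = []
    for linkset in elink_result:
        for linksetdb in linkset.get("LinkSetDb", []):
            for link in linksetdb.get("Link", []):
                ids.append(str(link["Id"]))
    return ids


def fetch_linked_protein_ids(nucleotide_id, email):
    """Ask ELink for protein records linked to a nucleotide ID.

    Needs network access; the function name is hypothetical.
    """
    # Imported here so linked_protein_ids() works without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks you to identify yourself
    handle = Entrez.elink(dbfrom="nuccore", db="protein", id=nucleotide_id)
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return linked_protein_ids(result)
```

Calling fetch_linked_protein_ids("256615878", "you@example.org") with
your own email address should then give the IDs of the linked protein
records without downloading any sequence data.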

Peter



