[BioPython] Problems parsing proteins

Thu May 6 12:03:25 EDT 2004

Hi Julio;

[The following NCBIDictionary code gives errors]:
> from Bio import GenBank
> rParser = GenBank.FeatureParser()
> rDict = GenBank.NCBIDictionary(database='protein', parser=rParser)
> rDict[33990955]
>
> If you try in a browser:
> 
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=33990955
> 
> It works (so the id 33990955 is available).

The long and short of it is that NCBIDictionary uses old Entrez
retrieval code, which really doesn't work reliably any longer.
Getting just about anything from NCBI should be done using the
new EUtils interface, which is supported and much more reliable.

So, I've updated the code for NCBIDictionary, search_for and download_many
in Bio.GenBank to use the EUtils interfaces. This is checked into
CVS, and will be available in the next release (which I hope to make
soon).

To get working code for yourself, you can do one of two things:

1. Update Biopython from anonymous CVS (see http://cvs.biopython.org
for instructions) and use the new code.

2. Use the EUtils interface, which is in the 1.24 release. To
retrieve using EUtils, your code would need to look like:

# get the record
from Bio.EUtils import DBIds, DBIdsClient
db_ids = DBIds("protein", ["33990955"])
eutils_client = DBIdsClient.from_dbids(db_ids)
result_handle = eutils_client.efetch(retmode = "text", rettype = "gp")

# parse the result
from Bio import GenBank
parser = GenBank.FeatureParser()
rec = parser.parse(result_handle)
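
For reference, the retrieval step above amounts to a single HTTP request
to NCBI's EFetch service. Below is a minimal sketch of how such a request
URL could be built with only the standard library. The helper name
build_efetch_url is hypothetical, the endpoint is the EFetch address as
NCBI documents it today (it may not be the exact URL the 1.24 code
contacted), and the sketch assumes Python 3. It only builds the URL and
does not contact NCBI.

```python
from urllib.parse import urlencode

# EFetch endpoint as currently documented by NCBI (an assumption here;
# not necessarily the address used by Bio.EUtils in the 1.24 release).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, ids, rettype="gp", retmode="text"):
    """Build an EFetch URL for one or more record ids (hypothetical helper).

    db      -- Entrez database name, e.g. "protein"
    ids     -- iterable of record ids (strings or integers)
    rettype -- record format; "gp" requests GenPept flat files
    retmode -- "text" requests plain text rather than XML or HTML
    """
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),  # EFetch takes comma-separated ids
        "rettype": rettype,
        "retmode": retmode,
    }
    return EFETCH_URL + "?" + urlencode(params)


# Same database, id and formats as the EUtils example above.
url = build_efetch_url("protein", [33990955])
print(url)
```

Opening the resulting URL (for instance with urllib.request.urlopen)
should give a text stream in GenPept format, the same kind of data that
result_handle supplies to the parser above.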

Thanks for the report. I hope one of the two solutions works for
you.
Brad