[Biopython-dev] Python 3 and encoding for online resources

Tue Aug 3 15:44:49 UTC 2010

Have you tried looking at handle.info(), where handle is the handle returned by urllib.urlopen()? Another candidate is handle.getcode(). Failing that, we could contact NCBI to ask whether their error messages can be returned in a standard format, or at least in a format consistent with the request. Alternatively, we could simply not parse the HTML error message at all; the SeqIO/Entrez parsers will notice a format problem and raise an exception anyway.
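
As a rough sketch of what I mean by checking handle.info() and handle.getcode(): the snippet below is written for Python 3, where info() returns an email.message.Message subclass. The helper names and the FakeHandle class are made up for illustration (FakeHandle only exists so the example runs offline), and I have not verified that NCBI actually labels its error pages as text/html, so treat that as an assumption.

```python
from email.message import Message


def response_summary(handle):
    """Return (status, content_type) for an object returned by urlopen().

    Uses only handle.getcode() and handle.info().
    """
    status = handle.getcode()
    content_type = handle.info().get_content_type()
    return status, content_type


def looks_wrong(handle, expected_type):
    """True if the status is not 200 or the Content-Type is not the expected one."""
    status, content_type = response_summary(handle)
    return status != 200 or content_type != expected_type


class FakeHandle:
    """Hypothetical stand-in for a urlopen() result (not part of any library)."""

    def __init__(self, status, content_type):
        self._status = status
        self._headers = Message()
        self._headers["Content-Type"] = content_type

    def getcode(self):
        return self._status

    def info(self):
        return self._headers


# Assumed scenario: an HTML page served with status 200 although
# plain text (mode=text) was requested.
html_error = FakeHandle(200, "text/html; charset=UTF-8")
print(looks_wrong(html_error, "text/plain"))
```

If the Content-Type turns out to be reliable, a check like this would let us raise an exception without reading or parsing the body at all.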

--Michiel.

--- On Tue, 8/3/10, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython-dev] Python 3 and encoding for online resources
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
> Date: Tuesday, August 3, 2010, 10:07 AM
> Peter wrote:
> >Michiel wrote:
> >> So I would suggest to switch from urllib to urllib2 in Bio.Entrez and catch
> >> any HTTP errors (urllib2 is translated appropriately by 2to3),
> >
> > That sounds very sensible.
> >
> 
> Hi Michiel,
> 
> I see you've switched from urllib to urllib2, but you also removed all
> the NCBI specific error handling (which it turns out would need to be
> updated).
> 
> I just tried a simple history example and if you deliberately use a
> wrong webenv you get an HTML error page back (from memory
> and the comments in our code it used to be a plain text error page):
> 
> <html>
> <body>
> <br/><h2>Error occurred: Unable to obtain query #1</h2><br/><ul title="some params from request:">
> <li>db=pubmed</li>
> <li>query_key=1</li>
> <li>report=medline</li>
> <li>dispstart=0</li>
> <li>dispmax=10</li>
> <li>mode=text</li>
> <li>WebEnv=wrong</li>
> </ul>
> <br/><b>pmfetch need params:</b><br/><br/>
> <li>(id=NNNNNN[,NNNN,etc]) or (query_key=NNN, where NNN - number in the history, 0 - clipboard content for current database)</li>
> <li>db=db_name (mandatory)</li>
> <li>report=[docsum, brief, abstract, citation, medline, asn.1, mlasn1, uilist, sgml, gen] (Optional; default is asn.1)</li>
> <li>mode=[html, file, text, asn.1, xml] (Optional; default is html)</li>
> <li>dispstart - first element to display, from 0 to count - 1, (Optional; default is 0)</li>
> <li>dispmax - number of items to display (Optional; default is all elements, from dispstart)</li>
> <br/>See <a href="http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html">help</a>.</body>
> </html>
> 
> The old code could handle this just by looking for "Error occurred".
> 
> Anyway, this demonstrates that we can't just assume any error will
> be handled by the NCBI as an HTTP error code and thus get
> turned into an exception automatically by urllib2. In this particular
> case, one might argue the NCBI should use HTTP status code
> 400 Bad Request.
> 
> I think we should write some online tests for Bio.Entrez
> including error conditions like this.
> 
> In a related example, I'm adding a sleep statement between
> my ESearch and EFetch calls in order to let the session time out.
> I'll post back once I know what it does - but I'll be pleasantly
> surprised if they do something like HTTP status code 410 Gone;
> I'm expecting another HTML error page.
> 
> Regards,
> 
> Peter
>
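
For comparison, the check Peter describes (looking for "Error occurred" near the start of the response) could be sketched roughly as follows. This is not the code that was removed from Bio.Entrez; the function name, the IOError choice and the 1024-byte peek size are assumptions for illustration, and the sketch buffers the whole response in memory so that the caller still gets a handle starting at the first byte.

```python
import io


def check_for_ncbi_error(handle, marker=b"Error occurred", peek_size=1024):
    """Raise IOError if the start of the response contains the marker.

    Otherwise return a new in-memory handle holding the complete data,
    so downstream parsers can read it from the beginning.
    """
    head = handle.read(peek_size)
    if marker in head:
        # Assumption: an IOError is an acceptable way to report this.
        raise IOError("NCBI returned an error page: %r" % head[:80])
    # Buffer the remainder; fine for a sketch, costly for large downloads.
    return io.BytesIO(head + handle.read())


# Offline demonstration using the opening of the HTML page quoted above.
page = b"<html>\n<body>\n<br/><h2>Error occurred: Unable to obtain query #1</h2>"
try:
    check_for_ncbi_error(io.BytesIO(page))
except IOError as err:
    print("caught:", err)
```

The obvious weakness is that it depends on NCBI keeping that exact wording, which is why a status code or Content-Type check would be preferable if it works.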