[Biopython-dev] Bio.WWW.ExPASy

Michiel De Hoon mdehoon at c2b2.columbia.edu
Fri Nov 30 04:00:24 UTC 2007

> > Bio.WWW.ExPASy contains six functions:
> >
> > get_prodoc_entry  Interface to the get-prodoc-entry CGI script.
> > get_prosite_entry Interface to the get-prosite-entry CGI script.
> > get_prosite_raw   Interface to the get-prosite-raw CGI script.
> > get_sprot_raw     Interface to the get-sprot-raw CGI script.
> > sprot_search_ful  Interface to the sprot-search-ful CGI script.
> > sprot_search_de   Interface to the sprot-search-de CGI script.
> >
> > plus an internally used function _open....
> >
> > Any comments?
> Is it worth adding "download many" functions like the one in GenBank?
> If the web API doesn't let us download a list of records by ID, then
> we might need some sort of handle wrapper to download them one by one
> - that might be too complicated.

For now, I'd like to focus just on the existing functions.

> Also, regarding handling HTML error pages, you could reuse the simple
> code in Bio/WWW/NCBI.py which hunts for certain HTML errors.  Or more
> simply, try and spot when we get an HTML file instead of plain text?
> There is a good case here for a shared "open a URL function" which can
> spot HTML error pages.

Some of the functions in Bio.WWW.ExPASy return the records as an HTML page,
so just checking whether the returned file is HTML won't suffice to detect
non-existing keys. The following table shows, for each function, the format
of the record and the format of the error response:
Function           Record  Error
get_prodoc_entry   HTML    HTML
get_prosite_entry  HTML    HTML
get_prosite_raw    Raw     Nothing
get_sprot_raw      Raw     HTML
sprot_search_ful   HTML    HTML
sprot_search_de    HTML    HTML
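
To make the point of the table concrete, here is a minimal sketch that encodes
it as a dictionary and asks for which functions an "is this HTML?" test could
tell a record from an error. The names FORMATS, html_check_suffices and
empty_check_suffices are made up for illustration and are not part of
Biopython.

```python
# Hypothetical encoding of the table above:
# function name -> (format of the record, format of the error response).
FORMATS = {
    "get_prodoc_entry": ("HTML", "HTML"),
    "get_prosite_entry": ("HTML", "HTML"),
    "get_prosite_raw": ("Raw", "Nothing"),
    "get_sprot_raw": ("Raw", "HTML"),
    "sprot_search_ful": ("HTML", "HTML"),
    "sprot_search_de": ("HTML", "HTML"),
}


def html_check_suffices(name):
    """True if spotting HTML is enough to recognize an error for this function."""
    record_format, error_format = FORMATS[name]
    # Only works when errors are HTML but valid records are not.
    return record_format != "HTML" and error_format == "HTML"


def empty_check_suffices(name):
    """True if an empty response is what signals a missing key."""
    return FORMATS[name][1] == "Nothing"


detectable = sorted(name for name in FORMATS if html_check_suffices(name))
print(detectable)
```

Under these assumptions, a plain HTML test would only help for get_sprot_raw;
get_prosite_raw needs an empty-response test, and the other four need to look
inside the HTML.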

Now, there are _extract_record functions in Bio.Prosite and
Bio.Prosite.Prodoc that take the output from get_prodoc_entry and
get_prosite_entry and fish out the record from the HTML. One possibility
would be to let the _extract_record function check for HTML error pages; this
function raises "ValueError: No data found in web page." if no data is found
in the HTML. But if we call the appropriate _extract_record from
get_prodoc_entry and get_prosite_entry, then these two functions become very
close to what is already in Bio.Prosite.ExPASyDictionary and
Bio.Prosite.Prodoc.ExPASyDictionary. The ExPASyDictionaries use
get_prodoc_entry and get_prosite_entry to access ExPASy, and then extract the
record. So, if we call _extract_record inside get_prodoc_entry and
get_prosite_entry, then Bio.WWW.ExPASy.get_prodoc_entry(key) is virtually the
same as Bio.Prosite.Prodoc.ExPASyDictionary[key], and the same for Prosite.
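
As a rough illustration of what such an extraction step does, here is a
minimal sketch. It is not the actual _extract_record code; it assumes, purely
for illustration, that the record sits inside a <pre> block in the returned
page. Only the error message is taken from the existing behavior.

```python
import re

# Assumption for this sketch: the record is wrapped in <pre>...</pre>.
_PRE_BLOCK = re.compile(r"<pre[^>]*>(.*?)</pre>", re.IGNORECASE | re.DOTALL)


def extract_record(html):
    """Return the text of the first non-empty <pre> block, or raise ValueError."""
    match = _PRE_BLOCK.search(html)
    if match is None or not match.group(1).strip():
        # Error pages carry no record, so they end up here.
        raise ValueError("No data found in web page.")
    return match.group(1).strip()
```

With this, a page containing a record yields the record text, and an HTML
error page without one raises the ValueError, which is how a missing key
would be noticed.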

So I think we have two options:
1) Bio.ExPASy contains the low-level functions to access ExPASy. No error
checking whatsoever; the calling function is responsible for making sure
that there is actually a record contained in the results.
Bio.Prosite.ExPASyDictionary, Bio.Prosite.Prodoc.ExPASyDictionary, and
Bio.SwissProt.Sprot.ExPASyDictionary contain the high-level functions to
access ExPASy, which do the error checking and extract the record from the
results.
2) Make only the high-level functions available, to make sure the error
checking is always done.

My preference is 1).
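
To show how the split in option 1) might look, here is a minimal sketch. The
names get_entry, EntryDictionary, FAKE_SERVER and ERROR_PAGE are hypothetical,
and a dictionary of canned pages stands in for the ExPASy server so the
example runs without network access. The <pre> layout is likewise an
assumption.

```python
import io

# Stand-in for the remote server: key -> page content (made-up data).
FAKE_SERVER = {"PS00001": "<html><pre>ID   ASN_GLYCOSYLATION</pre></html>"}
ERROR_PAGE = "<html><body>No entry found.</body></html>"


def get_entry(key):
    """Low level: return a handle to whatever came back; no error checking."""
    return io.StringIO(FAKE_SERVER.get(key, ERROR_PAGE))


class EntryDictionary:
    """High level: fetch the page, check for errors, extract the record."""

    def __getitem__(self, key):
        html = get_entry(key).read()
        start = html.find("<pre>")
        end = html.find("</pre>")
        if start == -1 or end == -1 or end < start:
            # No record in the page: treat it as a missing key.
            raise KeyError(key)
        return html[start + len("<pre>"):end].strip()
```

The low-level function hands back the error page as-is for an unknown key,
while the dictionary-style wrapper turns it into a KeyError; callers who want
the check use the wrapper, and callers who want the raw page use the function.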


Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

