[Biopython-dev] Bio.WWW.ExPASy

Fri Nov 30 04:00:24 UTC 2007

> > Bio.WWW.ExPASy contains six functions:
> >
> > get_prodoc_entry  Interface to the get-prodoc-entry CGI script.
> > get_prosite_entry Interface to the get-prosite-entry CGI script.
> > get_prosite_raw   Interface to the get-prosite-raw CGI script.
> > get_sprot_raw     Interface to the get-sprot-raw CGI script.
> > sprot_search_ful  Interface to the sprot-search-ful CGI script.
> > sprot_search_de   Interface to the sprot-search-de CGI script.
> >
> > plus an internally used function _open....
> >
> > Any comments?
>
> Is it worth adding "download many" functions like the one in GenBank?
> If the web API doesn't let us download a list of records by ID, then
> we might need some sort of handle wrapper to download them one by one
> - that might be too complicated.

For now, I'd like to focus just on the existing functions.

> Also, regarding handling HTML error pages, you could reuse the simple
> code in Bio/WWW/NCBI.py which hunts for certain HTML errors.  Or more
> simply, try and spot when we get an HTML file instead of plain text?
> There is a good case here for a shared "open a URL function" which can
> spot HTML error pages.

Some of the functions in Bio.WWW.ExPASy return the records as an HTML page,
so just checking whether the returned file is HTML won't suffice to detect
non-existent keys. The following table shows, for each function, the format
of a returned record and the format of an error response:

Function           Record  Error
get_prodoc_entry   HTML    HTML
get_prosite_entry  HTML    HTML
get_prosite_raw    Raw     Nothing
get_sprot_raw      Raw     HTML
sprot_search_ful   HTML    HTML
sprot_search_de    HTML    HTML
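
To make that concrete, here is a rough sketch (not existing Biopython code)
that encodes the record/error formats listed above and asks for which
functions an "is it HTML?" test would actually tell a record from an error.
The names FORMATS, looks_like_html, html_check_suffices and
empty_check_suffices are made up for illustration, and the HTML sniffing is
deliberately crude.

```python
# (record format, error format) per function, as listed in this message.
FORMATS = {
    "get_prodoc_entry": ("HTML", "HTML"),
    "get_prosite_entry": ("HTML", "HTML"),
    "get_prosite_raw": ("Raw", "Nothing"),
    "get_sprot_raw": ("Raw", "HTML"),
    "sprot_search_ful": ("HTML", "HTML"),
    "sprot_search_de": ("HTML", "HTML"),
}


def looks_like_html(text):
    """Crude sniffing (an assumption): does the payload start like HTML?"""
    head = text.lstrip()[:200].lower()
    return head.startswith(("<!doctype html", "<html"))


def html_check_suffices(name):
    """True if records are raw text but errors come back as HTML."""
    record, error = FORMATS[name]
    return record == "Raw" and error == "HTML"


def empty_check_suffices(name):
    """True if an error shows up as an empty response."""
    record, error = FORMATS[name]
    return error == "Nothing"


html_ok = [name for name in FORMATS if html_check_suffices(name)]
empty_ok = [name for name in FORMATS if empty_check_suffices(name)]
```

With the formats above, the HTML test only helps for get_sprot_raw and an
empty-response test only helps for get_prosite_raw; the other four functions
need to look inside the HTML.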

Now, there are _extract_record functions in Bio.Prosite and
Bio.Prosite.Prodoc that take the output from get_prosite_entry and
get_prodoc_entry, respectively, and fish the record out of the HTML. One
possibility would be to let the _extract_record function check for HTML error
pages; this function raises "ValueError: No data found in web page." if no
data is found in the HTML. But if we call the appropriate _extract_record
from get_prodoc_entry and get_prosite_entry, then these two functions become
very close to what is already in Bio.Prosite.ExPASyDictionary and
Bio.Prosite.Prodoc.ExPASyDictionary. The ExPASyDictionaries use
get_prosite_entry and get_prodoc_entry to access ExPASy, and then extract the
record. So, if we call _extract_record inside get_prodoc_entry and
get_prosite_entry, then Bio.WWW.ExPASy.get_prodoc_entry(key) is virtually the
same as Bio.Prosite.Prodoc.ExPASyDictionary[key], and likewise for Prosite.
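
For illustration only, the kind of check I mean looks roughly like the sketch
below. This is not the actual _extract_record code: the function name
extract_record is hypothetical, and I am assuming here that the record sits
inside a <pre> element of the page. Only the error message is the one quoted
above.

```python
import re

# Assumption for this sketch: the record is wrapped in <pre>...</pre>.
_PRE = re.compile(r"<pre[^>]*>(.*?)</pre>", re.IGNORECASE | re.DOTALL)


def extract_record(html):
    """Return the record text found in an HTML page.

    Raises ValueError if the page contains no (non-empty) record, which is
    what an HTML error page would look like under the assumption above.
    """
    match = _PRE.search(html)
    if match is None or not match.group(1).strip():
        raise ValueError("No data found in web page.")
    return match.group(1).strip()
```

Calling extract_record on an error page then fails loudly instead of handing
an HTML error message to a parser.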

So I think we have two options:
1) Bio.ExPASy contains the low-level functions to access ExPASy. No error
checking whatsoever; the calling function is responsible for making sure 
that there is actually a record contained in the results.
Bio.Prosite.ExPASyDictionary, Bio.Prosite.Prodoc.ExPASyDictionary,
Bio.SwissProt.Sprot.ExPASyDictionary contain the high-level functions to
access ExPASy, which do the error checking and extract the record from the
HTML.
2) Make only the high-level functions available, to make sure the error
checking is always done.

My preference is 1).
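
As a rough sketch of how the split in option 1 could look: a low-level
function just returns a handle with whatever the server sent, and a
dictionary-like wrapper does the checking and extraction. Everything here is
hypothetical (RecordDictionary, fake_fetch, extract and the sample key are
made-up names, and fake_fetch stands in for a real call to ExPASy so the
example runs without network access); it is not the existing
ExPASyDictionary code.

```python
import io


def fake_fetch(key):
    """Low-level stand-in: returns a handle, does no error checking."""
    pages = {"PDOC00001": "<pre>record text</pre>"}
    # An unknown key still yields a page, just one without a record.
    return io.StringIO(pages.get(key, "<html>error</html>"))


def extract(page):
    """Pull the record out of the page, or raise ValueError if absent."""
    start, end = page.find("<pre>"), page.find("</pre>")
    if start == -1 or end == -1:
        raise ValueError("No data found in web page.")
    return page[start + len("<pre>"):end]


class RecordDictionary:
    """High-level access: fetch, check for errors, extract the record."""

    def __init__(self, fetch, extract_record):
        self._fetch = fetch
        self._extract = extract_record

    def __getitem__(self, key):
        handle = self._fetch(key)
        try:
            return self._extract(handle.read())
        except ValueError:
            # Report a missing record the way a dictionary would.
            raise KeyError(key) from None


records = RecordDictionary(fake_fetch, extract)
```

The caller who wants raw access uses the low-level function and takes
responsibility for checking the result; everybody else goes through the
dictionary and gets a KeyError for a non-existent key.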

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

Peter