[Biopython-dev] New BLAST web page

Brad Chapman chapmanb at arches.uga.edu
Sun Mar 18 09:23:43 EST 2001


Hi Jeff;
Thanks for helping me out with this! 

> Looks like our friends at NCBI have changed
> the BLAST pages again and added a bunch of new options.  Looking at the
> page source for the CGI page, they seem to have added a bunch of hidden
> fields that the CGI script isn't happy to be without:

Oooh, tricky, hidden fields. Thanks for the pointer on these -- it gave
me the push I needed to get past the point where I was stuck.
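
For the record, those hidden fields boil down to a few extra parameters
that get sent along with the query. Roughly what the patch does (just a
sketch pulled out of the attached diff, with a made-up helper name):

    def add_hidden_fields(variables, program):
        """Fill in the hidden fields the new Blast.cgi won't run without."""
        variables['CLIENT'] = 'web'
        variables['SERVICE'] = 'plain'
        variables['CMD'] = 'Put'
        # PAGE depends on which flavor of BLAST is being run
        program = program.upper()
        if program == 'BLASTN':
            variables['PAGE'] = 'Nucleotides'
        elif program == 'BLASTP':
            variables['PAGE'] = 'Proteins'
        elif program in ('BLASTX', 'TBLASTN', 'TBLASTX'):
            variables['PAGE'] = 'Translations'
        else:
            raise ValueError("Unexpected program name %s" % program)
        return variables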

> This gets it past the first page, and now gets to the page that tells you
> to wait for the results.  However, it's been on that page for a while, so
> I don't know if this is completely going to work, or if NCBI is just slow
> now!

Well, it didn't completely finish things off, but getting past the point
where I just got the same query page back was all I needed :-). Quite a
few things have changed in parsing the result pages. Since they are now
using JavaScript (bleah!), some of the information is in new places.
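
To give you an idea, the "please wait" page now buries the URL to poll
inside a javascript snippet in an HTML comment, something like:

    setTimeout('location.href = "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&RID=...&REFRESH_DELAY=106&...";',106000);

The patch digs the refresh cgi and the delay out of that along these
lines (simplified from the attached diff; the function name is just for
illustration):

    import re

    _refresh_re = re.compile(r'REFRESH_DELAY=(\d+)', re.IGNORECASE)

    def find_refresh_info(comment):
        """Pull the refresh cgi string and delay out of the comment."""
        href_string = 'location.href = "'
        start = comment.find(href_string)
        assert start != -1, "Unable to find the start of the refresh cgi."
        start = start + len(href_string)
        end = comment.find('"', start)
        assert end != -1, "Unable to find the end of the refresh cgi."
        refresh_cgi = comment[start:end]
        m = _refresh_re.search(refresh_cgi)
        assert m, "Failed to parse refresh time from %s" % refresh_cgi
        return refresh_cgi, int(m.group(1))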

But I think I've got it all sorted out and have things working
again. Attached is a patch against the current CVS which seems to
get NCBIWWW.blast working for me again. If it seems to work well for
other people, I'll be happy to check it in.

Also included in the patch is a change to the WWW parser. It looks like
the format has changed yet again -- they now appear to be putting the
database information before the Query= line. The new version seems to
parse the new output correctly, and it still passes all of the tests
against the old versions.
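
In case it helps to see the shape of that change without wading through
the diff, the header scanning now goes roughly like this (simplified --
the blast form and <PRE> handling are left out, and I'm assuming the
usual Bio.ParserSupport helpers):

    from Bio.ParserSupport import attempt_read_and_call, read_and_call, \
         read_and_call_until

    def scan_header_sketch(scanner, uhandle, consumer):
        # if the database info shows up before the Query= line (the new
        # 2.1.2 style), grab it there and remember that we did
        database_read = 0
        if attempt_read_and_call(uhandle, consumer.noevent, start='<p>'):
            scanner._scan_database_info(uhandle, consumer)
            # skip ahead to the <BR> right before the Query= lines
            read_and_call_until(uhandle, consumer.noevent, start='<BR>')
            read_and_call(uhandle, consumer.noevent, start='<BR>')
            database_read = 1
        # the Query= lines follow in either case
        read_and_call(uhandle, consumer.query_info, contains='Query=')
        read_and_call_until(uhandle, consumer.query_info, blank=1)
        # only scan the database here if we didn't already see it above
        if not database_read:
            scanner._scan_database_info(uhandle, consumer)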

So, with this patch it seems like we work with NCBI Blast again. Whew!
I hope it works right for everyone else too.

Thanks again for the help, Jeff!

Brad

-------------- next part --------------
*** NCBIWWW.py.orig	Sat Feb 10 08:32:39 2001
--- NCBIWWW.py	Sun Mar 18 09:09:05 2001
***************
*** 153,164 ****
--- 153,195 ----
          # Brad Chapman noticed a '<p>' line in BLASTN 2.1.1
          attempt_read_and_call(uhandle, consumer.noevent, start='<p>')
  
+         # 2.1.2 has the database info and blast form right after the RID
+         database_read = 0
+         if attempt_read_and_call(uhandle, consumer.noevent, start = '<p>'):
+             self._scan_database_info(uhandle, consumer)
+             # read until we get to a <BR> before the Query=
+             read_and_call_until(uhandle, consumer.noevent, start = '<BR>')
+             read_and_call(uhandle, consumer.noevent, start = '<BR>')
+             database_read = 1
+ 
          # Read the Query lines and the following blank line.
          read_and_call(uhandle, consumer.query_info, contains='Query=')
          read_and_call_until(uhandle, consumer.query_info, blank=1)
          read_and_call_while(uhandle, consumer.noevent, blank=1)
  
          # Read the database lines and the following blank line.
+         # only read the database if it hasn't already been read
+         if not(database_read):
+             self._scan_database_info(uhandle, consumer)
+ 
+             # Read the blast form, if it exists. 
+             if attempt_read_and_call(uhandle, consumer.noevent,
+                                      contains='BLASTFORM'):
+                 read_and_call_until(uhandle, consumer.noevent, blank=1)
+             elif attempt_read_and_call(uhandle, consumer.noevent,
+                                        start='<PRE>'):
+                 read_and_call_until(uhandle, consumer.noevent, blank=1)
+         # otherwise we'll need to scan a <PRE> tag
+         else:
+             read_and_call(uhandle, consumer.noevent, start = '<PRE>')
+             
+ 
+         # Read the blank lines until the next section.
+         read_and_call_while(uhandle, consumer.noevent, blank=1)
+ 
+         consumer.end_header()
+ 
+     def _scan_database_info(self, uhandle, consumer):
          attempt_read_and_call(uhandle, consumer.noevent, start='<p>')
          read_and_call(uhandle, consumer.database_info, contains='Database')
          read_and_call(uhandle, consumer.database_info, contains='sequences')
***************
*** 166,183 ****
          read_and_call(uhandle, consumer.noevent,
                        contains='problems or questions')
  
-         # Read the blast form, if it exists. 
-         if attempt_read_and_call(uhandle, consumer.noevent,
-                                  contains='BLASTFORM'):
-             read_and_call_until(uhandle, consumer.noevent, blank=1)
-         elif attempt_read_and_call(uhandle, consumer.noevent, start='<PRE>'):
-             read_and_call_until(uhandle, consumer.noevent, blank=1)
- 
-         # Read the blank lines until the next section.
-         read_and_call_while(uhandle, consumer.noevent, blank=1)
- 
-         consumer.end_header()
- 
      def _scan_rounds(self, uhandle, consumer):
          self._scan_descriptions(uhandle, consumer)
          self._scan_alignments(uhandle, consumer)
--- 197,202 ----
***************
*** 530,566 ****
  
          consumer.end_parameters()
  
  
! def blast(program, datalib, sequence,
!           input_type='Sequence in FASTA format',
!           double_window=None, gi_list='(None)',
!           list_org = None, expect='10',
!           filter='L', genetic_code='Standard (1)',
!           mat_param='PAM30     9       1',
!           other_advanced=None, ncbi_gi=None, overview=None,
!           alignment_view='0', descriptions=None, alignments=None,
!           email=None, path=None, html=None, 
!           cgi='http://www.ncbi.nlm.nih.gov/blast/blast.cgi',
!           timeout=20
!           ):
!     """blast(program, datalib, sequence,
!     input_type='Sequence in FASTA format',
!     double_window=None, gi_list='(None)',
!     list_org = None, expect='10',
!     filter='L', genetic_code='Standard (1)',
!     mat_param='PAM30     9       1',
!     other_advanced=None, ncbi_gi=None, overview=None,
!     alignment_view='0', descriptions=None, alignments=None,
!     email=None, path=None, html=None, 
!     cgi='http://www.ncbi.nlm.nih.gov/blast/blast.cgi',
!     timeout=20) -> handle
! 
!     Do a BLAST search against NCBI.  Returns a handle to the results.
!     timeout is the number of seconds to wait for the results before timing
!     out.  The other parameters are provided to BLAST.  A description
!     can be found online at:
!     http://www.ncbi.nlm.nih.gov/BLAST/newoptions.html
  
      """
      # NCBI Blast is hard to work with.  The user enters a query, and then
      # it returns a "reference" page which contains a button that the user
--- 549,605 ----
  
          consumer.end_parameters()
  
+ def blast(program, database, query,
+           entrez_query = '(none)',
+           filter = 'L',
+           expect = '10',
+           word_size = None,
+           ungapped_alignment = 'no',
+           other_advanced = None,
+           cdd_search = 'on',
+           composition_based_statistics = None,
+           matrix_name = None,
+           run_psiblast = None,
+           i_thresh = '0.001',
+           genetic_code = '1',
+           show_overview = 'on',
+           ncbi_gi = 'on',
+           format_object = 'alignment',
+           format_type = 'html',
+           descriptions = '100',
+           alignments = '50',
+           alignment_view = 'Pairwise',
+           auto_format = 'on',
+           cgi='http://www.ncbi.nlm.nih.gov/blast/Blast.cgi',
+           timeout = 20):
+     """Blast against the NCBI Blast web page.
+ 
+     This uses the NCBI web page cgi script to BLAST, and returns a handle
+     to the results. See:
+     
+     http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html
+ 
+     for more descriptions about the options.
+ 
+     Required Inputs:
+     o program - The name of the blast program to run (ie. blastn, blastx...)
+     o database - The database to search against (ie. nr, dbest...)
+     o query - The input for the search, which NCBI tries to autodetermine
+     the type of. Ideally, this would be a sequence in FASTA format.
+ 
+     General Options:
+     filter, expect, word_size, other_advanced
+ 
+     Formatting Options:
+     show_overview, ncbi_gi, format_object, format_type, descriptions,
+     alignments, alignment_view, auto_format
  
!     Protein specific options:
!     cdd_search, composition_based_statistics, matrix_name, run_psiblast,
!     i_thresh
  
+     Translated specific options:
+     genetic_code
      """
      # NCBI Blast is hard to work with.  The user enters a query, and then
      # it returns a "reference" page which contains a button that the user
***************
*** 571,619 ****
      # page to figure out how to retrieve the results.  Then, it needs to
      # check the results to see if the search has been finished.
      params = {'PROGRAM' : program,
!               'DATALIB' : datalib,
!               'SEQUENCE' : sequence,
!               'DOUBLE_WINDOW' : double_window,
!               'GI_LIST' : gi_list,
!               'LIST_ORG' : list_org,
!               'INPUT_TYPE' : input_type,
!               'EXPECT' : expect,
                'FILTER' : filter,
                'GENETIC_CODE' : genetic_code,
!               'MAT_PARAM' : mat_param,
!               'OTHER_ADVANCED' : other_advanced,
                'NCBI_GI' : ncbi_gi,
!               'OVERVIEW' : overview,
!               'ALIGNMENT_VIEW' : alignment_view,
                'DESCRIPTIONS' : descriptions,
                'ALIGNMENTS' : alignments,
!               'EMAIL' : email,
!               'PATH' : path,
!               'HTML' : html
!               }
      variables = {}
      for k in params.keys():
          if params[k] is not None:
              variables[k] = str(params[k])
      # This returns a handle to the HTML file that points to the results.
!     handle = NCBI._open(cgi, variables, get=0)
      # Now parse the HTML from the handle and figure out how to retrieve
      # the results.
      refcgi, params = _parse_blast_ref_page(handle, cgi)
  
      start = time.time()
      while 1:
          # Sometimes the BLAST results aren't done yet.  Look at the page
          # to see if the results are there.  If not, then try again later.
          handle = NCBI._open(cgi, params, get=0)
!         ready, results, refresh_delay = _parse_blast_results_page(handle)
          if ready:
              break
          # Time out if it's not done after timeout minutes.
          if time.time() - start > timeout*60:
              raise IOError, "timed out after %d minutes" % timeout
!         # pause and try again.
!         time.sleep(refresh_delay)
      return File.UndoHandle(File.StringHandle(results))
  
  def _parse_blast_ref_page(handle, base_cgi):
--- 610,691 ----
      # page to figure out how to retrieve the results.  Then, it needs to
      # check the results to see if the search has been finished.
      params = {'PROGRAM' : program,
!               'DATABASE' : database,
!               'QUERY' : query,
!               'ENTREZ_QUERY' : entrez_query,
                'FILTER' : filter,
+               'EXPECT' : expect,
+               'WORD_SIZE' : word_size,
+               'UNGAPPED_ALIGNMENT' : ungapped_alignment,
+               'OTHER_ADVANCED': other_advanced,
+               'CDD_SEARCH' : cdd_search,
+               'COMPOSITION_BASED_STATISTICS' : composition_based_statistics,
+               'MATRIX_NAME' : matrix_name,
+               'RUN_PSIBLAST' : run_psiblast,
+               'I_THRESH' : i_thresh,
                'GENETIC_CODE' : genetic_code,
!               'SHOW_OVERVIEW' : show_overview,
                'NCBI_GI' : ncbi_gi,
!               'FORMAT_OBJECT' : format_object,
!               'FORMAT_TYPE' : format_type,
                'DESCRIPTIONS' : descriptions,
                'ALIGNMENTS' : alignments,
!               'ALIGNMENT_VIEW' : alignment_view,
!               'AUTO_FORMAT' : auto_format}
      variables = {}
      for k in params.keys():
          if params[k] is not None:
              variables[k] = str(params[k])
+             
+     variables['CLIENT'] = 'web'
+     variables['SERVICE'] = 'plain'
+     variables['CMD'] = 'Put'
+ 
+     if program.upper() == 'BLASTN':
+         variables['PAGE'] = 'Nucleotides'
+     elif program.upper() == 'BLASTP':
+         variables['PAGE'] = 'Proteins'
+     elif program.upper() in ['BLASTX', 'TBLASTN','TBLASTX']:
+         variables['PAGE'] = 'Translations'
+     else:
+         raise ValueError("Unexpected program name %s" % program)
+         
      # This returns a handle to the HTML file that points to the results.
!     handle = NCBI._open(cgi, variables, get = 0)
      # Now parse the HTML from the handle and figure out how to retrieve
      # the results.
      refcgi, params = _parse_blast_ref_page(handle, cgi)
  
+     # start with the initial recommended delay. Otherwise we get hit with
+     # an extra long delay right away
+     if params.has_key("RTOE"):
+         refresh_delay = int(params["RTOE"]) + 1
+         del params["RTOE"]
+     else:
+         refresh_delay = 5
+ 
+     cgi = refcgi
      start = time.time()
      while 1:
+         # pause before trying to get the results
+         time.sleep(refresh_delay)
+         
          # Sometimes the BLAST results aren't done yet.  Look at the page
          # to see if the results are there.  If not, then try again later.
          handle = NCBI._open(cgi, params, get=0)
!         ready, results, refresh_delay, cgi = _parse_blast_results_page(handle)
!         
          if ready:
              break
          # Time out if it's not done after timeout minutes.
          if time.time() - start > timeout*60:
              raise IOError, "timed out after %d minutes" % timeout
! 
!     # now get the results page and return it
!     # -- the "ready" page from before is just a check page
!     result_handle = NCBI._open(refcgi, params, get=0)
!     results = result_handle.read()
!     
      return File.UndoHandle(File.StringHandle(results))
  
  def _parse_blast_ref_page(handle, base_cgi):
***************
*** 635,654 ****
                  if attr == 'ACTION':
                      self.cgi = urlparse.urljoin(self.cgi, value)
          def do_input(self, attributes):
!             # parse the "INPUT" tags to try and find the reference ID (RID)
!             is_rid = 0
!             rid = None
              for attr, value in attributes:
                  attr, value = string.upper(attr), string.upper(value)
!                 if attr == 'NAME' and value == 'RID':
!                     is_rid = 1
                  elif attr == 'VALUE':
!                     rid = value
!             if is_rid and rid:
!                 self.params['RID'] = rid
                  
      parser = RefPageParser(base_cgi)
!     parser.feed(handle.read())
      if not parser.params.has_key('RID'):
          raise SyntaxError, "Error getting BLAST results: RID not found"
      return parser.cgi, parser.params
--- 707,731 ----
                  if attr == 'ACTION':
                      self.cgi = urlparse.urljoin(self.cgi, value)
          def do_input(self, attributes):
!             # parse out all of the different inputs we are interested in
!             inputs = ["RID", "RTOE", "CLIENT", "CMD", "PAGE",
!                       "EXPECT", "DESCRIPTIONS", "ALIGNMENTS", "AUTO_FORMAT"]
! 
!             cur_input = None
!             
              for attr, value in attributes:
                  attr, value = string.upper(attr), string.upper(value)
!                 if attr == 'NAME' and value in inputs:
!                     cur_input = value
                  elif attr == 'VALUE':
!                     if cur_input is not None:
!                         if value:
!                             self.params[cur_input] = value
                  
      parser = RefPageParser(base_cgi)
!     html_info = handle.read()
!     
!     parser.feed(html_info)
      if not parser.params.has_key('RID'):
          raise SyntaxError, "Error getting BLAST results: RID not found"
      return parser.cgi, parser.params
***************
*** 659,679 ****
          def __init__(self):
              sgmllib.SGMLParser.__init__(self)
              self.ready = 0
              self.refresh = 5
          def handle_comment(self, comment):
!             comment = string.lower(comment)
!             if string.find(comment, 'status=ready') >= 0:
                  self.ready = 1
          _refresh_re = re.compile('REFRESH_DELAY=(\d+)', re.IGNORECASE)
!         def do_meta(self, attributes):
!             for attr, value in attributes:
!                 m = self._refresh_re.search(value)
!                 if m:
!                     self.refresh = int(m.group(1))
      results = handle.read()
      parser = ResultsParser()
      parser.feed(results)
!     return parser.ready, results, parser.refresh
  
  
  def blasturl(program, datalib, sequence,
--- 736,794 ----
          def __init__(self):
              sgmllib.SGMLParser.__init__(self)
              self.ready = 0
+             self.refresh_cgi = None
              self.refresh = 5
+ 
          def handle_comment(self, comment):
!             # determine if it is ready
!             if string.find(comment.lower(), 'status=ready') >= 0:
                  self.ready = 1
+             # otherwise, we need to parse for the delay and url
+             elif string.find(comment, 'location.href') >= 0:
+                 self.refresh_cgi, self.refresh = self._find_cgi_info(comment)
+ 
          _refresh_re = re.compile('REFRESH_DELAY=(\d+)', re.IGNORECASE)
!         def _find_cgi_info(self, comment):
!             """Find the refresh CGI string and refresh delay from a comment.
! 
!             We are parsing a comment string like:
!             setTimeout('location.href =
!             "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?
!             CMD=Get&RID=984874645-19210-15659&CHECK_STATUS_ONLY=yes&
!             REFRESH_DELAY=106&AUTO_FORMAT=yes&KEY=20111";',106000);
! 
!             Arguments:
! 
!             o comment - A comment which is assumed to have been checked to
!             have the refresh delay cgi string in it.
!             """
!             # find where the cgi string starts
!             href_string = 'location.href = "'
!             cgi_start_pos = string.find(comment, href_string)
!             assert cgi_start_pos is not -1, \
!                    "Unable to parse the start of the refresh cgi."
!             # the cgi starts at the end of the location.href stuff
!             cgi_start_pos += len(href_string)
! 
!             # find the end pos of the cgi string
!             cgi_end_pos = string.find(comment, '"', cgi_start_pos)
!             assert cgi_end_pos is not -1, \
!                    "Unable to parse end of refresh cgi."
! 
!             refresh_cgi = comment[cgi_start_pos:cgi_end_pos]
! 
!             # parse the refresh delay out of the comment
!             m = self._refresh_re.search(refresh_cgi)
!             assert m, "Failed to parse refresh time from %s" % refresh_cgi
!             refresh = int(m.group(1))
! 
!             return refresh_cgi, refresh
!                     
      results = handle.read()
+     
      parser = ResultsParser()
      parser.feed(results)
!     return parser.ready, results, parser.refresh, parser.refresh_cgi
  
  
  def blasturl(program, datalib, sequence,

