[Bioperl-l] NCBI GenBank web retrieval

Josiah Altschuler jaltschuler@mcb.Harvard.edu
Thu, 24 Jan 2002 16:10:38 -0500


I wasn't sure why the Boulder module didn't work anymore last week for
queries, so I put it aside and wrote code to submit queries to
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query and just parsed the
HTML.  This seemed to work fine.  Is it not possible to do this with
Boulder?

Josiah


-----Original Message-----
From: Lincoln Stein [mailto:lstein@cshl.org]
Sent: Thursday, January 24, 2002 2:29 PM
To: Jason Stajich; Bioperl
Cc: Josiah Altschuler; Baumohl, Jason; pan@cshl.org
Subject: Re: [Bioperl-l] NCBI GenBank web retrieval


Hi All,

I just spent a few hours restoring partial functionality to
Boulder::Genbank.  I've been able to fix its ability to retrieve a list of
accession numbers by changing the URI to use the "demo" batch retriever at
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/.  This seems to have exactly
the same API as the old batch retriever (now retired), but adds XML
support, which is nice.  Unfortunately, the demo retriever doesn't want to
return any results in response to Entrez queries, so fetch by query
doesn't work.  Dang.

The new version is uploaded to CPAN.  (I've also added proxy support for 
firewall users.)

The design goal for Boulder::Genbank, you will recall, was to allow either
retrieving a long list of sequence accessions, or an arbitrary Entrez query
in Fasta, Genbank, or Boulder (parsed) format.  It got around NCBI's
download limits by carefully breaking down the requests into small chunks
and reissuing the requests as needed.  I was able to use this interface to
fetch all the Rice ESTs (many hundreds of thousands) at regular intervals,
and didn't have to worry about timeouts and the like.
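
The chunk-and-reissue approach described above can be sketched roughly as
follows.  This is a minimal illustration in Python, not the actual
Boulder::Genbank code (which is Perl).  The fetch_batch callable, the chunk
size, and the retry count are all hypothetical placeholders; the real
module's values and its transport code are not shown in this thread.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_all(
    accessions: Sequence[str],
    fetch_batch: Callable[[List[str]], List[str]],  # hypothetical: one request per call
    chunk_size: int = 100,       # assumed value, not NCBI's real limit
    max_attempts: int = 3,       # assumed value
    delay_seconds: float = 0.0,  # pause between reissued requests
) -> List[str]:
    """Fetch records in small batches, reissuing a batch when it fails."""
    records: List[str] = []
    for batch in chunked(accessions, chunk_size):
        for attempt in range(1, max_attempts + 1):
            try:
                records.extend(fetch_batch(batch))
                break  # this batch succeeded, move on to the next one
            except IOError:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
                time.sleep(delay_seconds)
    return records
```

Because each batch is retried on its own, a timeout costs one small request
rather than the whole download.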

I would like to know whether the demo batch retriever is stable, or will go
the same route as the previous batch retriever.  Also, should I just retire
Boulder::Genbank?  If I do, does Bio::DB::GenBank support these big
queries, and if so, how does it do it?

Lincoln

On Saturday 19 January 2002 17:48, Jason Stajich wrote:
> [jason having learned way too much about how to reverse engineer CGI]
>
> I've restored the functionality from previous versions of DB::GenBank and
> DB::GenPept as we are using the new NCBI cgi /htbin-post/Entrez/query.
> I was able to figure out that terms are encoded as being separated by '+'
> instead of the previous ',' which had been causing only one sequence to
> be retrieved.  Additionally, I fixed a bug that retrieved the last rather
> than the first sequence when a request has multiple hits and uses
> get_Seq_by_(id|acc).
>
> I was unable to reactivate access to Batch entrez through
> /entrez/batchentrez.cgi as that only seems to return an HTML table and I
> am trying to avoid the 2-step query process at this time.  I attempted to
> mimic Lincoln's functionality in Boulder::Genbank here, but alas it
> appears that the previous /cgi-bin/Entrez/qserver.cgi/result is disabled.
> Lincoln - I believe this breaks Boulder 1.24 Entrez access as well.  I
> guess we can go to a 2-step retrieval by parsing HTML if people are
> interested.
>
> Are there limits to the size of URLs?  I thought there might be, which
> could be a problem since the requests are sent as GETs, not POSTs.
> Otherwise we basically have batch entrez functionality back in.
>
> (Roger this is essentially the fix we talked about - as best as I can
> solve it so you can take it off your queue unless you've got ideas)
>
> -jason
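
Jason's two points above ('+'-separated terms, and the URL length concern
with GET requests) can be sketched together.  This is a rough Python
illustration, not Bio::DB::GenBank code.  The base URL is the one named in
the thread; the "term" parameter name and the 2000-character budget are
assumptions made only for the example.  HTTP itself sets no fixed URL
length, but servers and proxies impose their own practical limits.

```python
from typing import List, Sequence
from urllib.parse import quote

BASE_URL = "http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query"  # from the thread
MAX_URL_LENGTH = 2000  # assumed budget, not a documented NCBI limit


def build_url(ids: Sequence[str], param: str = "term") -> str:
    """Join IDs with '+' into one GET URL ("term" is a hypothetical name)."""
    return f"{BASE_URL}?{param}=" + "+".join(quote(i, safe="") for i in ids)


def urls_within_limit(
    ids: Sequence[str], max_length: int = MAX_URL_LENGTH
) -> List[str]:
    """Split IDs across several URLs so that none exceeds max_length."""
    urls: List[str] = []
    current: List[str] = []
    for acc in ids:
        candidate = current + [acc]
        if current and len(build_url(candidate)) > max_length:
            urls.append(build_url(current))  # flush the full group
            current = [acc]
        else:
            current = candidate
    if current:
        urls.append(build_url(current))
    return urls
```

Sending the list as a POST body would sidestep the URL length question
entirely, if the server accepts it.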

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org                                   Cold Spring Harbor, NY
========================================================================