[Bioperl-l] NCBI GenBank web retrieval
Lincoln Stein
lstein@cshl.org
Thu, 24 Jan 2002 14:29:08 -0500
Hi All,
I just spent a few hours restoring partial functionality to Boulder::Genbank.
I've been able to fix its ability to retrieve a list of accession numbers by
changing the URI to use the "demo" batch retriever at
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/. This seems to have exactly the
same API as the old batch retriever (now retired), but adds XML support,
which is nice. Unfortunately, the demo retriever doesn't want to return any
results in response to Entrez queries, so fetch by query doesn't work. Dang.
The new version is uploaded to CPAN. (I've also added proxy support for
firewall users.)
The design goal for Boulder::Genbank, you will recall, was to allow retrieving
either a long list of sequence accessions or the results of an arbitrary Entrez
query in Fasta, Genbank, or Boulder (parsed) format. It got around NCBI's download
limits by carefully breaking down the requests into small chunks and
reissuing the requests as needed. I was able to use this interface to fetch
all the Rice ESTs (many hundreds of thousands) at regular intervals, and
didn't have to worry about timeouts and the like.
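For anyone who hasn't looked at the module, the chunk-and-reissue idea can be
sketched roughly as follows. This is an illustration in Python, not the actual
Boulder::Genbank code: the names chunked and fetch_in_chunks, the default chunk
size, and the retry count are all made up for the example, and fetch stands in
for whatever routine actually talks to NCBI.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield successive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_in_chunks(
    accessions: Sequence[str],
    fetch: Callable[[List[str]], List[str]],  # hypothetical: one request per batch
    chunk_size: int = 100,                    # assumed value, not NCBI's real limit
    max_retries: int = 3,                     # assumed value
    delay: float = 0.0,                       # seconds to wait before reissuing
) -> List[str]:
    """Fetch records batch by batch, reissuing a batch that fails."""
    records: List[str] = []
    for batch in chunked(accessions, chunk_size):
        for attempt in range(1, max_retries + 1):
            try:
                records.extend(fetch(batch))
                break  # this batch succeeded; move on to the next one
            except OSError:
                # Timeouts and dropped connections surface as OSError subclasses.
                if attempt == max_retries:
                    raise
                time.sleep(delay)
    return records
```

A failure only costs one small batch rather than the whole download, which is
what makes very long accession lists practical.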
I would like to know whether the demo batch retriever is stable, or will go
the same route as the previous batch retriever. Also, should I just retire
Boulder::Genbank? If I do, does Bio::DB::GenBank support these big queries,
and if so how does it do it?
Lincoln
On Saturday 19 January 2002 17:48, Jason Stajich wrote:
> [jason having learned way too much about how to reverse engineer CGI]
>
> I've restored the functionality from previous versions of DB::GenBank and
> DB::GenPept as we are using the new NCBI cgi /htbin-post/Entrez/query.
> I was able to figure out that terms are encoded as being separated by '+'
> instead of the previous ',' which had been causing only one sequence to
> be retrieved. Additionally I fixed a bug that retrieved the last rather
> than the first sequence for a request that has multiple hits and uses
> get_Seq_by_(id|acc).
>
> I was unable to reactivate access to Batch entrez through
> /entrez/batchentrez.cgi as that only seems to return an HTML table and I
> am trying to avoid the 2-step query process at this time. I attempted to
> mimic Lincoln's functionality in Boulder::Genbank here, but alas it
> appears that the previous /cgi-bin/Entrez/qserver.cgi/result is disabled.
> Lincoln - I believe this breaks Boulder 1.24 Entrez access as well. I
> guess we can go to a 2-step retrieval by parsing HTML if people are
> interested.
>
> Are there limits on the size of URLs? I thought there might be, which could
> be a problem since the requests are sent as GETs not POSTs. Otherwise we
> basically have batch entrez functionality back in.
>
> (Roger this is essentially the fix we talked about - as best as I can
> solve it so you can take it off your queue unless you've got ideas)
>
> -jason
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
========================================================================