[Bioperl-l] EUtilities term handling
Chris Fields
cjfields at uiuc.edu
Thu Oct 5 14:31:06 UTC 2006
On Oct 5, 2006, at 2:19 AM, Sendu Bala wrote:
> This is actually a general question and not limited to EUtilities. As I
> see it, EUtilities lets you do queries in Bioperl that you can do on a
> website. The question is, should a Bioperl module always work with
> queries that the website it is a front-end to works with?
>
> So for example, Bio::DB::EUtilities::esearch in -db mode 'gene' is
> essentially a frontend onto:
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?retmode=xml&db=gene&term=
>
> With a web-browser you can complete that url by supplying a term. For
> example, the term 'BRCA2+9606[taxid]' works and returns results:
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?retmode=xml&db=gene&term=BRCA2+9606[taxid]
>
> If you supply the exact same term to EUtilities::esearch like so:
>
> my $esearch = Bio::DB::EUtilities->new(-eutil => "esearch", -db =>
> "gene", -term => "BRCA2+9606[taxid]");
>
> The search fails. From my 'user' perspective this is highly unexpected.
> Chris (the author) and I both understand /why/ it fails, but Chris
> doesn't think it is a bug, or at least something that can/should be
> changed. What do other people think? At the very least, if something
> unexpected happens, I'd suggest making a note of it in the POD
> somewhere. Eg. "Do not use + in term strings, even though they might
> work on the website".
>
> Chris: what is the disadvantage of always submitting '+' as '+' to the
> server?
A few reasons:
1) According to NCBI, you can use '+' in queries, but not as a
boolean. Globally changing '+' to a space may change the meaning of
the query on a few rare occasions. For example, if you really wanted
to search for the string 'BRCA2+ATG', NCBI would look for that term
literally.
2) '+' is a reserved character in URIs, where it stands for a space
in a query string. Therefore, any parameter containing a literal '+'
is URI-encoded as %2B, which is decoded on NCBI's end back to '+'
(this is demonstrable with current EUtilities output and the returned
XML data).
3) Why not just use a space (implicit AND)? Or an explicit
boolean? Or '&' (which apparently works but is not specified in the
NCBI Entrez docs)?
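To illustrate point 2, here is a minimal sketch of the encoding round
trip. It is written in Python with only the standard library rather than
with Bio::DB::EUtilities, so it shows generic URL-encoding behaviour and
not the module's actual code. The helper name esearch_url is hypothetical,
and the base URL is the one quoted above.

```python
from urllib.parse import quote_plus, unquote_plus, urlencode

# Base URL as quoted earlier in this thread.
ESEARCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="gene"):
    """Hypothetical helper: build an esearch URL, encoding each value."""
    params = {"retmode": "xml", "db": db, "term": term}
    return ESEARCH + "?" + urlencode(params)


# A literal '+' passed as a parameter value is encoded as %2B,
# so the server decodes it back to a literal '+'.
print(quote_plus("BRCA2+9606[taxid]"))

# A space in the parameter value is what becomes '+' on the wire.
print(quote_plus("BRCA2 9606[taxid]"))

# A raw '+' typed straight into a URL is decoded as a space,
# which is why the same string behaves differently in a browser.
print(unquote_plus("BRCA2+9606[taxid]"))

# Using an explicit boolean avoids the ambiguity altogether.
print(esearch_url("BRCA2 AND 9606[taxid]"))
```

In other words, a term written with a space or an explicit AND reaches the
server as a boolean query, while a term written with '+' reaches it as a
literal '+'.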
The bug is in the query and not in the code, i.e. it is a user-generated
bug, not an EUtilities bug. And it shouldn't be
unexpected, as NCBI has very specific rules for building queries for
Entrez (just like any other database). If I were to use nonstandard
queries for MySQL, BioFetch, UCSC, or anything else, I would expect
to get bad results. As the old saying goes, garbage in, garbage out.
The following link has their updated rules:
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.chapter.EntrezHelp
Here is their old one:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html
We could, of course, put something in POD, but you never presented
that option to me before. I'll grant that the EUtilities API needs
some cleaning up, which is not easy to do when the returned data
varies from utility to utility. But it does get the URL encoding
correct, at least in this case.
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign