[Bioperl-l] EUtilities term handling

Chris Fields cjfields at uiuc.edu
Thu Oct 5 14:31:06 UTC 2006


On Oct 5, 2006, at 2:19 AM, Sendu Bala wrote:

> This is actually a general question and not limited to EUtilities. As I
> see it EUtilities lets you do queries in Bioperl that you can do on a
> website. The question is, should a Bioperl module always work with
> queries that the website it is a front-end to works with?
>
> So for example, Bio::DB::EUtilities::esearch in -db mode 'gene' is
> essentially a frontend onto:
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?retmode=xml&db=gene&term=
>
> With a web-browser you can complete that url by supplying a term. For
> example, the term 'BRCA2+9606[taxid]' works and returns results:
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?retmode=xml&db=gene&term=BRCA2+9606[taxid]
>
> If you supply the exact same term to EUtilities::esearch like so:
>
> my $esearch = Bio::DB::EUtilities->new(-eutil => "esearch", -db =>
> "gene", -term => "BRCA2+9606[taxid]");
>
> The search fails. From my 'user' perspective this is highly unexpected.
> Chris (the author) and I both understand /why/ it fails, but Chris
> doesn't think it is a bug, or at least something that can/should be
> changed. What do other people think? At the very least, if something
> unexpected happens, I'd suggest making a note of it in the POD
> somewhere. Eg. "Do not use + in term strings, even though they might
> work on the website".
>
> Chris: what is the disadvantage of always submitting '+' as '+' to the
> server?

A few reasons:

1)  According to NCBI, you can use '+' in queries, but not as a
boolean.  Globally changing '+' to a space could change the meaning of
the query on a few rare occasions.  For example, if you really wanted to
search for the string 'BRCA2+ATG', NCBI looks for that term literally.

2)  '+' is reserved in a URI query string, where it stands for a space.
Therefore, any parameter containing a literal '+' is URI-encoded as %2B,
which is decoded on NCBI's end back to '+' (this is demonstrable with
current EUtilities output and the returned XML data).

3)  Why not just use a space (implicit AND)?  Or an explicit  
boolean?  Or '&' (which apparently works but is not specified in the  
NCBI Entrez docs)?
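
To make point 2 concrete, here is a minimal sketch of the usual form-encoding round trip. The helpers form_encode and form_decode are hypothetical names written for this illustration only; they are not part of Bio::DB::EUtilities, and I am assuming NCBI's server applies ordinary form decoding, which is consistent with the behavior described above. In real code URI::Escape's uri_escape would normally do the encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a parameter value the way a client library
# typically does. Reserved characters become %XX, a space becomes '+'.
sub form_encode {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~ ])/sprintf('%%%02X', ord $1)/ge;
    $s =~ tr/ /+/;
    return $s;
}

# Hypothetical helper: decode a value the way a server typically does.
# '+' becomes a space first, then %XX escapes are expanded.
sub form_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $s;
}

# Typed directly into a browser URL, the '+' reaches the server as-is
# and is read as a space (implicit AND between the two terms).
print form_decode('BRCA2+9606[taxid]'), "\n";

# Passed as a parameter value through a client library, the '+' is
# escaped, so the server sees a literal '+' inside a single term.
my $sent = form_encode('BRCA2+9606[taxid]');
print $sent, "\n";
print form_decode($sent), "\n";

# Using a space in the term gives the browser-equivalent query.
print form_encode('BRCA2 9606[taxid]'), "\n";
```

So the same string means two different things depending on whether it is pasted into a URL or handed over as a parameter value; supplying a space (or an explicit AND) in -term avoids the ambiguity.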

The bug is in the query and not in the code, i.e. it is a
user-generated bug, not an EUtilities bug.  And it shouldn't be  
unexpected, as NCBI has very specific rules for building queries for  
Entrez (just like any other database).  If I were to use nonstandard  
queries for MySQL, BioFetch, UCSC, or anything else, I would expect  
to get bad results.  As the old saying goes, garbage in, garbage out.

The following link has their updated rules:

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.chapter.EntrezHelp

Here is their old one:

http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html

We could, of course, put something in POD, but you never presented
that option to me before.  I'll grant that the EUtilities API needs
some cleaning up, which is not easy to do when the returned data varies
from utility to utility.  But it does get the URL encoding correct, at
least in this case.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign





