[Bioperl-l] Fwd: Question regarding Bio::GenBank module

Wed Aug 8 19:41:34 UTC 2007

NCBI eUtils (which Bio::DB::GenBank uses to get sequence data) has a  
list of user requirements:

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements

The most important one is the 3-second delay between requests, but
the module already implements that policy, so there isn't a real issue
unless you deliberately change that setting.  NCBI has been known
to block IPs that don't follow that particular rule.  Also, if you
are planning to make hundreds of requests, you should consider running
the script during low-traffic times, as indicated in the above link.
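
To make the delay policy concrete, here is a minimal sketch of what
"wait at least 3 seconds between requests" amounts to.  It is written in
Python purely for illustration and is not the code Bio::DB::GenBank
uses internally; the names Throttle, fetch_all and fetch_one are
hypothetical, and fetch_one stands in for whatever call actually
retrieves one record.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests.

    Hypothetical helper for illustration only; the 3.0 second default
    mirrors the delay discussed above.
    """

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep      # without really waiting
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has elapsed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


def fetch_all(ids, fetch_one, throttle):
    """Fetch each ID in turn, pausing between requests as required."""
    results = []
    for seq_id in ids:
        throttle.wait()
        results.append(fetch_one(seq_id))
    return results
```

With the defaults, fetch_all(gi_list, my_fetch, Throttle()) would issue
at most one request every 3 seconds.  If you use Bio::DB::GenBank as
shipped, you should not need anything like this yourself.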

chris

On Aug 8, 2007, at 2:16 PM, Jason Stajich wrote:

> Young -
> I'm forwarding to the list for more help.
>
> Begin forwarded message:
>
>> From: "Young Song" <youngcsong at gmail.com>
>> Date: August 8, 2007 1:48:29 PM CDT
>> To: jason at bioperl.org
>> Subject: Question regarding Bio::GenBank module
>>
>> Hello,
>>
>>    I am currently located in Vancouver, Canada, and I have a question
>> about the Bio::GenBank module for BioPerl.  I read in the online
>> documentation for the module
>> (http://search.cpan.org/dist/bioperl/Bio/DB/GenBank.pm) that we are not
>> supposed to spam NCBI with multiple requests, which led me to think
>> about the script that I wrote.  I am trying to extract some information
>> based on the FASTA protein files located in NCBI's database.  The
>> script reads each '.faa' (FASTA protein) file, takes the 'gi' ID of
>> each sequence, and extracts several pieces of information, producing
>> output like the following (please note that there are a lot more GIs
>> than I am showing you here):
>>
>> 10954456
>> accesstion number: NP_047185.1
>> dbsource: GenBank: NC_001911.1
>> NP_047185.1
>> starting pos. at genomic seq: 1488
>> ending pos. at genomic seq: 1991
>> strand: +
>> description: putative membrane-associated protein
>> organism: Buchnera aphidicola
>> MERIIEKAIYASRWLMFPVYVGLSFGFILLTLKFFQQIVFIIPDILAMSESGLVLVVLSLIDIALVGGLL
>> VMVMFLGYENFISKMDIQDNEKRLGWMGTMDVNSIKNKVASSIVAISSVHLLRLFMEAEKILDDKIMLCV
>> IIHLTFVLSAFGMAYIDKMSKKKHVLH
>> ************************************************
>> 10954457
>> accesstion number: NP_047186.1
>> dbsource: GenBank: NC_001911.1
>> NP_047186.1
>> starting pos. at genomic seq: 2158
>> ending pos. at genomic seq: 2913
>> strand: +
>> description: putative replication-associated protein
>> organism: Buchnera aphidicola
>> MPRKNYIYNPKPVFNPPKNKRKISTFICYAMKKASEIDVARSNLNYTLLLIDPKTGNILPRFRRLNEHRA
>> CAMRAIVLAMLYYFDIHSNLVEASIEKLADECGLSTFSDSGNKSITRVSRLINDFLEPMGFVRCKKIKRK
>> FVSNYIPKKIFLTPMFFMLFNISQSKINRYLFKSKKMSQNLKITEKKIFISFSDIKVMSRLDEKSIRKKI
>> LNALINYYTASELTKIGPKGLKKRIDIEYNNLCKLFKKIKK
>>
>>
>>
>>   Because I am dealing with a lot of sequences here, I am a little bit
>> worried that I may be causing harm to the NCBI server.  I just need to
>> know if this is the right approach to take, or if there is another
>> solution (I am a little bit confused about what is meant by "multiple
>> requests" in the documentation).
>> Your reply would be very much appreciated.  Thank you in advance.
>>
>>   Sincerely,
>>
>>      Young C. Song
>
> --
> Jason Stajich
> jason at bioperl.org

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign