[Bioperl-l] Fwd: Question regarding Bio::GenBank module
Chris Fields
cjfields at uiuc.edu
Wed Aug 8 19:41:34 UTC 2007
NCBI eUtils (which Bio::DB::GenBank uses to get sequence data) has a
list of user requirements:
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements
The most important one is the 3-second delay between requests, but
the module already enforces that policy, so there isn't a real issue
unless you deliberately change that setting. NCBI has been known
to block IPs that don't follow that particular rule. Also, if you
are planning to make hundreds of requests, you should consider running
the script during low-traffic times, as indicated in the above link.
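
To make the delay policy concrete, here is a minimal sketch of the idea,
written in Python rather than Perl so that it stays self-contained. It is
not BioPerl's implementation: Throttle, fetch_all, and fetch_one are
hypothetical names, and the 3-second default simply mirrors the NCBI rule
mentioned above.

```python
import time


class Throttle:
    """Keep successive requests at least min_interval seconds apart.

    Hypothetical helper for illustration; not part of BioPerl or NCBI tooling.
    """

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(ids, fetch_one, throttle):
    """Call fetch_one(id) for every id, pausing between requests."""
    results = []
    for seq_id in ids:
        throttle.wait()          # no delay before the very first request
        results.append(fetch_one(seq_id))
    return results
```

Here fetch_one stands in for whatever call retrieves a single record. With
Bio::DB::GenBank you should not need anything like this, since the module
applies the delay itself.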
chris
On Aug 8, 2007, at 2:16 PM, Jason Stajich wrote:
> Young -
> I'm forwarding to the list for more help.
>
> Begin forwarded message:
>
>> From: "Young Song" <youngcsong at gmail.com>
>> Date: August 8, 2007 1:48:29 PM CDT
>> To: jason at bioperl.org
>> Subject: Question regarding Bio::GenBank module
>>
>> Hello,
>>
>> I am currently located in Vancouver, Canada, and I have a question
>> about the Bio::DB::GenBank module for BioPerl. I read in the online
>> documentation for the module
>> (http://search.cpan.org/dist/bioperl/Bio/DB/GenBank.pm) that we are not
>> supposed to spam NCBI with multiple requests, which led me to think
>> about the script that I wrote. I am trying to extract some information
>> based on the FASTA protein files located in NCBI's database. The
>> script reads each '.faa' (FASTA protein) file, takes the 'gi' ID
>> for each sequence, and extracts several pieces of information, which
>> looks like the following output (please note that there are a lot more
>> gi's than I am showing you right now):
>>
>> 10954456
>> accesstion number: NP_047185.1
>> dbsource: GenBank: NC_001911.1
>> NP_047185.1
>> starting pos. at genomic seq: 1488
>> ending pos. at genomic seq: 1991
>> strand: +
>> description: putative membrane-associated protein
>> organism: Buchnera aphidicola
>> MERIIEKAIYASRWLMFPVYVGLSFGFILLTLKFFQQIVFIIPDILAMSESGLVLVVLSLIDIALVGGLL
>> VMVMFLGYENFISKMDIQDNEKRLGWMGTMDVNSIKNKVASSIVAISSVHLLRLFMEAEKILDDKIMLCV
>> IIHLTFVLSAFGMAYIDKMSKKKHVLH
>> ************************************************
>> 10954457
>> accesstion number: NP_047186.1
>> dbsource: GenBank: NC_001911.1
>> NP_047186.1
>> starting pos. at genomic seq: 2158
>> ending pos. at genomic seq: 2913
>> strand: +
>> description: putative replication-associated protein
>> organism: Buchnera aphidicola
>> MPRKNYIYNPKPVFNPPKNKRKISTFICYAMKKASEIDVARSNLNYTLLLIDPKTGNILPRFRRLNEHRA
>> CAMRAIVLAMLYYFDIHSNLVEASIEKLADECGLSTFSDSGNKSITRVSRLINDFLEPMGFVRCKKIKRK
>> FVSNYIPKKIFLTPMFFMLFNISQSKINRYLFKSKKMSQNLKITEKKIFISFSDIKVMSRLDEKSIRKKI
>> LNALINYYTASELTKIGPKGLKKRIDIEYNNLCKLFKKIKK
>>
>>
>>
>> Because there are a lot of sequences I am dealing with here, I am a
>> little worried that I may be causing harm to the NCBI server. I just
>> need to know if this is the right approach to take, or if there is
>> another solution (I am a little confused about what you mean by
>> "multiple requests" in the documentation).
>> Your reply would be very much appreciated. Thank you in advance.
>>
>> Sincerely,
>>
>> Young C. Song
>
> --
> Jason Stajich
> jason at bioperl.org
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign