[Bioperl-l] how to prevent forced exit?
Chris Fields
cjfields at illinois.edu
Tue Mar 15 15:54:38 UTC 2011
Jim,
It's worth noting somewhere more prominent, like the tutorial, FAQ, and appropriate HOWTOs. We've long considered whether it would be worth setting up a cookbook-like section of simple workflows, which would also be an appropriate place for something like this.
Also, I do think this is mentioned in the POD for several modules, but it's worth adding a prominent section wherever it's not present. In Bio::DB::GenBank:
WARNING: Please do NOT spam the Entrez web server with multiple requests. NCBI offers Batch Entrez for this purpose.
That could be placed somewhere that is a bit more helpful.
chris
On Mar 15, 2011, at 10:25 AM, Jim Hu wrote:
> Hi Chris,
>
> A version of this admonition should be on every wiki HOWTO that involves retrieving records from external sources, and in the docs for the relevant modules. Speaking as someone who has used BioPerl intermittently for years, and who has a Sisyphus-like relationship with the learning curve, I think the docs could use more discussion of when to use particular modules, in addition to the details of how to use them provided in the perldocs. I realize this is hard, given the Perl "more than one way to do it" world view, but that's my $0.02.
>
> Since BioPerl.org is a wiki, I suppose I should do that admonition edit myself... especially since I already know the wiki markup to transclude the same text into multiple pages.
>
> Jim
>
> Sent from my iPad
>
> On Mar 15, 2011, at 9:44 AM, Chris Fields <cjfields at illinois.edu> wrote:
>
>> Ross,
>>
>> I hope you're exaggerating; you really shouldn't use this service to retrieve 1 million records, as you'll likely find your IP banned by NCBI; they are starting to enforce stricter web-based access to their servers now. Bio::DB::GenBank uses an HTTP GET request with URI-based parameters, which effectively limits a query to around 200-300 IDs per request, so you would have to split one large request into many. Thousands of repeated requests, even with a timeout, may flag your IP as 'spam'. You can use something like Bio::DB::EUtilities to grab larger groups of seqs (~1000 IDs), because the latest EUtilities uses POST requests instead of GET for large numbers of IDs, but you are still effectively limited by the number of requests.
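To illustrate the batching described above, here is a minimal sketch (mine, not code from the thread; the Bio::DB::EUtilities parameters follow its documented interface, but treat them as assumptions). The `batch_ids` helper splits the accession list into POST-sized chunks; the fetch loop needs BioPerl and network access, so it only runs when explicitly enabled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a list of IDs into batches of at most $size each, so one
# POST-based EUtilities request carries ~1000 IDs instead of millions.
sub batch_ids {
    my ($ids, $size) = @_;
    my @batches;
    for (my $i = 0; $i < @$ids; $i += $size) {
        my $end = $i + $size - 1;
        $end = $#$ids if $end > $#$ids;
        push @batches, [ @{$ids}[ $i .. $end ] ];
    }
    return @batches;
}

# Hypothetical fetch loop: requires Bio::DB::EUtilities and network
# access, so it is off by default.
if ($ENV{DO_FETCH}) {
    require Bio::DB::EUtilities;
    my @ids = map { "ACC$_" } 1 .. 2500;    # placeholder accessions
    my $n = 0;
    for my $batch ( batch_ids( \@ids, 1000 ) ) {
        my $factory = Bio::DB::EUtilities->new(
            -eutil   => 'efetch',
            -db      => 'protein',
            -rettype => 'fasta',
            -email   => 'you@example.org',  # NCBI asks for a contact address
            -id      => $batch,
        );
        $factory->get_Response( -file => 'batch' . $n++ . '.fa' );
        sleep 1;    # stay under NCBI's request-rate limit
    }
}
```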
>>
>> Frankly, there are much better/faster ways to do this, not least of which is to just download a GenBank section and parse it directly, or use a BLAST-formatted database and fastacmd to get the seqs of interest in FASTA format. Any reason why you are not doing this?
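As a sketch of the "download a section and parse it directly" route (my own illustration, not code from the thread): once you have a local FASTA dump of the relevant GenBank division, a plain-Perl pass can pull out just the wanted accessions with no network traffic at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stream a FASTA file and keep only records whose ID is in %$wanted.
sub filter_fasta {
    my ($fh, $wanted) = @_;
    my ($keep, $out) = (0, '');
    while ( my $line = <$fh> ) {
        if ( $line =~ /^>(\S+)/ ) {    # header line: decide keep/skip
            $keep = exists $wanted->{$1};
        }
        $out .= $line if $keep;
    }
    return $out;
}

# Small demonstration with an in-memory "file".
my $demo = ">A3ZI37 kept\nMKVL\n>X99999 dropped\nGGGG\n";
open my $demo_fh, '<', \$demo or die $!;
print filter_fasta( $demo_fh, { A3ZI37 => 1 } );
```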
>>
>> chris
>>
>> On Mar 15, 2011, at 9:16 AM, Ross KK Leung wrote:
>>
>>> While the complete code is as follows, the real problem is that get_Stream_by_acc cannot be used repeatedly: when I feed a list of accession numbers (e.g. 1 million records) to the Perl script, the program exits with code 255 (likely equivalent to -1). I wonder whether anybody has encountered a similar problem and solved it.
>>>
>>>
>>> #!/usr/bin/perl
>>> use strict;
>>> use warnings;
>>>
>>> use Bio::DB::GenBank;
>>>
>>> my $gb = Bio::DB::GenBank->new(-retrievaltype => 'tempfile', -format => 'Fasta');
>>> my $allseqobj = $gb->get_Stream_by_acc("A3ZI37");
>>>
>>> print "HEELO";
>>> while (my $seqobj = $allseqobj->next_seq) {
>>>     my $seq = $seqobj->seq;
>>> }
>>> print "222 HEELO";
>>>
>>> From: Dave Messina [mailto:David.Messina at sbc.su.se]
>>> Sent: 2011年3月15日 17:02
>>> To: Ross KK Leung
>>> Cc: bioperl-l at lists.open-bio.org
>>> Subject: Re: [Bioperl-l] how to prevent forced exit?
>>>
>>> Hi Ross,
>>>
>>> Your code is incomplete and you didn't provide the output from running it, so it's not easy to figure out where you're going wrong.
>>>
>>> Try copying the example code directly from here
>>>
>>> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/GenBank.html
>>>
>>> and making sure that works first before modifying it.
>>>
>>> More documentation and examples here:
>>>
>>> http://www.bioperl.org/wiki/HOWTO:Beginners
>>> http://www.bioperl.org/wiki/Bioperl_scripts
>>>
>>> Dave
>>>
>>> On Tue, Mar 15, 2011 at 06:54, Ross KK Leung <ross at cuhk.edu.hk> wrote:
>>>
$gb = new Bio::DB::GenBank(-retrievaltype => 'tempfile', -format => 'Fasta');
>>> $allseqobj = $gb->get_Stream_by_acc("A3ZI37");
>>>
>>> print "HEELO";
>>> while ($seqobj = $allseqobj->next_seq) {
>>>     #$seqobj = $allseqobj->next_seq;
>>>     $seq = $seqobj->seq;
>>> }
>>>
>>> print "222 HEELO";
>>>
>>>
>>>
>>> I find that the 1st HEELO is printed while the 2nd one isn't. Googling
>>> doesn't turn up a way to check success/failure or whether the Seq object
>>> exists. Since the 1st HEELO executes, no throw/exception occurs in
>>> get_Stream_by_acc itself. So what can I do? The real case doesn't
>>> hard-code A3ZI37 but reads a file that may contain many of these
>>> "illegitimate" accession numbers.
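The forced exit described above is BioPerl's throw(), which is a die() underneath, so one way to keep a long run alive past bad accessions (a sketch of my own, using a mock object in place of a live Bio::DB::GenBank handle so it runs offline) is to trap each fetch in eval {}:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trap BioPerl's exceptions so one bad accession is skipped
# instead of killing the whole run with exit code 255.
sub try_fetch {
    my ($db, $acc) = @_;
    my $seq = eval { $db->get_Seq_by_acc($acc) };
    if ($@) {
        warn "skipping $acc: $@";
        return undef;
    }
    return $seq;
}

# Mock stand-in for a live Bio::DB::GenBank handle, used here only so
# the trap can be demonstrated offline; the real call is identical.
package MockGB;
sub new { bless {}, shift }
sub get_Seq_by_acc {
    my ($self, $acc) = @_;
    die "no such accession: $acc\n" if $acc eq 'BOGUS';
    return "seq-for-$acc";
}
package main;

my $gb = MockGB->new;
print try_fetch( $gb, 'A3ZI37' ), "\n";    # fetched
try_fetch( $gb, 'BOGUS' );                 # warns and continues
```

With the real module, the same wrapper around get_Stream_by_acc (or per-accession get_Seq_by_acc) lets the loop log the "illegitimate" accessions and move on.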
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l