[Bioperl-l] get_Stream_by_query Terminates Prematurely
Chris Fields
cjfields at illinois.edu
Mon May 10 17:07:00 UTC 2010
(addendum added, sent too early)
On May 10, 2010, at 11:58 AM, Chris Fields wrote:
> 500000 sequences is way too many to request, even in a loop. Under most circumstances this is breaking NCBI's eutils policies:
>
> http://eutils.ncbi.nlm.nih.gov/#UserSystemRequirements
>
> so don't be too surprised this is failing (this would be around 1000 queries of 500 sequences each).
>
> You could try pulling down the raw sequences via Batch Entrez or using Bio::DB::EUtilities (which should die if an error occurs).
But you may still run into issues with eutils at some point, particularly if running this at peak times.
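
As a rough sketch of the "many small requests" idea (windows of 500 records, with a retry when one window fails), something like the following could work. It is plain Perl with no BioPerl calls: batch_windows, fetch_in_batches and the fake fetcher are hypothetical names made up for illustration, and the fetch callback is only a stand-in for wherever a real request would go (for example an efetch driven by the eutils retstart/retmax offsets).

```perl
use strict;
use warnings;

# Split a result set of $total records into [offset, size] windows,
# mirroring the retstart/retmax offsets that eutils requests take.
sub batch_windows {
    my ($total, $batch_size) = @_;
    die "batch size must be positive\n" unless $batch_size > 0;
    my @windows;
    for (my $start = 0; $start < $total; $start += $batch_size) {
        my $left = $total - $start;
        push @windows, [ $start, $left < $batch_size ? $left : $batch_size ];
    }
    return @windows;
}

# Call $arg{fetch}->($offset, $size) once per window. A window that dies
# is retried up to $arg{max_retries} times before the whole run gives up.
sub fetch_in_batches {
    my (%arg) = @_;
    my $retries = defined $arg{max_retries} ? $arg{max_retries} : 3;
    my $pause   = defined $arg{pause}       ? $arg{pause}       : 0;
    my @results;
    for my $window (batch_windows($arg{total}, $arg{batch_size})) {
        my ($start, $size) = @$window;
        my $failures = 0;
        while (1) {
            my @records;
            my $ok = eval { @records = $arg{fetch}->($start, $size); 1 };
            if ($ok) {
                push @results, @records;
                last;
            }
            my $err = $@ || "unknown error\n";
            $failures++;
            die "Batch at offset $start failed $failures times: $err"
                if $failures > $retries;
            select(undef, undef, undef, $pause) if $pause;   # wait, then retry
        }
        select(undef, undef, undef, $pause) if $pause;       # pause between batches
    }
    return @results;
}

# Hypothetical stand-in for a real fetcher: it simulates one dropped
# connection on the window that starts at offset 500.
my %failed_once;
my $fake_fetch = sub {
    my ($start, $size) = @_;
    die "simulated timeout\n" if $start == 500 && !$failed_once{$start}++;
    return map { "record_$_" } $start .. $start + $size - 1;
};

my @records = fetch_in_batches(
    total      => 1234,
    batch_size => 500,
    fetch      => $fake_fetch,
);
print scalar(@records), " records fetched\n";
```

The point of the design is that a failure costs one window of 500 rather than the whole stream, and each window can be retried on its own.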
>
> chris
>
> On May 9, 2010, at 9:22 PM, bergeycm wrote:
>
>>
>> Hi all,
>>
>> I'm attempting to query GenBank for all sequences' lengths for a given
>> taxon. I'm using get_Stream_by_query(), but only to grab the species,
>> length, and accession. The genus of interest has almost 500,000 GB entries,
>> though, and my code hangs up at odd points in the info-gathering loop.
>> (Often after only 300 or 400 iterations.) The problem is that
>> $stream_obj->next_seq (of Bio::SeqIO::genbank) eventually comes back
>> undefined.
>>
>> I've tried wrapping the next_seq portion of the code in an eval block, but
>> to no avail. Is there a way to split a query into a bunch of small streams
>> that aren't too much to ask? Or is there a way to pick up a dropped SeqIO
>> stream? I think the connection is timing out and the stream is being lost.
>> Any advice is greatly appreciated, as I'm fairly new to BioPerl.
>>
>> - bergeycm
>>
>>
>>
>> use Bio::DB::GenBank;
>> use Bio::DB::Query::GenBank;
>>
>>
>> # Get general things ready to go for querying GenBank
>> my %options;
>> $options{'-maxids'} = '500000'; # There are presently 460,184 sequences
>> $options{'-db'} = 'nucleotide';
>> $options{'-query'} = "Pongo [ORGN]"; # Orangutans
>>
>>
>> my $query_obj = Bio::DB::Query::GenBank->new(%options);
>> my $total = $query_obj->count;
>>
>> my $gb_obj = Bio::DB::GenBank->new();
>> my $stream_obj = $gb_obj->get_Stream_by_query($query_obj);
>>
>> # Restrict info to just what I'll be using. No sequence necessary.
>> my $builder = $stream_obj->sequence_builder();
>> $builder->want_none();
>> $builder->add_wanted_slot('species','length','accession');
>>
>> my $c = 0;
>>
>> for (1 .. $total) {
>>     eval {
>>         my $seq_obj = $stream_obj->next_seq;
>>         my $flavor = $seq_obj->species;
>>         print $c, "\t", $flavor->scientific_name, " (", $flavor->id, ")\t",
>>             $seq_obj->length, "\t", $seq_obj->accession, "\n";
>>     };
>>
>>     # Report the error caught by eval ($@, not $!)
>>     if ($@) {
>>         print $@, "\n";
>>     }
>>
>>     # Pause for a little over a third of a second
>>     select(undef, undef, undef, 0.35);
>>
>>     $c++;
>> }
>>
>>
>>
>> --
>> View this message in context: http://old.nabble.com/get_Stream_by_query-Terminates-Prematurely-tp28506482p28506482.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l