[Bioperl-l] get_Stream_by_query Terminates Prematurely

Mon May 10 17:07:00 UTC 2010

(addendum added, sent too early)

On May 10, 2010, at 11:58 AM, Chris Fields wrote:

> 500000 sequences is way too many to request, even in a loop.  Under most circumstances this is breaking NCBI's eutils policies:
> 
> http://eutils.ncbi.nlm.nih.gov/#UserSystemRequirements
> 
> so don't be too surprised this is failing (this would be around 1000 queries of 500 sequences per query).
> 
> You could try pulling down the raw sequence via Batch Entrez or using Bio::DB::EUtilities (which should die if an error occurs).

But you may still run into issues with eutils at some point, particularly if running this at peak times.
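
For what it's worth, here is a minimal sketch of the batching arithmetic in plain Perl. batch_windows is a hypothetical helper, not part of BioPerl. Each [start, size] pair is meant to correspond to the retstart and retmax parameters of one eutils fetch request; how those get passed depends on the client you use. The batch size of 500 and the 460,184 total are the numbers from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): split $total records into
# [start, size] windows holding at most $batch_size records each.
sub batch_windows {
    my ($total, $batch_size) = @_;
    die "batch size must be positive\n" unless $batch_size > 0;

    my @windows;
    for (my $start = 0; $start < $total; $start += $batch_size) {
        my $remaining = $total - $start;
        my $size = $remaining < $batch_size ? $remaining : $batch_size;
        push @windows, [$start, $size];
    }
    return @windows;
}

# Numbers taken from the thread: 460,184 records, 500 per request.
my @windows = batch_windows(460_184, 500);
printf "%d requests needed\n", scalar @windows;    # prints "921 requests needed"
printf "last window: start=%d size=%d\n", @{ $windows[-1] };
```

Even split up this way, that is still over 900 requests, so pausing between them and running outside peak hours is advisable.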

> 
> chris
> 
> On May 9, 2010, at 9:22 PM, bergeycm wrote:
> 
>> 
>> Hi all,
>> 
>> I'm attempting to query GenBank for all sequences' lengths for a given
>> taxon. I'm using get_Stream_by_query(), but only to grab the species,
>> length, and accession. The genus of interest has almost 500,000 GB entries,
>> though, and my code hangs up at odd points in the info-gathering loop.
>> (Often after only 300 or 400 iterations.) The problem is that
>> $stream_obj->next_seq (of Bio::SeqIO::genbank) eventually comes back
>> undefined.
>> 
>> I've tried wrapping the next_seq portion of the code in an eval block, but
>> to no avail. Is there a way to split a query into a bunch of small streams
>> that aren't too much to ask? Or is there a way to pick up a dropped SeqIO
>> stream? I think the connection is timing out and the stream is being lost.
>> Any advice is greatly appreciated, as I'm fairly new to BioPerl.
>> 
>> - bergeycm
>> 
>> 
>> 
>> use Bio::DB::GenBank;
>> use Bio::DB::Query::GenBank;
>> 
>> 
>> # Get general things ready to go for querying GenBank
>> my %options;
>> $options{'-maxids'} = '500000';		# There are presently 460,184 sequences
>> $options{'-db'} = 'nucleotide';
>> $options{'-query'} = "Pongo [ORGN]";	# Orangutans
>> 
>> 
>> my $query_obj = Bio::DB::Query::GenBank->new(%options);	
>> my $total = $query_obj->count;
>> 
>> my $gb_obj = Bio::DB::GenBank->new();
>> my $stream_obj = $gb_obj->get_Stream_by_query($query_obj);
>> 
>> # Restrict info to just what I'll be using. No sequence necessary.
>> my $builder = $stream_obj->sequence_builder();
>> $builder->want_none();
>> $builder->add_wanted_slot('species','length','accession');
>> 
>> my $c = 0;
>> 
>> for (1 .. $total) {
>> 	eval {
>> 		my $seq_obj =  $stream_obj->next_seq;
>> 		my $flavor = $seq_obj->species;			
>> 		print $c, "\t", $flavor->scientific_name, " (", $flavor->id, ")\t",
>> $seq_obj->length, "\t", $seq_obj->accession, "\n";			
>> 	};
>> 
>> 	if ($@) {
>> 		# eval puts its error in $@ (not $!); "\n" needs double quotes
>> 		print $@, "\n";
>> 	}
>> 	
>> 	# Pause for a little over a third of a second
>> 	select(undef, undef, undef, 0.35);
>> 	
>> 	$c++;
>> }
>> 
>> 
>> 
>> -- 
>> View this message in context: http://old.nabble.com/get_Stream_by_query-Terminates-Prematurely-tp28506482p28506482.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l