[Bioperl-l] genbank

Chris Fields cjfields at illinois.edu
Tue Nov 30 03:42:14 UTC 2010


On Nov 29, 2010, at 8:54 PM, Dimitar Kenanov wrote:

> On 11/30/2010 10:19 AM, Jason Stajich wrote:
>> Dimitar -
>> 
>> In terms of your question - a GenBank db query previously (ie 4-5 years ago when this was written) WOULD NOT return a sequence if a GenPept ID was specified so we had to have a separate module for GenBank and GenPept db querying since there was a different set of parameters - I think that changed so that most of the queries can run through GenBank
>> 
>> I see that must have been improved at NCBI.  For the record if you want the full GenPept record with features and annotations you just request a different db, in this case 'gb' for genbank instead of the fasta source
>> As in: http://gist.github.com/721012
>> 
>> But maybe you already figured it out?
>> 
>> -jason
>> Chris Fields wrote, On 11/29/10 6:39 AM:
>>> On Nov 29, 2010, at 3:35 AM, Dimitar Kenanov wrote:
>>> 
>>>> Hi again,
>>>> it seems that when i download (with 'download_query_genbank.pl') the whole proteome from NCBI in fasta format it is first being downloaded and from it is being created some kind of SeqFastaSpeedFactory and after that from it is being copied to the output file. But i want to download and write to output file one by one so i can see the download progress(which is working for genbank data).
>>>> 
>>>> Its frustrating :)
>>>> 
>>>> Any ideas where to look for solution
>>>> Cheers
>>>> Dimitar
>>> 
>>> You can't do this with the default script, but you can use a modified version and, where you are retrieving a sequence stream, in the last four lines:
>>> 
>>> my $stream = $dbh->get_Stream_by_query($query);
>>> while( my $seq = $stream->next_seq ) {
>>>    $out->write_seq($seq);
>>> }
>>> 
>>> insert an iterator in the loop that indicates progress.  Realize the sequence data is processed through Bio::SeqIO, so it won't be exactly the same as what is retrieved from GenBank, but it should be very close.
>>> 
>>> If you want raw sequence, you can use Bio::DB::EUtilities, but it's a bit more complicated.
>>> 
>>> chris
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
> 
> Thank you Jason,
> i figured that out yes :)
> I know am wandering how to get the GI list so i can download by ID or ACC and not by query when i download fasta. I have the simple code:
> -------------
> my $query_str='Caenorhabditis elegans[organism] AND refseq[filter]';
> my $query=Bio::DB::Query::GenBank->new(-db=>'protein',
>                                       -query=>$query_str);
> 
> my $count=$query->count;
> my @ids=$query->ids; <------- error msg here
> ------------
> but i get an error msg:
> 
> MSG: Id list has been truncated even after maxids requested.
> 
> How can i get the ID/ACC? Any idea?
> 
> Thank you
> Dimitar

Dimitar, 

You are retrieving a huge number of IDs; if you print out $count above, the total is 23906.  Set -maxids to raise the returned default maximum number of IDs higher:

----------------------------------------------
use Bio::DB::Query::GenBank;

my $query_str='Caenorhabditis elegans[organism] AND refseq[filter]';
my $query=Bio::DB::Query::GenBank->new(-maxids => 40000,
                                       -db=>'protein',
                                      -query=>$query_str);

my $count=$query->count;
say "Count: $count";

my @ids=$query->ids;
say scalar(@ids); # equal to $count
----------------------------------------------

Realize, though, you must submit these in batches of ~300 if retrieving sequences (IIRC, Bio::DB::GenBank only uses GET instead of POST, so there is a URL length limit).  Bio::DB::EUtilities can retrieve more, about ~3000 or so in a batch, when using POST.

chris







More information about the Bioperl-l mailing list