[Bioperl-l] genbank

Tue Nov 30 02:54:50 UTC 2010

On 11/30/2010 10:19 AM, Jason Stajich wrote:
> Dimitar -
>
> In terms of your question - a GenBank db query previously (ie 4-5 
> years ago when this was written) WOULD NOT return a sequence if a 
> GenPept ID was specified so we had to have a separate module for 
> GenBank and GenPept db querying since there was a different set of 
> parameters - I think that changed so that most of the queries can run 
> through GenBank
>
> I see that must have been improved at NCBI.  For the record if you 
> want the full GenPept record with features and annotations you just 
> request a different db, in this case 'gb' for genbank instead of the 
> fasta source
> As in: http://gist.github.com/721012
>
> But maybe you already figured it out?
>
> -jason
> Chris Fields wrote, On 11/29/10 6:39 AM:
>> On Nov 29, 2010, at 3:35 AM, Dimitar Kenanov wrote:
>>
>>> Hi again,
>>> it seems that when i download (with 'download_query_genbank.pl') the 
>>> whole proteome from NCBI in fasta format it is first being 
>>> downloaded and from it is being created some kind of 
>>> SeqFastaSpeedFactory and after that from it is being copied to the 
>>> output file. But i want to download and write to output file one by 
>>> one so i can see the download progress(which is working for genbank 
>>> data).
>>>
>>> Its frustrating :)
>>>
>>> Any ideas where to look for solution
>>> Cheers
>>> Dimitar
>>
>> You can't do this with the default script, but you can use a modified 
>> version and, where you are retrieving a sequence stream, in the last 
>> four lines:
>>
>> my $stream = $dbh->get_Stream_by_query($query);
>> while( my $seq = $stream->next_seq ) {
>>     $out->write_seq($seq);
>> }
>>
>> insert an iterator in the loop that indicates progress.  Realize the 
>> sequence data is processed through Bio::SeqIO, so it won't be exactly 
>> the same as what is retrieved from GenBank, but it should be very close.
>>
>> If you want raw sequence, you can use Bio::DB::EUtilities, but it's a 
>> bit more complicated.
>>
>> chris
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

Thank you Jason,
i figured that out yes :)
I know am wandering how to get the GI list so i can download by ID or 
ACC and not by query when i download fasta. I have the simple code:
-------------
my $query_str='Caenorhabditis elegans[organism] AND refseq[filter]';
my $query=Bio::DB::Query::GenBank->new(-db=>'protein',
                                        -query=>$query_str);

my $count=$query->count;
my @ids=$query->ids; <------- error msg here
------------
but i get an error msg:

MSG: Id list has been truncated even after maxids requested.

How can i get the ID/ACC? Any idea?

Thank you
Dimitar

-- 
Dimitar Kenanov
Post doctoral fellow
Bioinformatics Institute
A*STAR Singapore
tel: +65 6478 8514
email: dimitark at bii.a-star.edu.sg