[Bioperl-l] How to download EST files via bioperl script?

Tue Jul 10 17:08:35 UTC 2007

Hi Alberto,

Yes, I know that the NCBI website only offers to show up to 500 
entries per page. However, I simply ignored that setting (which doesn't 
mean I hadn't seen it), pulled down the "Send to" menu and chose 
"File". A small window popped up, and after I said yes to it, the 
download started. You might ask how I know that it was not a batch 
of only 5 (default selection) or 500 ESTs. To be honest, I didn't know 
at first. But the download has grown to millions of bytes since 
then (because of my bad network connection, I have no idea when it will 
finish), and that doesn't look like a small batch of fewer than one 
thousand ESTs. In fact, I wrote a script to count the sequences in 
the temporary file and got a number much bigger than ten thousand. So I 
guess it works.
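
A counting script of the kind mentioned above can be sketched in plain 
Perl as follows. This is only a minimal sketch, not the script that was 
actually run: the sub name count_records is made up for illustration, and 
it assumes the partial download is either FASTA (records start with ">") 
or GenBank flat file (records start with "LOCUS" at the start of a line).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count sequence records in a (possibly still growing) download file.
# Assumption: FASTA records begin with '>' and GenBank flat-file records
# begin with 'LOCUS' in column 1. count_records is a hypothetical helper.
sub count_records {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my %count = (fasta => 0, genbank => 0);
    while (my $line = <$fh>) {
        $count{fasta}++   if $line =~ /^>/;
        $count{genbank}++ if $line =~ /^LOCUS\s/;
    }
    close $fh;
    return \%count;
}

# Report counts only when a file name is given on the command line.
if (@ARGV) {
    my $c = count_records($ARGV[0]);
    print "FASTA headers:       $c->{fasta}\n";
    print "GenBank LOCUS lines: $c->{genbank}\n";
}
```

Since the file is still being written while it is counted, the last 
record may be truncated, so the number is best read as an estimate.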

BTW, I never thought Bio::DB::GenBank could do that! Again, thank you guys!

Xing

Alberto Davila wrote:
> Hi Xing,
>
> Unfortunately that did not work for me... there are 5133 T. brucei ESTs 
> (http://www.ncbi.nlm.nih.gov/sites/entrez?term=txid5691[Organism:exp]&cmd=Search&db=nucest&QueryKey=8) 
> and 13971 from T. cruzi 
> (http://www.ncbi.nlm.nih.gov/sites/entrez?term=txid5693[Organism:exp]&cmd=Search&db=nucest&QueryKey=11) 
> that I cannot download at once in GenBank format... even when I select 
> "GenBank" format in the Display menu I can only see and get/download 500 
> ESTs each time...
>
> I also downloaded all ESTs from GenBank (a pity there are no subsets of 
> them!) but merging them all generates a file bigger than 120 GB to be 
> processed...
>
> I just asked Diogo (my student) to give the script sent by Chris 
> Fields a try... so fingers crossed ;-)
>
> Cheers, Alberto
>
>
> Xing Hu wrote:
>   
>> Thank you guys.
>>
>> I have to confess how stupid I was. The easiest way seems to be to 
>> use the NCBI Taxonomy Browser, as suggested by Alex. As a matter of 
>> fact, I knew about it, but I thought it was necessary to have all items 
>> selected before pressing save to launch the download. So I was desperate 
>> to find a button that could do that without hundreds of thousands of 
>> clicks from me. "What about selecting none of those items at all?" This 
>> idea finally came to me after days of struggling, and the problem was solved.
>>
>> Xing
>>
>>
>>
>> Chris Fields wrote:
>>     
>>> Caveat: if you have millions of ESTs please consider NOT using my 
>>> eutil script below or NCBI Batch Entrez, which would repeatedly hit 
>>> the NCBI server thousands of times.  At least try looking for other 
>>> ways to retrieve the data you want (ftp, organism-specific resources 
>>> like Ensembl, so on), or run any scripts or data retrieval in off 
>>> hours so you don't overtax the NCBI server.
>>>
>>> There is a way you can use BioPerl if you don't mind living on the 
>>> bleeding edge by using bioperl-live (core code from CVS).  I have been 
>>> working on a set of modules for the last year (Bio::DB::EUtilities) 
>>> which interact with all the various eutils, for building data pipelines 
>>> which use the NCBI CGI interface.  You could possibly retrieve all 
>>> relevant ESTs using a variation of the example script here:
>>>
>>> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esearch-.3Eefetch
>>>
>>> Note that the code examples do NOT work with rel. 1.5.2 code as the 
>>> API has changed quite a bit; I'm working to rectify some of that.
>>>
>>> The script I would use is below.  It retrieves batches of 500 
>>> sequences (in fasta format) at a time, for a total of 10000 max seq 
>>> records, saving the raw record data directly to a file (appending as 
>>> you go along).  I added an eval block to check the server status and 
>>> redo the call up to 4 times before giving up completely.  Using eval 
>>> this way hasn't been extensively tested but should work.
>>>
>>> ---------------------------------------
>>>
>>> use strict;
>>> use warnings;
>>> use Bio::DB::EUtilities;
>>>
>>> my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
>>>                                        -db => 'nucest',
>>>                                        -term => 'txid3702',
>>>                                        -usehistory => 'y',
>>>                                        -keep_histories => 1);
>>>
>>> my $count = $factory->get_count;
>>>
>>> print "Count: $count\n";
>>>
>>> if (my $hist = $factory->next_History) {
>>>     print "History returned\n";
>>>     # note db carries over from above
>>>     $factory->set_parameters(-eutil => 'efetch',
>>>                              -rettype => 'fasta',
>>>                              -history => $hist);
>>>     my ($retmax, $retstart) = (500,0);
>>>     my $retry = 1;
>>>     # set max # of seq records to return
>>>     my $maxcount = $count < 10000 ? $count : 10000;
>>>     RETRIEVE_SEQS:
>>>     while ($retstart < $maxcount) {
>>>         print "Returning from ", $retstart + 1, " to ",
>>>               $retstart + $retmax, "\n";
>>>         $factory->set_parameters(-retmax => $retmax,
>>>                                 -retstart => $retstart);
>>>         # check in case of server error
>>>         eval{
>>>             $factory->get_Response(-file => ">>ESTs.fas");
>>>         };
>>>         if ($@) {
>>>             die "Server error: $@.  Try again later" if $retry == 5;
>>>             print STDERR "Server error, redo #$retry\n";
>>>             $retry++ && redo RETRIEVE_SEQS;
>>>         }
>>>         $retstart += $retmax;
>>>     }
>>> }
>>>
>>>
>>> ---------------------------------------
>>>
>>>
>>> chris
>>>
>>> On Jul 9, 2007, at 7:25 AM, Alexander Kozik wrote:
>>>
>>>       
>>>> To download genomic sequences or ESTs for any organism (in various
>>>> formats) you can use NCBI Taxonomy Browser:
>>>> http://www.ncbi.nlm.nih.gov/Taxonomy/
>>>>
>>>> you can use taxonomy id to access different organisms, Arabidopsis for
>>>> example (3702):
>>>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide&cmd=Search&dopt=DocSum&term=txid3702 
>>>>
>>>>
>>>> or by direct web link:
>>>> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Arabidopsis+thaliana&lvl=0&srchmode=1 
>>>>
>>>>
>>>> assembled genomes can be accessed via ftp:
>>>> ftp://ftp.ncbi.nih.gov/genomes/
>>>>
>>>> To download a large number of selected sequences (ESTs for example) you
>>>> can use batch Entrez:
>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
>>>> http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
>>>> (select EST for EST, it's critical)
>>>>
>>>> It seems, to solve the problem you describe, you don't need to use
>>>> bioperl. NCBI GenBank Entrez provides all necessary tools to work on
>>>> these simple and frequent tasks.
>>>>
>>>> -Alex
>>>>
>>>> --Alexander Kozik
>>>> Bioinformatics Specialist
>>>> Genome and Biomedical Sciences Facility
>>>> 451 East Health Sciences Drive
>>>> University of California
>>>> Davis, CA 95616-8816
>>>> Phone: (530) 754-9127
>>>> email#1: akozik at atgc.org
>>>> email#2: akozik at gmail.com
>>>> web: http://www.atgc.org/
>>>>
>>>>
>>>>
>>>> Xing Hu wrote:
>>>>         
>>>>> Hi friends,
>>>>>
>>>>>     I wrote a script for getting genomic sequence files from GenBank.
>>>>> To do that, I used the Bio::DB::GenBank module to get each sequence
>>>>> via get_Seq_by_acc, and it works well. But this time, facing an
>>>>> enormous number of ESTs, I have no idea how to download them swiftly
>>>>> and elegantly.
>>>>>
>>>>>     PROBLEM DESCRIPTION:
>>>>>     goal: download all ESTs of a specific species from GenBank, say
>>>>> Arabidopsis thaliana or Oryza sativa (rice).
>>>>>     other: whether all the ESTs are in a single file or in separate
>>>>> files does not matter.
>>>>>
>>>>>     Can I use a BioPerl script to achieve that? And how? I would
>>>>> really appreciate any help.
>>>>>
>>>>> Xing.
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>           
>>> Christopher Fields
>>> Postdoctoral Researcher
>>> Lab of Dr. Robert Switzer
>>> Dept of Biochemistry
>>> University of Illinois Urbana-Champaign
>>>
>>>
>>>       