[Bioperl-l] How to download EST files via bioperl script?

Mon Jul 9 14:17:23 UTC 2007

Caveat: if you have millions of ESTs please consider NOT using my  
eutil script below or NCBI Batch Entrez, which would repeatedly hit  
the NCBI server thousands of times.  At least try looking for other  
ways to retrieve the data you want (ftp, organism-specific resources  
like Ensembl, so on), or run any scripts or data retrieval in off  
hours so you don't overtax the NCBI server.

There is a way you can use BioPerl if you don't mind living on the  
bleeding edge by using bioperl-live (core code from CVS).  I have  
been working on a set of modules for the last year  
(Bio::DB::EUtilities) which interact with all the various eutils for  
building data pipelines which uses the NCBI CGI interface.  You could  
possibly retrieve all relevant ESTs using a variation of the example  
script here:

http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esearch-.3Eefetch

Note that the code examples do NOT work with rel. 1.5.2 code as the  
API has changed quite a bit; I'm working to rectify some of that.

The script I would use is below.  It retrieves batches of 500  
sequences (in fasta format) at a time, for a total of 10000 max seq  
records, saving the raw record data directly to a file (appending as  
you go along).  I added an eval block to check the server status and  
redo the call up to 4 times before giving up completely.  Using eval  
this way hasn't been extensively tested but should work.

---------------------------------------

use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                        -db => 'nucest',
                                        -term => 'txid3702',
                                        -usehistory => 'y',
                                        -keep_histories => 1);

my $count = $factory->get_count;

print "Count: $count\n";

if (my $hist = $factory->next_History) {
     print "History returned\n";
     # note db carries over from above
     $factory->set_parameters(-eutil => 'efetch',
                              -rettype => 'fasta',
                              -history => $hist);
     my ($retmax, $retstart) = (500,0);
     my $retry = 1;
     my $maxcount = $count < 10000 ? $count : 10000; # set max # seq  
records to return
     RETRIEVE_SEQS:
     while ($retstart < $maxcount) {
         print "Returning from ",$retstart+1," to ",$retstart+ 
$retmax,"\n";
         $factory->set_parameters(-retmax => $retmax,
                                 -retstart => $retstart);
         # check in case of server error
         eval{
             $factory->get_Response(-file => ">>ESTs.fas");
         };
         if ($@) {
             die "Server error: $@.  Try again later" if $retry == 5;
             print STDERR "Server error, redo #$retry\n";
             $retry++ && redo RETRIEVE_SEQS;
         }
         $retstart += $retmax;
     }
}

---------------------------------------

chris

On Jul 9, 2007, at 7:25 AM, Alexander Kozik wrote:

> To download genomic sequences or ESTs for any organism (in various
> formats) you can use NCBI Taxonomy Browser:
> http://www.ncbi.nlm.nih.gov/Taxonomy/
>
> you can use taxonomy id to access different organisms, Arabidopsis for
> example (3702):
> http://www.ncbi.nlm.nih.gov/sites/entrez? 
> db=Nucleotide&cmd=Search&dopt=DocSum&term=txid3702
>
> or by direct web link:
> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi? 
> mode=Undef&name=Arabidopsis+thaliana&lvl=0&srchmode=1
>
> assembled genomes can be accessed via ftp:
> ftp://ftp.ncbi.nih.gov/genomes/
>
> To download large amount of selected sequences (ESTs for example) you
> can use batch Entrez:
> http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
> http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
> (select EST for EST, it's critical)
>
> It seems, to solve the problem you describe, you don't need to use
> bioperl. NCBI GenBank Entrez provides all necessary tools to work on
> these simple and frequent tasks.
>
> -Alex
>
> -- 
> Alexander Kozik
> Bioinformatics Specialist
> Genome and Biomedical Sciences Facility
> 451 East Health Sciences Drive
> University of California
> Davis, CA 95616-8816
> Phone: (530) 754-9127
> email#1: akozik at atgc.org
> email#2: akozik at gmail.com
> web: http://www.atgc.org/
>
>
>
> Xing Hu wrote:
>> Hi friends,
>>
>>     I wrote a script for getting genomic sequence file from  
>> GenBank. To
>> fulfill that target, I used DB::GenBank module to get the sequence  
>> via
>> get_Seq_by_acc, and it works well. But this time, facing enormous  
>> amount
>> of ESTs, I have no idea how to download them swiftly and elegantly.
>>
>>     PROBLEM DESCRIPTION:
>>     goal: download all EST files of a specific species from  
>> GenBank, say
>> Arabidopsis Thaliana or Oryza sativa(rice).
>>     other: whether all of ESTs are in a single file or separatedly
>> placed does not matter.
>>
>>     Can I use a bioperl script to achieve that? And How? I really
>> appreciate.
>>
>> Xing.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign