[Bioperl-l] How to download EST files via bioperl script?

Xing Hu xing.y.hu at gmail.com
Tue Jul 10 13:29:36 UTC 2007


Thanks you guys.

I had to confess that how stupid I was. The easiest way seems to be the 
way using NCBI Taxonomy Browser which suggested by alex. As a matter of 
fact, I knew that but I thought it was necessary to have all items 
selected before pressing save to launch download. So I was desperate to 
find a button that could achieve that without hundreds of thousands of 
clicking by me. "What about select none of those items at all?" -- This 
idea finally came to me after days of struggling and the problem was solved.

Xing



Chris Fields wrote:
> Caveat: if you have millions of ESTs please consider NOT using my 
> eutil script below or NCBI Batch Entrez, which would repeatedly hit 
> the NCBI server thousands of times.  At least try looking for other 
> ways to retrieve the data you want (ftp, organism-specific resources 
> like Ensembl, so on), or run any scripts or data retrieval in off 
> hours so you don't overtax the NCBI server.
>
> There is a way you can use BioPerl if you don't mind living on the 
> bleeding edge by using bioperl-live (core code from CVS).  I have been 
> working on a set of modules for the last year (Bio::DB::EUtilities) 
> which interact with all the various eutils for building data pipelines 
> which uses the NCBI CGI interface.  You could possibly retrieve all 
> relevant ESTs using a variation of the example script here:
>
> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esearch-.3Eefetch
>
> Note that the code examples do NOT work with rel. 1.5.2 code as the 
> API has changed quite a bit; I'm working to rectify some of that.
>
> The script I would use is below.  It retrieves batches of 500 
> sequences (in fasta format) at a time, for a total of 10000 max seq 
> records, saving the raw record data directly to a file (appending as 
> you go along).  I added an eval block to check the server status and 
> redo the call up to 4 times before giving up completely.  Using eval 
> this way hasn't been extensively tested but should work.
>
> ---------------------------------------
>
> use Bio::DB::EUtilities;
>
> my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
>                                        -db => 'nucest',
>                                        -term => 'txid3702',
>                                        -usehistory => 'y',
>                                        -keep_histories => 1);
>
> my $count = $factory->get_count;
>
> print "Count: $count\n";
>
> if (my $hist = $factory->next_History) {
>     print "History returned\n";
>     # note db carries over from above
>     $factory->set_parameters(-eutil => 'efetch',
>                              -rettype => 'fasta',
>                              -history => $hist);
>     my ($retmax, $retstart) = (500,0);
>     my $retry = 1;
>     my $maxcount = $count < 10000 ? $count : 10000; # set max # seq 
> records to return
>     RETRIEVE_SEQS:
>     while ($retstart < $maxcount) {
>         print "Returning from ",$retstart+1," to 
> ",$retstart+$retmax,"\n";
>         $factory->set_parameters(-retmax => $retmax,
>                                 -retstart => $retstart);
>         # check in case of server error
>         eval{
>             $factory->get_Response(-file => ">>ESTs.fas");
>         };
>         if ($@) {
>             die "Server error: $@.  Try again later" if $retry == 5;
>             print STDERR "Server error, redo #$retry\n";
>             $retry++ && redo RETRIEVE_SEQS;
>         }
>         $retstart += $retmax;
>     }
> }
>
>
> ---------------------------------------
>
>
> chris
>
> On Jul 9, 2007, at 7:25 AM, Alexander Kozik wrote:
>
>> To download genomic sequences or ESTs for any organism (in various
>> formats) you can use NCBI Taxonomy Browser:
>> http://www.ncbi.nlm.nih.gov/Taxonomy/
>>
>> you can use taxonomy id to access different organisms, Arabidopsis for
>> example (3702):
>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide&cmd=Search&dopt=DocSum&term=txid3702 
>>
>>
>> or by direct web link:
>> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Arabidopsis+thaliana&lvl=0&srchmode=1 
>>
>>
>> assembled genomes can be accessed via ftp:
>> ftp://ftp.ncbi.nih.gov/genomes/
>>
>> To download large amount of selected sequences (ESTs for example) you
>> can use batch Entrez:
>> http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
>> http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
>> (select EST for EST, it's critical)
>>
>> It seems, to solve the problem you describe, you don't need to use
>> bioperl. NCBI GenBank Entrez provides all necessary tools to work on
>> these simple and frequent tasks.
>>
>> -Alex
>>
>> --Alexander Kozik
>> Bioinformatics Specialist
>> Genome and Biomedical Sciences Facility
>> 451 East Health Sciences Drive
>> University of California
>> Davis, CA 95616-8816
>> Phone: (530) 754-9127
>> email#1: akozik at atgc.org
>> email#2: akozik at gmail.com
>> web: http://www.atgc.org/
>>
>>
>>
>> Xing Hu wrote:
>>> Hi friends,
>>>
>>>     I wrote a script for getting genomic sequence file from GenBank. To
>>> fulfill that target, I used DB::GenBank module to get the sequence via
>>> get_Seq_by_acc, and it works well. But this time, facing enormous 
>>> amount
>>> of ESTs, I have no idea how to download them swiftly and elegantly.
>>>
>>>     PROBLEM DESCRIPTION:
>>>     goal: download all EST files of a specific species from GenBank, 
>>> say
>>> Arabidopsis Thaliana or Oryza sativa(rice).
>>>     other: whether all of ESTs are in a single file or separatedly
>>> placed does not matter.
>>>
>>>     Can I use a bioperl script to achieve that? And How? I really
>>> appreciate.
>>>
>>> Xing.
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
>





More information about the Bioperl-l mailing list