[Bioperl-l] How to download EST files via bioperl script?
Xing Hu
xing.y.hu at gmail.com
Tue Jul 10 13:29:36 UTC 2007
Thank you, guys.
I have to confess how stupid I was. The easiest way seems to be the
NCBI Taxonomy Browser, as Alex suggested. As a matter of fact, I knew
about it, but I thought all items had to be selected before pressing
save to start the download. So I was desperate to find a button that
would do that without hundreds of thousands of clicks on my part.
"What about selecting none of the items at all?" -- this idea finally
came to me after days of struggling, and the problem was solved.
Xing
Chris Fields wrote:
> Caveat: if you have millions of ESTs please consider NOT using my
> eutil script below or NCBI Batch Entrez, which would repeatedly hit
> the NCBI server thousands of times. At least try looking for other
> ways to retrieve the data you want (ftp, organism-specific resources
> like Ensembl, and so on), or run any scripts or data retrieval in off
> hours so you don't overtax the NCBI server.
>
> There is a way you can use BioPerl, if you don't mind living on the
> bleeding edge, by using bioperl-live (core code from CVS). I have been
> working for the last year on a set of modules (Bio::DB::EUtilities)
> that interact with the various eutils through the NCBI CGI interface
> and can be used to build data pipelines. You could possibly retrieve
> all relevant ESTs using a variation of the example script here:
>
> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esearch-.3Eefetch
>
> Note that the code examples do NOT work with rel. 1.5.2 code as the
> API has changed quite a bit; I'm working to rectify some of that.
>
> The script I would use is below. It retrieves batches of 500
> sequences (in fasta format) at a time, for a total of 10000 max seq
> records, saving the raw record data directly to a file (appending as
> you go along). I added an eval block to check the server status and
> redo the call up to 4 times before giving up completely. Using eval
> this way hasn't been extensively tested but should work.
>
> ---------------------------------------
>
> use strict;
> use warnings;
> use Bio::DB::EUtilities;
>
> # search the EST database for all records under taxonomy id 3702
> my $factory = Bio::DB::EUtilities->new(-eutil          => 'esearch',
>                                        -db             => 'nucest',
>                                        -term           => 'txid3702',
>                                        -usehistory     => 'y',
>                                        -keep_histories => 1);
>
> my $count = $factory->get_count;
>
> print "Count: $count\n";
>
> if (my $hist = $factory->next_History) {
>     print "History returned\n";
>     # note db carries over from above
>     $factory->set_parameters(-eutil   => 'efetch',
>                              -rettype => 'fasta',
>                              -history => $hist);
>     my ($retmax, $retstart) = (500,0);
>     my $retry = 1;
>     # set max # seq records to return
>     my $maxcount = $count < 10000 ? $count : 10000;
>     RETRIEVE_SEQS:
>     while ($retstart < $maxcount) {
>         print "Returning from ",$retstart+1," to ",$retstart+$retmax,"\n";
>         $factory->set_parameters(-retmax   => $retmax,
>                                  -retstart => $retstart);
>         # check in case of server error
>         eval{
>             # append the raw fasta records for this batch to the file
>             $factory->get_Response(-file => ">>ESTs.fas");
>         };
>         if ($@) {
>             # give up once the batch has been redone 4 times
>             die "Server error: $@. Try again later" if $retry == 5;
>             print STDERR "Server error, redo #$retry\n";
>             $retry++ && redo RETRIEVE_SEQS;
>         }
>         $retstart += $retmax;
>     }
> }
>
>
> ---------------------------------------
>
>
> chris
>
> On Jul 9, 2007, at 7:25 AM, Alexander Kozik wrote:
>
>> To download genomic sequences or ESTs for any organism (in various
>> formats) you can use NCBI Taxonomy Browser:
>> http://www.ncbi.nlm.nih.gov/Taxonomy/
>>
>> You can use the taxonomy ID to access different organisms, Arabidopsis
>> for example (3702):
>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide&cmd=Search&dopt=DocSum&term=txid3702
>>
>>
>> or by direct web link:
>> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Arabidopsis+thaliana&lvl=0&srchmode=1
>>
>>
>> assembled genomes can be accessed via ftp:
>> ftp://ftp.ncbi.nih.gov/genomes/
>>
>> To download a large number of selected sequences (ESTs for example),
>> you can use Batch Entrez:
>> http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
>> http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
>> (select EST as the database when fetching ESTs; this is critical)
>>
>> It seems, to solve the problem you describe, you don't need to use
>> bioperl. NCBI GenBank Entrez provides all necessary tools to work on
>> these simple and frequent tasks.
>>
>> -Alex
>>
>> --Alexander Kozik
>> Bioinformatics Specialist
>> Genome and Biomedical Sciences Facility
>> 451 East Health Sciences Drive
>> University of California
>> Davis, CA 95616-8816
>> Phone: (530) 754-9127
>> email#1: akozik at atgc.org
>> email#2: akozik at gmail.com
>> web: http://www.atgc.org/
>>
>>
>>
>> Xing Hu wrote:
>>> Hi friends,
>>>
>>> I wrote a script for getting genomic sequence files from GenBank. To
>>> do that, I used the Bio::DB::GenBank module to get the sequence via
>>> get_Seq_by_acc, and it works well. But this time, facing an enormous
>>> number of ESTs, I have no idea how to download them swiftly and
>>> elegantly.
>>>
>>> PROBLEM DESCRIPTION:
>>> goal: download all EST files of a specific species from GenBank,
>>> say Arabidopsis thaliana or Oryza sativa (rice).
>>> other: whether all of the ESTs are in a single file or placed
>>> separately does not matter.
>>>
>>> Can I use a bioperl script to achieve that? And how? I would really
>>> appreciate any help.
>>>
>>> Xing.
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
>
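For anyone who wants to try the batching-and-retry idea from the script
above without touching the NCBI server, here is a minimal offline sketch.
It is only an illustration under stated assumptions: fetch_batch and
fetch_all are hypothetical names made up for this example and are not part
of BioPerl, and fetch_batch merely simulates a server that fails twice
before answering. Two things differ from the quoted script: the retry
counter is reset for each batch, and the last window is clipped so it never
asks for records past the requested total.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real network call such as an efetch request.
# It fails on its first two calls to simulate a flaky server, then succeeds.
my $calls = 0;
sub fetch_batch {
    my ($retstart, $retmax) = @_;
    $calls++;
    die "simulated server error\n" if $calls <= 2;
    return "records $retstart.." . ($retstart + $retmax - 1);
}

# Fetch $total records in windows of at most $retmax, retrying each
# window up to $max_retries times before giving up.
sub fetch_all {
    my ($total, $retmax, $max_retries) = @_;
    my @batches;
    my $retstart = 0;
    while ($retstart < $total) {
        # Shrink the final window so we never ask past $total.
        my $left = $total - $retstart;
        my $size = $left < $retmax ? $left : $retmax;
        my ($result, $attempt) = (undef, 0);
        until (defined $result) {
            # eval traps the die from a failed call; $result stays undef.
            $result = eval { fetch_batch($retstart, $size) };
            next if defined $result;
            $attempt++;
            die "Giving up at offset $retstart: $@" if $attempt > $max_retries;
            warn "Server error, retry #$attempt\n";
        }
        push @batches, $result;
        $retstart += $size;
    }
    return @batches;
}

# 1200 records in batches of 500, allowing 4 retries per batch.
my @got = fetch_all(1200, 500, 4);
print "$_\n" for @got;
```

In a real run, the body of fetch_batch would be replaced by the actual
retrieval call, and a short sleep between retries would probably be kinder
to the server.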