[Bioperl-l] How to download EST files via bioperl script?
Chris Fields
cjfields at uiuc.edu
Mon Jul 9 14:17:23 UTC 2007
Caveat: if you have millions of ESTs please consider NOT using my
eutil script below or NCBI Batch Entrez, which would repeatedly hit
the NCBI server thousands of times. At least try looking for other
ways to retrieve the data you want (ftp, organism-specific resources
like Ensembl, so on), or run any scripts or data retrieval in off
hours so you don't overtax the NCBI server.
There is a way you can use BioPerl if you don't mind living on the
bleeding edge by using bioperl-live (core code from CVS). I have
been working on a set of modules for the last year
(Bio::DB::EUtilities) which interact with all the various eutils for
building data pipelines which uses the NCBI CGI interface. You could
possibly retrieve all relevant ESTs using a variation of the example
script here:
http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esearch-.3Eefetch
Note that the code examples do NOT work with rel. 1.5.2 code as the
API has changed quite a bit; I'm working to rectify some of that.
The script I would use is below. It retrieves batches of 500
sequences (in fasta format) at a time, for a total of 10000 max seq
records, saving the raw record data directly to a file (appending as
you go along). I added an eval block to check the server status and
redo the call up to 4 times before giving up completely. Using eval
this way hasn't been extensively tested but should work.
---------------------------------------
use Bio::DB::EUtilities;
my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
-db => 'nucest',
-term => 'txid3702',
-usehistory => 'y',
-keep_histories => 1);
my $count = $factory->get_count;
print "Count: $count\n";
if (my $hist = $factory->next_History) {
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil => 'efetch',
-rettype => 'fasta',
-history => $hist);
my ($retmax, $retstart) = (500,0);
my $retry = 1;
my $maxcount = $count < 10000 ? $count : 10000; # set max # seq
records to return
RETRIEVE_SEQS:
while ($retstart < $maxcount) {
print "Returning from ",$retstart+1," to ",$retstart+
$retmax,"\n";
$factory->set_parameters(-retmax => $retmax,
-retstart => $retstart);
# check in case of server error
eval{
$factory->get_Response(-file => ">>ESTs.fas");
};
if ($@) {
die "Server error: $@. Try again later" if $retry == 5;
print STDERR "Server error, redo #$retry\n";
$retry++ && redo RETRIEVE_SEQS;
}
$retstart += $retmax;
}
}
---------------------------------------
chris
On Jul 9, 2007, at 7:25 AM, Alexander Kozik wrote:
> To download genomic sequences or ESTs for any organism (in various
> formats) you can use NCBI Taxonomy Browser:
> http://www.ncbi.nlm.nih.gov/Taxonomy/
>
> you can use taxonomy id to access different organisms, Arabidopsis for
> example (3702):
> http://www.ncbi.nlm.nih.gov/sites/entrez?
> db=Nucleotide&cmd=Search&dopt=DocSum&term=txid3702
>
> or by direct web link:
> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?
> mode=Undef&name=Arabidopsis+thaliana&lvl=0&srchmode=1
>
> assembled genomes can be accessed via ftp:
> ftp://ftp.ncbi.nih.gov/genomes/
>
> To download large amount of selected sequences (ESTs for example) you
> can use batch Entrez:
> http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
> http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
> (select EST for EST, it's critical)
>
> It seems, to solve the problem you describe, you don't need to use
> bioperl. NCBI GenBank Entrez provides all necessary tools to work on
> these simple and frequent tasks.
>
> -Alex
>
> --
> Alexander Kozik
> Bioinformatics Specialist
> Genome and Biomedical Sciences Facility
> 451 East Health Sciences Drive
> University of California
> Davis, CA 95616-8816
> Phone: (530) 754-9127
> email#1: akozik at atgc.org
> email#2: akozik at gmail.com
> web: http://www.atgc.org/
>
>
>
> Xing Hu wrote:
>> Hi friends,
>>
>> I wrote a script for getting genomic sequence file from
>> GenBank. To
>> fulfill that target, I used DB::GenBank module to get the sequence
>> via
>> get_Seq_by_acc, and it works well. But this time, facing enormous
>> amount
>> of ESTs, I have no idea how to download them swiftly and elegantly.
>>
>> PROBLEM DESCRIPTION:
>> goal: download all EST files of a specific species from
>> GenBank, say
>> Arabidopsis Thaliana or Oryza sativa(rice).
>> other: whether all of ESTs are in a single file or separatedly
>> placed does not matter.
>>
>> Can I use a bioperl script to achieve that? And How? I really
>> appreciate.
>>
>> Xing.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
More information about the Bioperl-l
mailing list