[Bioperl-l] How to download EST files via bioperl script?
Xing Hu
xing.y.hu at gmail.com
Tue Jul 10 17:08:35 UTC 2007
Hi Alberto,
Yes, I know that the NCBI website only offers to show up to 500 entries
at a time. However, I simply ignored that setting (which doesn't mean I
didn't see it), pulled down the "Send to" menu and chose "File". A small
window popped up, and after I said yes to it the download started. You
might ask how I know that it was not a batch of only 5 (the default
selection) or 500 ESTs. To be honest, I didn't know at first. But the
download has grown to millions of bytes since then (because of my bad
network connection, I have no idea when it will finish), and that
doesn't look like a small batch of fewer than one thousand ESTs. In
fact, I wrote a script to count the sequences in the temporary file and
got a number much bigger than ten thousand. So I guess it works.
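
A minimal sketch of such a counting check, in Perl. This is not the
script mentioned above; the sub name count_records and the file name in
the usage comment are made up for illustration. It assumes the download
is either a GenBank flat file, where every complete record ends with a
line containing only "//", or a FASTA file, where every record starts
with ">".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count sequence records read from an open filehandle.
# $format is 'genbank' (records end with a "//" line) or
# 'fasta' (records start with a ">" header line).
sub count_records {
    my ($fh, $format) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        if ($format eq 'genbank') {
            $count++ if $line =~ m{^//\s*$};
        }
        else {
            $count++ if $line =~ /^>/;
        }
    }
    return $count;
}

# Hypothetical usage: perl count_records.pl genbank partial_download.gb
if (@ARGV == 2) {
    my ($format, $path) = @ARGV;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    print count_records($fh, $format), "\n";
    close $fh;
}
```

Counting the "//" lines only counts complete GenBank records, so a
record cut off at the end of a still-running download is not included.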
BTW, I never thought Bio::DB::GenBank could do that! Again, thank you guys!
Xing
Alberto Davila wrote:
> Hi Xing,
>
> Unfortunately that did not work for me... there are 5133 T. brucei ESTs
> (http://www.ncbi.nlm.nih.gov/sites/entrez?term=txid5691[Organism:exp]&cmd=Search&db=nucest&QueryKey=8)
> and 13971 from T. cruzi
> (http://www.ncbi.nlm.nih.gov/sites/entrez?term=txid5693[Organism:exp]&cmd=Search&db=nucest&QueryKey=11)
> that I cannot download at once in GenBank format... even when I select
> "GenBank" format in the Display menu I can only see and get/download 500
> ESTs each time...
>
> I also downloaded all ESTs from GenBank (a pity there are no subsets of
> them!) but merging all of them generates a file bigger than 120 GB to
> be processed...
>
> I just asked Diogo (my student) to give the script sent by Chris Fields
> a try... so fingers crossed ;-)
>
> Cheers, Alberto
>
>
> Xing Hu wrote:
>
>> Thank you, guys.
>>
>> I have to confess how stupid I was. The easiest way seems to be the
>> NCBI Taxonomy Browser, as suggested by Alex. As a matter of fact, I
>> knew about it, but I thought it was necessary to have all items
>> selected before pressing save to launch the download. So I was
>> desperate to find a button that could do that without hundreds of
>> thousands of clicks from me. "What about selecting none of the items
>> at all?" -- this idea finally came to me after days of struggling, and
>> the problem was solved.
>>
>> Xing
>>
>>
>>
>> Chris Fields wrote:
>>
>>> Caveat: if you have millions of ESTs please consider NOT using my
>>> eutil script below or NCBI Batch Entrez, which would repeatedly hit
>>> the NCBI server thousands of times. At least try looking for other
>>> ways to retrieve the data you want (ftp, organism-specific resources
>>> like Ensembl, and so on), or run any scripts or data retrieval in off
>>> hours so you don't overtax the NCBI server.
>>>
>>> There is a way you can use BioPerl if you don't mind living on the
>>> bleeding edge by using bioperl-live (core code from CVS). I have been
>>> working on a set of modules for the last year (Bio::DB::EUtilities)
>>> which interact with all the various eutils for building data pipelines
>>> which use the NCBI CGI interface. You could possibly retrieve all
>>> relevant ESTs using a variation of the example script here:
>>>
>>> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esearch-.3Eefetch
>>>
>>> Note that the code examples do NOT work with rel. 1.5.2 code as the
>>> API has changed quite a bit; I'm working to rectify some of that.
>>>
>>> The script I would use is below. It retrieves batches of 500
>>> sequences (in fasta format) at a time, for a total of 10000 max seq
>>> records, saving the raw record data directly to a file (appending as
>>> you go along). I added an eval block to check the server status and
>>> redo the call up to 4 times before giving up completely. Using eval
>>> this way hasn't been extensively tested but should work.
>>>
>>> ---------------------------------------
>>>
>>> use strict;
>>> use warnings;
>>> use Bio::DB::EUtilities;
>>>
>>> # esearch: find all nucest records for taxon 3702 and keep the
>>> # result set on the NCBI history server
>>> my $factory = Bio::DB::EUtilities->new(-eutil          => 'esearch',
>>>                                        -db             => 'nucest',
>>>                                        -term           => 'txid3702',
>>>                                        -usehistory     => 'y',
>>>                                        -keep_histories => 1);
>>>
>>> my $count = $factory->get_count;
>>>
>>> print "Count: $count\n";
>>>
>>> if (my $hist = $factory->next_History) {
>>>     print "History returned\n";
>>>     # note db carries over from above
>>>     $factory->set_parameters(-eutil   => 'efetch',
>>>                              -rettype => 'fasta',
>>>                              -history => $hist);
>>>     my ($retmax, $retstart) = (500, 0);
>>>     my $retry = 1;
>>>     # set max # seq records to return
>>>     my $maxcount = $count < 10000 ? $count : 10000;
>>>     RETRIEVE_SEQS:
>>>     while ($retstart < $maxcount) {
>>>         print "Returning from ", $retstart + 1, " to ",
>>>               $retstart + $retmax, "\n";
>>>         $factory->set_parameters(-retmax   => $retmax,
>>>                                  -retstart => $retstart);
>>>         # check in case of server error
>>>         eval {
>>>             # append the raw records for this batch to ESTs.fas
>>>             $factory->get_Response(-file => ">>ESTs.fas");
>>>         };
>>>         if ($@) {
>>>             die "Server error: $@. Try again later" if $retry == 5;
>>>             print STDERR "Server error, redo #$retry\n";
>>>             $retry++ && redo RETRIEVE_SEQS;
>>>         }
>>>         $retstart += $retmax;
>>>     }
>>> }
>>>
>>>
>>> ---------------------------------------
>>>
>>>
>>> chris
>>>
>>> On Jul 9, 2007, at 7:25 AM, Alexander Kozik wrote:
>>>
>>>
>>>> To download genomic sequences or ESTs for any organism (in various
>>>> formats) you can use NCBI Taxonomy Browser:
>>>> http://www.ncbi.nlm.nih.gov/Taxonomy/
>>>>
>>>> you can use taxonomy id to access different organisms, Arabidopsis for
>>>> example (3702):
>>>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide&cmd=Search&dopt=DocSum&term=txid3702
>>>>
>>>>
>>>> or by direct web link:
>>>> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Arabidopsis+thaliana&lvl=0&srchmode=1
>>>>
>>>>
>>>> assembled genomes can be accessed via ftp:
>>>> ftp://ftp.ncbi.nih.gov/genomes/
>>>>
>>>> To download a large number of selected sequences (ESTs for example) you
>>>> can use batch Entrez:
>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
>>>> http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
>>>> (select EST for EST, it's critical)
>>>>
>>>> It seems, to solve the problem you describe, you don't need to use
>>>> bioperl. NCBI GenBank Entrez provides all necessary tools to work on
>>>> these simple and frequent tasks.
>>>>
>>>> -Alex
>>>>
>>>> --Alexander Kozik
>>>> Bioinformatics Specialist
>>>> Genome and Biomedical Sciences Facility
>>>> 451 East Health Sciences Drive
>>>> University of California
>>>> Davis, CA 95616-8816
>>>> Phone: (530) 754-9127
>>>> email#1: akozik at atgc.org
>>>> email#2: akozik at gmail.com
>>>> web: http://www.atgc.org/
>>>>
>>>>
>>>>
>>>> Xing Hu wrote:
>>>>
>>>>> Hi friends,
>>>>>
>>>>> I wrote a script for getting genomic sequence files from GenBank. To
>>>>> do that, I used the Bio::DB::GenBank module to get each sequence via
>>>>> get_Seq_by_acc, and it works well. But this time, facing an enormous
>>>>> number of ESTs, I have no idea how to download them swiftly and
>>>>> elegantly.
>>>>>
>>>>> PROBLEM DESCRIPTION:
>>>>> goal: download all EST files of a specific species from GenBank, say
>>>>> Arabidopsis thaliana or Oryza sativa (rice).
>>>>> other: whether all of the ESTs are in a single file or in separate
>>>>> files does not matter.
>>>>>
>>>>> Can I use a bioperl script to achieve that? And how? I would really
>>>>> appreciate any help.
>>>>>
>>>>> Xing.
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>> Christopher Fields
>>> Postdoctoral Researcher
>>> Lab of Dr. Robert Switzer
>>> Dept of Biochemistry
>>> University of Illinois Urbana-Champaign
>>>
>>>
>>>