[Bioperl-l] Pulling down data from NCBI

Chris Fields cjfields at illinois.edu
Tue Feb 2 14:06:09 UTC 2010


On Feb 2, 2010, at 7:57 AM, Robert Bradbury wrote:

> 
> What species/chromosome is this?
> > my $id = 'AAPP01000000[ACCN]';
> 
> 
> One can usually download the genome sequence files, chromosome files, or fasta files from the various FTP sites (almost all of the major genomes have them); that would generally be the fastest way to do it.  Or simply look up the Genome sequence at NCBI and download it using a web browser.  There is standard documentation on how to convert genome sequences into fasta files.

Yes, Robert, that is true.  This is just an example sequence file demonstrating the problem; retrieving the sequences via FTP would be more efficient.  However, if one had hundreds of these, and the IDs weren't known ahead of time (they came from a prior analysis), then FTP would no longer be the more efficient route.  And this is a common issue.
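
For that kind of case, something along these lines is roughly what I have in mind.  It is only a sketch of the generic esearch-with-history pattern in Bio::DB::EUtilities, not a tested fix for the example ID above.  The sub names (batch_offsets, fetch_fasta), the batch size of 500, the 'nucleotide' database and the email address are all placeholders I picked for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: offsets for paging through $count records,
# $retmax records at a time.
sub batch_offsets {
    my ($count, $retmax) = @_;
    die "retmax must be positive\n" unless $retmax > 0;
    my @offsets;
    for (my $start = 0; $start < $count; $start += $retmax) {
        push @offsets, $start;
    }
    return @offsets;
}

# Hypothetical helper: run an esearch for $term, then efetch the hits as
# FASTA into $outfile in batches, using the server-side history.
sub fetch_fasta {
    my ($term, $outfile, $email) = @_;
    require Bio::DB::EUtilities;    # loaded lazily; needs BioPerl installed

    my $factory = Bio::DB::EUtilities->new(
        -eutil      => 'esearch',
        -db         => 'nucleotide',    # assumption: adjust to your data
        -term       => $term,
        -email      => $email,
        -usehistory => 'y',
    );
    my $count = $factory->get_count;
    my $hist  = $factory->next_History or die "No history data returned\n";

    # Reuse the same factory for efetch, pointing it at the search history.
    $factory->set_parameters(
        -eutil   => 'efetch',
        -rettype => 'fasta',
        -history => $hist,
    );

    open my $out, '>', $outfile or die "Can't open $outfile: $!\n";
    my $retmax = 500;                   # assumption: arbitrary batch size
    for my $retstart (batch_offsets($count, $retmax)) {
        $factory->set_parameters(-retmax => $retmax, -retstart => $retstart);
        # Stream each chunk of the response straight to the output file.
        $factory->get_Response(
            -cb => sub { my ($data) = @_; print {$out} $data; }
        );
    }
    close $out or die "Can't close $outfile: $!\n";
    return $count;
}

# Only touches the network when called with a search term and output file.
if (@ARGV >= 2) {
    my ($term, $outfile) = @ARGV;
    my $n = fetch_fasta($term, $outfile, 'you@example.org');  # placeholder
    print STDERR "Requested $n records\n";
}
```

Run it as, e.g., perl fetch.pl 'some query' out.fa.  A real script would also want retries around get_Response, since the server does fail now and then.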

> If you are looking for the "big" genomes which may not be in NCBI yet, go to the Broad Institute.  Some bacterial sequences may still only be at TIGR or JGI until they migrate upward.
> 
> But using the standard "system" utilities to do this is usually a far better way to do this than wrestling in BioPerl.  FTP and Web Servers are written in C or C++ which is compiled and much more efficient than an interpreted language like Perl or Java.
> 
> Robert

Actually, most if not all of the web services stuff is in Java, and many genome sites use languages other than C/C++ (Perl, Python, Ruby, etc.).  Ensembl, FlyBase, etc. come to mind.

chris


