[Bioperl-l] RE: Bioperl-l Digest, Vol 3, Issue 45
Clancy, Kevin
kclancy at informaxinc.com
Thu Mar 20 13:34:10 EST 2003
Kerr,
If you look on NCBI's FTP site (ftp://ftp.ncbi.nih.gov/blast/db) you will see both the nt and nr sequence collections. You could simply download these and use them as sequence sources. nr should be proteins; nt should be nucleic acid sequences.
NCBI will probably not appreciate your hitting the Entrez server many thousands of times, and your sys admin would probably be a bit miffed as well, particularly if you are affecting other services while doing this.
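To put a rough number on that load, here is a back-of-the-envelope sketch in Python. The record count and the two requests per protein come from your message; the pacing of one request every three seconds is purely an assumed figure for illustration, not a limit stated anywhere in this thread.

```python
# Rough estimate of how long a one-request-at-a-time crawl would take.
PROTEINS_IN_NR = 1_363_299      # figure quoted in the original question
REQUESTS_PER_PROTEIN = 2        # protein page, then nucleotide record
SECONDS_PER_REQUEST = 3.0       # ASSUMPTION: illustrative pacing only


def crawl_estimate(n_records, per_record, seconds_each):
    """Return (total_requests, total_days) for a sequential crawl."""
    total_requests = n_records * per_record
    total_days = total_requests * seconds_each / 86_400  # seconds per day
    return total_requests, total_days


requests_needed, days_needed = crawl_estimate(
    PROTEINS_IN_NR, REQUESTS_PER_PROTEIN, SECONDS_PER_REQUEST
)
print(f"{requests_needed:,} requests, about {days_needed:.0f} days")
```

Under that assumed pacing the job comes to roughly three months of continuous querying, which is why a bulk download or batched retrieval is the friendlier route.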
You could try a couple of approaches. NCBI has an Entrez service that takes lists of gi numbers and allows you to download them as a batch; you might try that first. The other alternative is to simply use the nr and nt databases (which are in FASTA format) and, when you identify sequences that you are interested in, retrieve those via Entrez to get the fully annotated records. Both of these techniques are a bit more friendly than a mass query of NCBI.
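As a minimal sketch of the batching idea, the Python below pulls gi numbers out of FASTA header lines and groups them into a handful of fetch URLs rather than one request per sequence. Several things here are assumptions on my part and not from this thread: the header layout (">gi|NUMBER|..."), the E-utilities efetch address and its parameters (check NCBI's current documentation before relying on them), the batch size, and the toy records. The code only builds URLs; it does not contact NCBI.

```python
import re
from urllib.parse import urlencode

# ASSUMPTION: header lines look like ">gi|111|ref|NP_000001.1| description".
GI_PATTERN = re.compile(r"gi\|(\d+)")

# ASSUMPTION: NCBI E-utilities efetch endpoint; verify against NCBI docs.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def gi_numbers(fasta_lines):
    """Yield the first gi number found on each FASTA header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            match = GI_PATTERN.search(line)
            if match:
                yield match.group(1)


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_urls(gis, size=200, db="protein", rettype="gb"):
    """Build one fetch URL per batch of identifiers."""
    for chunk in batches(list(gis), size):
        query = urlencode({"db": db, "id": ",".join(chunk),
                           "rettype": rettype, "retmode": "text"})
        yield f"{EFETCH_URL}?{query}"


# Toy records, invented for illustration only.
example = [
    ">gi|111|ref|NP_000001.1| hypothetical protein A",
    "MKTAYIAKQR",
    ">gi|222|ref|NP_000002.1| hypothetical protein B",
    "MSLLTEVETY",
    ">gi|333|ref|NP_000003.1| hypothetical protein C",
    "MADEEKLPPG",
]
urls = list(batch_urls(gi_numbers(example), size=2))
for url in urls:
    print(url)
```

With a batch size in the hundreds, the number of requests drops by the same factor, and you would still want to pause between batches.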
A final approach is to download GenBank (this will take a while) and then query it locally using fastacmd or some other home-grown tool. For instance, the BioPerl FAQ covers querying and retrieving sequences from an indexed database. If you have access to EMBOSS, it also has an indexing facility that you can access via BioPerl's extensions.
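To illustrate what a home-grown index amounts to, here is a small Python sketch that records the byte offset of every record in a FASTA file and then seeks straight to a requested entry. It is only a conceptual stand-in, not how fastacmd, BioPerl, or EMBOSS actually implement their indexes; the function names, file name, and toy sequences are all hypothetical.

```python
import os
import tempfile


def build_index(fasta_path):
    """Map each sequence ID (first word after '>') to its byte offset."""
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:  # ignore headers with no identifier
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch(fasta_path, index, seq_id):
    """Return the sequence for `seq_id` by seeking to its record."""
    with open(fasta_path, "rb") as handle:
        handle.seek(index[seq_id])
        handle.readline()  # skip the header line
        parts = []
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip().decode("ascii"))
    return "".join(parts)


# Toy file, invented for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.fasta")
    with open(path, "w") as out:
        out.write(">seqA first toy record\nACGT\nACGT\n"
                  ">seqB second toy record\nTTGGCC\n")
    idx = build_index(path)
    seq_a = fetch(path, idx, "seqA")
    seq_b = fetch(path, idx, "seqB")
    print(seq_a, seq_b)
```

The index is built in one pass over the file; after that, each lookup costs a single seek instead of a scan. For a file the size of nt you would want to store the index on disk, which is what the existing tools do for you.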
Hope this helps.
kevin clancy
-----Original Message-----
From: bioperl-l-request at bioperl.org [mailto:bioperl-l-request at bioperl.org]
Sent: Thu 3/20/2003 12:02 PM
To: bioperl-l at bioperl.org
Cc:
Subject: Bioperl-l Digest, Vol 3, Issue 45
Send Bioperl-l mailing list submissions to
bioperl-l at bioperl.org
To subscribe or unsubscribe via the World Wide Web, visit
http://bioperl.org/mailman/listinfo/bioperl-l
or, via email, send a message with subject or body 'help' to
bioperl-l-request at bioperl.org
You can reach the person managing the list at
bioperl-l-owner at bioperl.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioperl-l digest..."
Today's Topics:
1. Question regarding NR database (Kerr Wall)
----------------------------------------------------------------------
Message: 1
Date: Thu, 20 Mar 2003 11:21:49 -0500
From: Kerr Wall <pkerrwall at psu.edu>
Subject: [Bioperl-l] Question regarding NR database
To: <bioperl-l at bioperl.org>
Message-ID: <BA9F54CD.8523%pkerrwall at psu.edu>
Content-Type: text/plain; charset="US-ASCII"
Hi,
I am somewhat new to Bioperl and have checked the mailing list archive with
no luck. I am trying to come up with a way to get all of the nucleotide cds
sequences that are in the NR protein database. There are currently
1,363,299 protein sequences in NCBI's NR database file. I would like to get
a nucleotide sequence for each of these protein sequences.
I have devised a way to use Entrez to get the sequences but I am wondering
if there is an easier way to do this. I can retrieve the html file for each
protein sequence in NR using Entrez, then parse out the CDS html link for
each protein, then find the nucleotide sequence file in Entrez, and finally
parse out the coding region nucleotide sequence. This would require
1,363,299 x 2 requests to Entrez for such a job. Is it ok to hammer the
Entrez server this many times?
I've downloaded the NT database as well but am not sure how to link the two
files. Hopefully someone has already had to do this and has thought about
the logic to accomplish such a job.
Thanks,
Kerr