[Bioperl-l] Question regarding NR database

Jason Stajich jason at cgt.mc.duke.edu
Thu Mar 20 12:12:18 EST 2003


On Thu, 20 Mar 2003, Kerr Wall wrote:

> Hi,
>
> I am somewhat new to Bioperl and have checked the mailing list archive with
> no luck.  I am trying to come up with a way to get all of the nucleotide cds
> sequences that are in the NR protein database.  There are currently
> 1,363,299 protein sequences in NCBI's NR database file.  I would like to get
> a nucleotide sequence for each of these protein sequences.
>
> I have devised a way to use Entrez to get the sequences but I am wondering
> if there is an easier way to do this.  I can retrieve the html file for each
> protein sequence in NR using Entrez, then parse out the CDS html link fore
> each protein, then find the nucleotide sequence file in Entrez, and finally
> parse out the coding region nucleotide sequence.  This would require
> 1,363,299 x 2 requests to Entrez for such a job.  Is it ok to hammer the
> Entrez server this many times?

NO!

Unfortunately I think NCBI has only organized things from the NT->AA
perspective since that is typically how the data is generated and put into
the system.

For each of the sequences there is the originating NT sequence accession
in parens - you write a little regexp to get that from $seq->description()

You'll have to get the NT data download genbank -- from the ftpsite in
/genbank, You'll want gbmam*, gbpln*, gbpri*, gbrod*,gbvrl*

Index these files with Bio::Index::GenBank or Bio::DB::Flat

Now write a while loop which reads each peptide sequence from the file,
gets the originating NT sequence - extract that sequence from the Index
via the get_Seq_by_acc call.  Now get all the CDSes from the NT record
(see the script in scripts/seq/extract_feature_seq ), (> 1 one my
exist in the file) you need to match the correct CDS to the peptide via
the protein_id field or by location.  You can get the spliced sequence
from (see the extract_feature_seq script) for the feature and translate it
to make sure they match (would be interested to know how many don't!).

You write out all the CDS sequences to one file and you're done.  You
might make sure the primary_id for the CDS sequence is somehow linked to
your peptide sequence so you can easily look up one from the other when
you indexed both FASTA files.


Hope that gets you started, if you've not done any bioperl programming you
might have a look at the howtos and the tutorial first.  Should be less
than a 100 lines of code I would expect.

>
> I've downloaded the NT database as well but not sure how to link the two
> files.  Hopefully someone has already had to do this and has thought about
> the logic to accomplish such a job.
>
> Thanks,
>
> Kerr
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu


More information about the Bioperl-l mailing list