[Bioperl-l] Question regarding NR database

Thu Mar 20 12:50:43 EST 2003

Kerr and Jason,

>(see the script in scripts/seq/extract_feature_seq )

Also see the script examples/seq/extract_cds.pl.

Brian O.

-----Original Message-----
From: bioperl-l-bounces at bioperl.org [mailto:bioperl-l-bounces at bioperl.org]On
Behalf Of Jason Stajich
Sent: Thursday, March 20, 2003 12:12 PM
To: Kerr Wall
Cc: bioperl-l at bioperl.org
Subject: Re: [Bioperl-l] Question regarding NR database

On Thu, 20 Mar 2003, Kerr Wall wrote:

> Hi,
>
> I am somewhat new to Bioperl and have checked the mailing list archive
with
> no luck.  I am trying to come up with a way to get all of the nucleotide
cds
> sequences that are in the NR protein database.  There are currently
> 1,363,299 protein sequences in NCBI's NR database file.  I would like to
get
> a nucleotide sequence for each of these protein sequences.
>
> I have devised a way to use Entrez to get the sequences but I am wondering
> if there is an easier way to do this.  I can retrieve the html file for
each
> protein sequence in NR using Entrez, then parse out the CDS html link fore
> each protein, then find the nucleotide sequence file in Entrez, and
finally
> parse out the coding region nucleotide sequence.  This would require
> 1,363,299 x 2 requests to Entrez for such a job.  Is it ok to hammer the
> Entrez server this many times?

NO!

Unfortunately I think NCBI has only organized things from the NT->AA
perspective since that is typically how the data is generated and put into
the system.

For each of the sequences there is the originating NT sequence accession
in parens - you write a little regexp to get that from $seq->description()

You'll have to get the NT data download genbank -- from the ftpsite in
/genbank, You'll want gbmam*, gbpln*, gbpri*, gbrod*,gbvrl*

Index these files with Bio::Index::GenBank or Bio::DB::Flat

Now write a while loop which reads each peptide sequence from the file,
gets the originating NT sequence - extract that sequence from the Index
via the get_Seq_by_acc call.  Now get all the CDSes from the NT record
(see the script in scripts/seq/extract_feature_seq ), (> 1 one my
exist in the file) you need to match the correct CDS to the peptide via
the protein_id field or by location.  You can get the spliced sequence
from (see the extract_feature_seq script) for the feature and translate it
to make sure they match (would be interested to know how many don't!).

You write out all the CDS sequences to one file and you're done.  You
might make sure the primary_id for the CDS sequence is somehow linked to
your peptide sequence so you can easily look up one from the other when
you indexed both FASTA files.

Hope that gets you started, if you've not done any bioperl programming you
might have a look at the howtos and the tutorial first.  Should be less
than a 100 lines of code I would expect.

>
> I've downloaded the NT database as well but not sure how to link the two
> files.  Hopefully someone has already had to do this and has thought about
> the logic to accomplish such a job.
>
> Thanks,
>
> Kerr
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu
_______________________________________________
Bioperl-l mailing list
Bioperl-l at bioperl.org
http://bioperl.org/mailman/listinfo/bioperl-l