[Bioperl-l] Finding coding sequences corresponding to protein sequences
Warren Gallin
wgallin at ualberta.ca
Mon Jan 22 18:46:13 UTC 2007
I have a set of gi numbers for protein sequences and I am trying to
write a script that will return the corresponding nucleic acid coding
sequences (CDS only) for each protein record.
Currently the script grabs the protein sequence SeqIO object from
GENPEPT and scans through the primary tags looking for CDS.
Problem number 1 is that not all of the protein records have a CDS
annotation that links to a nucleic acid sequence record (the first
example in my set of records is gi1345813). So, when that happens,
is there a straightforward way to find all protein records that have
amino acid sequences identical to the current one, aside from running
BLAST and then parsing the sequence identifiers on the alignment of
the first hit?
Problem number 2 is that sometimes the accession number for the
corresponding nucleotide sequence in the CDS-tagged feature is a
RefSeq entry, which causes my script to throw a fatal exception (the
first example in my set of records is gi31543024). Is there a
straightforward way to get to a non-RefSeq record given a RefSeq
accession number? Also, is there some way of evaluating the call to
GENBANK to determine if the record being requested is a RefSeq record
and bypass the call?
I would appreciate pointers to relevant documentation or examples so
I can learn more about the general issues that are involved, or
specific advice on how to make this kind of a script work.
I'm using Bioperl 1.5.2 on a OS X server.
Thanks,
Warren Gallin
More information about the Bioperl-l
mailing list