[BioPython] Sequence Annotation: sequence numbering
Jeffrey Chang
jchang@SMI.Stanford.EDU
Wed, 27 Jun 2001 22:54:08 -0700
At 4:51 PM +0300 6/26/01, Iddo Friedberg wrote:
>: [Iddo]
>: >I would like to start a discussion about the annotation of protein
>: >sequence numbering in Biopython. You are probably all aware of the fact
>[Leighton]
>: My own opinion tends to numbering all PDB/FSSP submissions in line with
>: their Swiss-Prot sequences, but that doesn't exactly give us a quick fix,
>: does it?
>
>Yes, that would be a good solution for the positional numbering problem.
>And a SwissProt - PDB mapper will be extremely useful to the
>sequence-structure community.
>
>However, it is not really within Biopython's scope to do so. (If anyone
>knows of such a database, please let us know!)
>
>Given the following sequence & numberings:
>
>sequence A C R L M P
>PDB 1 2 - 4 5 5A
>SwissProt 1 2 3 4 5 6
>
>A possible implementation would be:
>
>from Bio import SeqRecord, Seq
>from Bio import Alphabet
>
>my_seq = Seq.Seq('ACRLMP', Alphabet.ProteinAlphabet())
>pdb_positions = [(1,''), (2,''), (None,''), (4,''), (5,''), (5,'A')]
>sp_positions = [1, 2, 3, 4, 5, 6]
>my_seq_rec = SeqRecord.SeqRecord(my_seq)
>my_seq_rec.annotations['pdb_pos'] = pdb_positions
>my_seq_rec.annotations['sp_pos'] = sp_positions

Something like this would work, but it would also be nice to be able
to retrieve sequences based on specific nomenclatures. For example,
I'd expect something like:

    my_seq_rec.sp_pos[1]
    my_seq_rec.sp_pos[4:6]

to work.
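
As a rough illustration of the kind of access described above, here is a
minimal sketch in plain Python. NumberingView is a hypothetical helper, not
part of Biopython, and it assumes that a slice is half-open (the stop label
is excluded, as in ordinary Python slicing) and that labels within a scheme
are unique.

```python
class NumberingView:
    """Hypothetical helper: index a sequence by labels of one numbering scheme."""

    def __init__(self, sequence, labels):
        if len(sequence) != len(labels):
            raise ValueError("need exactly one label per residue")
        self._sequence = sequence
        # Map each label to its 0-based offset; None marks an unnumbered residue.
        self._offset = {
            label: i for i, label in enumerate(labels) if label is not None
        }

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Assumption: half-open slices, like ordinary Python sequences.
            start = 0 if key.start is None else self._offset[key.start]
            stop = (len(self._sequence) if key.stop is None
                    else self._offset[key.stop])
            return self._sequence[start:stop]
        return self._sequence[self._offset[key]]


sp_pos = NumberingView("ACRLMP", [1, 2, 3, 4, 5, 6])
first = sp_pos[1]       # residue labelled 1 in the SwissProt numbering
window = sp_pos[4:6]    # residues labelled 4 and 5 (stop label excluded)
```

Whether the stop label should be included is a design choice; the sketch
simply follows Python's own slicing convention.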

This, however, brings up semantic issues of how to deal with
residues that have no number in a given scheme:

    my_seq_rec.pdb_pos[(1,''):(4, '')]

(or my_seq_rec.pdb_pos["1":"4"])

Would this return AC or ACR?
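
To make the two possible answers concrete, here is a small sketch. The
function slice_by_label and its keep_unnumbered flag are hypothetical names
introduced only for illustration, and the slice is again assumed to be
half-open. It uses the PDB labels from the quoted example, where the third
residue has no number.

```python
def slice_by_label(sequence, labels, start, stop, keep_unnumbered):
    """Return the residues from label `start` up to, not including, `stop`.

    Hypothetical helper: `labels` holds (number, insertion_code) pairs, and a
    number of None marks a residue with no position in this scheme.
    """
    lo = labels.index(start)
    hi = labels.index(stop)
    picked = []
    for residue, (number, _insertion_code) in zip(sequence[lo:hi], labels[lo:hi]):
        if number is None and not keep_unnumbered:
            continue  # drop residues the numbering scheme does not know about
        picked.append(residue)
    return "".join(picked)


pdb_positions = [(1, ''), (2, ''), (None, ''), (4, ''), (5, ''), (5, 'A')]
with_gap = slice_by_label("ACRLMP", pdb_positions, (1, ''), (4, ''),
                          keep_unnumbered=True)
without_gap = slice_by_label("ACRLMP", pdb_positions, (1, ''), (4, ''),
                             keep_unnumbered=False)
```

With keep_unnumbered=True the unnumbered R is returned along with its
neighbours (ACR); with False only the residues that actually carry a PDB
number come back (AC). Either behaviour is defensible, which is the
semantic issue.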
>Comments on this? General comments? Can this be adapted to the genomic DNA
><--> cDNA problem?

Hmmm... This is tricky. At first, I thought no, because cDNA and
genomic DNA are different biological entities. However, they are
mappable to one another and could probably be considered a sequence
mapping problem. In that case, you could also make an argument that
something similar could be done for DNA<->protein as well.
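
Treated as a mapping problem, the cDNA <-> genomic case might look something
like the sketch below. Everything in it is an assumption for illustration:
the function name, the 0-based half-open exon intervals, the restriction to
the forward strand, and the made-up exon coordinates.

```python
def cdna_to_genomic(cdna_index, exons):
    """Map a 0-based cDNA offset to a 0-based genomic offset.

    Hypothetical helper: `exons` is a list of (start, end) half-open genomic
    intervals in transcript order, forward strand only.
    """
    remaining = cdna_index
    for start, end in exons:
        length = end - start
        if remaining < length:
            return start + remaining
        remaining -= length  # skip this exon and the intron that follows it
    raise IndexError("cDNA index lies beyond the last exon")


# Made-up example: two exons of lengths 3 and 4 separated by an intron.
exons = [(10, 13), (20, 24)]
mapped = [cdna_to_genomic(i, exons) for i in range(7)]
```

The mapping is just another per-position numbering of the cDNA, so it could
in principle be stored the same way as the pdb_pos and sp_pos lists above.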

Biopython certainly needs a way to handle multiple sequence
numberings. Being able to handle mappings in general would be the
icing on the cake.

Jeff