[Biopython-dev] Strange Genbank feature description: how should biopython handle this?

Danny Yoo dyoo at acoma.Stanford.EDU
Wed Aug 7 16:46:45 EDT 2002


Hi everyone,


Ok, I've been fiddling around with the GenBank parser.  In one of my test cases,
there's one particular entry that's very evil.  It comes from AP000423
(GI:5881673), as gene RPS12:

     gene            join(complement(98562..98793),complement(97999..98024),
                     complement(69611..69724),139856..140087,140625..140650)
                     /gene="rps12"




Here's how Biopython is initializing this feature:

###
type: gene
location: (98561..140650)
ref: None:None
strand: None
qualifiers:
	Key: gene, Value: ['rps12']
Sub-Features
type: gene
location: (98561..98793)
ref: None:None
strand: -1
qualifiers:

type: gene
location: (97998..98024)
ref: None:None
strand: -1
qualifiers:

type: gene
location: (69610..69724)
ref: None:None
strand: -1
qualifiers:

type: gene
location: (139855..140087)
ref: None:None
strand: None
qualifiers:

type: gene
location: (140624..140650)
ref: None:None
strand: None
qualifiers:
###


The LocationParser itself appears to be doing its job, as I see that:

###
Function('join', [Function('complement', [AbsoluteLocation(None,
Range(Integer(98562), Integer(98793)))]), Function('complement',
[AbsoluteLocation(None, Range(Integer(97999), Integer(98024)))]),
Function('complement', [AbsoluteLocation(None, Range(Integer(69611),
Integer(69724)))]), AbsoluteLocation(None, Range(Integer(139856),
Integer(140087))), AbsoluteLocation(None, Range(Integer(140625),
Integer(140650)))])
###
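
For comparison, the strands I'd expect for this entry can be worked out by
hand.  The sketch below is standalone Python, not Biopython code:
`expected_strands` and its regular expression are hypothetical helpers that
only handle a flat join(...) of plain ranges, each optionally wrapped in
complement(...).  It assumes a default strand of 1 that complement() flips
to -1, and it shifts each start down by one to match the output above.

```python
import re

# Hypothetical helper, not part of Biopython.  Matches one range such as
# "139856..140087", optionally wrapped as "complement(98562..98793)".
PART = re.compile(r"(complement\()?(\d+)\.\.(\d+)\)?")


def expected_strands(location):
    """Return (start, end, strand) for each piece of a flat join(...)."""
    parts = []
    for match in PART.finditer(location):
        is_complement, start, end = match.groups()
        # Assumption: pieces default to strand 1; complement() means -1.
        strand = -1 if is_complement else 1
        # The start is shifted down by one, as in the feature dump above.
        parts.append((int(start) - 1, int(end), strand))
    return parts


RPS12 = ("join(complement(98562..98793),complement(97999..98024),"
         "complement(69611..69724),139856..140087,140625..140650)")

for start, end, strand in expected_strands(RPS12):
    print("location: (%d..%d)  strand: %s" % (start, end, strand))
```

Under that assumption the last two pieces come out with strand 1 rather
than None.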



Having a strand of 'None' on the two sub-features that aren't wrapped in
complement() doesn't appear to be right.  I've been staring at
'Bio/GenBank/__init__.py' for a while, and it appears that the default
value for the strand isn't set unless self._seq_type is equal to "DNA".
I don't quite understand all of the code yet, but the following change
appears to fix this particular case:


Index: Bio/GenBank/__init__.py
===================================================================
RCS file: /home/repository/biopython/biopython/Bio/GenBank/__init__.py,v
retrieving revision 1.29
diff -u -r1.29 __init__.py
--- Bio/GenBank/__init__.py	2002/04/16 15:45:26	1.29
+++ Bio/GenBank/__init__.py	2002/08/07 20:43:28
@@ -636,8 +636,9 @@

         # assume positive strand to start with if we have DNA. The
         # complement in the location will change this later.
-        if self._seq_type == "DNA":
-            self._cur_feature.strand = 1
+##         if self._seq_type == "DNA":
+##             self._cur_feature.strand = 1
+        self._cur_feature.strand = 1

     def location(self, content):
         """Parse out location information from the location string.
@@ -735,7 +736,7 @@
             new_sub_feature.ref = cur_feature.ref
             new_sub_feature.ref_db = cur_feature.ref_db
             new_sub_feature.strand = cur_feature.strand
-
+            assert(new_sub_feature.strand in (1, -1)) ## debug
             # set the information for the inner element
             self._set_location_info(inner_element, new_sub_feature)



What's the right way of fixing this problem?  Thank you!



