[Biopython-dev] [Bug 2655] New: Sorting sub-features in BioSeq.py can return corrupted feature

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Thu Nov 13 15:59:02 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2655

           Summary: Sorting sub-features in BioSeq.py can return corrupted
                    feature
           Product: Biopython
           Version: 1.49b
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: cymon.cox at gmail.com


BioSeq.py retrieves SeqFeatures from a BioSQL database and sorts both the
features and any sub-features. The first sort is superfluous, and the second
sort is an error that can lead to a feature being returned corrupted, with its
sub-features in an incorrect order. So I've marked this as major.

I've been trying to implement the feature/sub-feature location test in
test_BioSQL_SeqIO.

Here's my solution (attached as patch1):

"""
        # Compare sub-feature locations:
        #
        # BioSQL currently does not store fuzzy locations, but instead stores
        # them as FeatureLocation.nofuzzy_start and FeatureLocation.nofuzzy_end.
        # Hence, the old_sub from SeqIO.parse() may have fuzzy locations while
        # new_sub locations from BioSQL will not be fuzzy.
        # The vast majority of cases will be comparisons of ExactPosition
        # class locations, so we'll try that first and catch the exceptions.

        try:
            assert str(old_sub.location) == str(new_sub.location), \
               "%s -> %s" % (str(old_sub.location), str(new_sub.location))
        except AssertionError, e:
            if isinstance(old_sub.location.start, ExactPosition) and \
                isinstance(new_sub.location.start, ExactPosition) and \
                isinstance(old_sub.location.end, ExactPosition) and \
                isinstance(new_sub.location.end, ExactPosition):
                # It's not a problem with fuzzy locations, re-raise
                raise AssertionError, e
            else:
                # At least one location is fuzzy
                assert old_sub.location.nofuzzy_start == \
                    new_sub.location.nofuzzy_start, \
                    "%s -> %s" % (old_sub.location.nofuzzy_start,
                                  new_sub.location.nofuzzy_start)
                assert old_sub.location.nofuzzy_end == \
                    new_sub.location.nofuzzy_end, \
                    "%s -> %s" % (old_sub.location.nofuzzy_end,
                                  new_sub.location.nofuzzy_end)
"""

This test causes errors in 3 of the test cases:
GenBank/extra_keywords.gb
GenBank/one_of.gb
GFF/NC_001422.gbk

e.g.:
Testing loading from genbank format file GenBank/extra_keywords.gb
 - TCCAGGGGATTCACGCGCA...TTG [Gp6GqZ3Q9foPG0HvyXguIGSJN8U] len 154329, AL138972.1
 - Retrieving by name/display_id 'DMBR25B3',
Traceback (most recent call last):
  File "test_BioSQL_SeqIO.py", line 371, in <module>
    compare_records(record, db_rec)
  File "test_BioSQL_SeqIO.py", line 280, in compare_records
    compare_features(old_f, new_f)
  File "test_BioSQL_SeqIO.py", line 185, in compare_features
    raise AssertionError, e
AssertionError: [153489:154269] -> [40:610]

This is because each of these records contains an unusual join(...) location
that includes a segment from a separate accession. For the above record:
join(153490..154269,AL121804.2:41..610,

(As an aside: how does the user know that the returned feature location is a
join with a separate accession? How does BioSQL/Biopython deal with this?)

The error is caused by _retrieve_features() in BioSeq.py, which sorts the
sub-features by start position:

BioSeq.py (lines 249-252):
                sub_feature_list.append((start, subfeature))
            sub_feature_list.sort()
            feature.sub_features = [sub_feature[1]
                                    for sub_feature in sub_feature_list]

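This is an error because it returns the sub-features out of order: a segment
on a separate accession has coordinates unrelated to those of the parent
record, so sorting on start can move it ahead of the segments it should follow.
Besides, this sub-feature sort and the SeqFeature sort are both unnecessary,
because the features and sub-features are stored in BioSQL by rank and
retrieved by rank, so they should already be in the correct order.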
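
To illustrate the reordering outside of Biopython, here is a minimal sketch
using plain tuples in place of sub-features. The (rank, start, end, accession)
layout and the function names are hypothetical, chosen only for this example.
The coordinates are the two segments visible in the join(...) quoted above; the
real record may contain further segments. The sort here uses a key function
rather than the (start, subfeature) tuples used in BioSeq.py, but it orders on
the same value.

```python
# Hypothetical stand-in for sub-feature rows retrieved in rank order.
# Layout assumed for illustration: (rank, start, end, accession).
rows_by_rank = [
    (1, 153489, 154269, None),          # 153490..154269 on the record itself
    (2, 40, 610, "AL121804.2"),         # AL121804.2:41..610 on another accession
]


def order_by_start(rows):
    """Mimic the reported behaviour: order sub-features on start position."""
    return sorted(rows, key=lambda row: row[1])


def order_by_rank(rows):
    """Keep the order in which the sub-features were stored (by rank)."""
    return sorted(rows, key=lambda row: row[0])


sorted_on_start = order_by_start(rows_by_rank)
kept_by_rank = order_by_rank(rows_by_rank)

# Sorting on start moves the remote segment in front of the local one,
# which is the mismatch reported in the traceback ([153489:154269] -> [40:610]).
print([(start, end) for _, start, end, _ in sorted_on_start])
print([(start, end) for _, start, end, _ in kept_by_rank])
```

In this sketch the rank ordering reproduces the order of the join(...) as
written, while the start ordering does not.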

Attached is a BioSeq.py patch (patch2) that removes both sort() calls.

With these patches applied, both test_BioSQL_SeqIO and test_BioSQL pass:

[cymon at chara Tests]$ python test_BioSQL_SeqIO.py > test_output
[cymon at chara Tests]$ diff -ruN test_output output/test_BioSQL_SeqIO 
--- test_output 2008-11-13 15:39:20.000000000 +0000
+++ output/test_BioSQL_SeqIO    2008-11-12 13:06:19.000000000 +0000
@@ -1,3 +1,4 @@
+test_BioSQL_SeqIO
 Connecting to database
 Removing existing sub-database 'biosql-seqio-test' (if exists)
 (Re)creating empty sub-database 'biosql-seqio-test'
[cymon at chara Tests]$ python run_tests.py test_BioSQL_SeqIO.py
test_BioSQL_SeqIO ... ok

----------------------------------------------------------------------
Ran 1 test in 15.928s

OK
[cymon at chara Tests]$ python run_tests.py test_BioSQL.py
test_BioSQL ... ok

----------------------------------------------------------------------
Ran 1 test in 25.255s

OK

