[Biopython-dev] GenBank parser fails (on large files?)

Mon Oct 8 23:39:40 EDT 2001

Michel:
> [Talking about the translation]
> > Note, incidentally, that this is a bit ugly, because the \012's and spaces 
> > should have been cleaned out 

me:
> I agree with you here -- I haven't yet done any work at massaging
> the feature value information. I'll think about a good way to do
> this (I'm sure there are other cases where this also needs to be
> done), and try to get something done on it this weekend.

I finally managed to come up with a good (in my opinion, of course)
way to handle the problem of selectively cleaning up values based on
their type. Basically, what I did was add a Bio.GenBank.utils module
that has a FeatureValueCleaner class. Right now this class is quite
simple (it just deals with the translation problem mentioned), but
could be extended quite easily to deal with other special cases as
they come up.
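
To make the translation problem concrete, here is a minimal standalone
sketch of the cleanup involved. It does not import Biopython:
clean_translation is just an illustrative stand-in for the
_clean_translation method in the attached utils.py, and the raw value is
the example from that file's docstring.

```python
# Illustrative stand-in for FeatureValueCleaner._clean_translation;
# the function name is made up for this example.
def clean_translation(value):
    """Join a multi-line translation into one protein string."""
    # split() with no arguments breaks on any run of whitespace, so the
    # newline (\012) and the indentation after it both disappear
    return "".join(value.split())


raw = "MED\012 YDPWNLRFQSKYKSRDA"
print(clean_translation(raw))
```

Plain concatenation is only right for values like translations, which is
why the cleaning is selected per qualifier key rather than applied to
everything.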

You can use this class by passing it as a feature_cleaner argument
to the FeatureParser, i.e.:

from Bio import GenBank
from Bio.GenBank.utils import FeatureValueCleaner

parser = GenBank.FeatureParser(feature_cleaner =
                               FeatureValueCleaner())

Right now this is not enabled by default, but I'm definitely
open to opinions about whether or not it should be.
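
For anyone wondering what the parser does with the feature_cleaner
argument, the sketch below mimics the hook added to the consumer in the
attached patch. It is a simplified stand-in, not the real consumer code:
add_qualifier and StripTranslation are hypothetical names, and storing
each qualifier as a list of values is an assumption based on the append
call visible in the patch.

```python
def add_qualifier(qualifiers, key, value, feature_cleaner=None):
    """Made-up helper that mirrors the patched qualifier handling."""
    # the cleaner is optional; with None the value is stored untouched
    if feature_cleaner is not None:
        value = feature_cleaner.clean_value(key, value)
    # assumption: each qualifier key maps to a list of values
    qualifiers.setdefault(key, []).append(value)


class StripTranslation:
    """Hypothetical cleaner: any object with clean_value will do."""

    def clean_value(self, key_name, value):
        if key_name == "translation":
            return "".join(value.split())
        return value


quals = {}
add_qualifier(quals, "translation", "MED\012 YDPW", StripTranslation())
add_qualifier(quals, "note", "left\012 alone", StripTranslation())
print(quals)
```

The only requirement on the cleaner is a clean_value(key, value) method
that returns the (possibly modified) value, so you are not tied to the
class in utils.py.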

Michel, I'd be happy to hear if this does what you'd like it to. If
you have additional things that need cleaning up, I'd be more than
happy to accept patches against utils.py adding these things. The
utils.py module is attached, along with the patch against
__init__.py. These are also checked into CVS.
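
As a rough idea of what such an addition might look like, here is a
sketch that adds cleaning for /note values by joining the wrapped lines
with spaces. The FeatureValueCleaner below is a condensed, lightly
adapted copy of the attached class so the example runs on its own;
NoteCleaner and _clean_note are hypothetical and not part of utils.py.

```python
class FeatureValueCleaner:
    """Condensed copy of the attached class, for a standalone example."""
    keys_to_process = ["translation"]

    def __init__(self, to_process=keys_to_process):
        self._to_process = to_process

    def clean_value(self, key_name, value):
        if key_name in self._to_process:
            # look up a method named after the key, e.g. _clean_note
            try:
                cleaner = getattr(self, "_clean_%s" % key_name)
            except AttributeError:
                raise AssertionError("No function to clean key: %s"
                                     % key_name)
            value = cleaner(value)
        return value

    def _clean_translation(self, value):
        return "".join(value.split())


class NoteCleaner(FeatureValueCleaner):
    """Hypothetical extension that also tidies up /note values."""
    keys_to_process = ["translation", "note"]

    def __init__(self, to_process=keys_to_process):
        FeatureValueCleaner.__init__(self, to_process)

    def _clean_note(self, value):
        # notes are prose, so rejoin the wrapped lines with single spaces
        return " ".join(value.split())


cleaner = NoteCleaner()
print(cleaner.clean_value("note", "similar to\012 kinase"))
print(cleaner.clean_value("gene", "abc"))
```

A new case needs two things: the key in the list of keys to process, and
a matching _clean_<key> method. Keys that are not in the list come back
unchanged.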

Hope this works for you.
Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
*** __init__.py.orig	Thu Sep 27 16:00:49 2001
--- __init__.py	Mon Oct  8 23:16:50 2001
***************
*** 239,245 ****
  class FeatureParser:
      """Parse GenBank files into Seq + Feature objects.
      """
!     def __init__(self, debug_level = 0, use_fuzziness = 1):
          """Initialize a GenBank parser and Feature consumer.

          Arguments:
--- 239,246 ----
  class FeatureParser:
      """Parse GenBank files into Seq + Feature objects.
      """
!     def __init__(self, debug_level = 0, use_fuzziness = 1, 
!                  feature_cleaner = None):
          """Initialize a GenBank parser and Feature consumer.

          Arguments:
***************
*** 249,262 ****
          you can set this as high as two and see exactly where a parse fails.
          o use_fuzziness - Specify whether or not to use fuzzy representations.
          The default is 1 (use fuzziness).
          """
          self._scanner = _Scanner(debug_level)
          self.use_fuzziness = use_fuzziness

      def parse(self, handle):
          """Parse the specified handle.
          """
!         self._consumer = _FeatureConsumer(self.use_fuzziness)
          self._scanner.feed(handle, self._consumer)
          return self._consumer.data

--- 250,268 ----
          you can set this as high as two and see exactly where a parse fails.
          o use_fuzziness - Specify whether or not to use fuzzy representations.
          The default is 1 (use fuzziness).
+         o feature_cleaner - A class which will be used to clean out the
+         values of features. This class must implement the function 
+         clean_value. GenBank.utils has a "standard" cleaner class.
          """
          self._scanner = _Scanner(debug_level)
          self.use_fuzziness = use_fuzziness
+         self._cleaner = feature_cleaner

      def parse(self, handle):
          """Parse the specified handle.
          """
!         self._consumer = _FeatureConsumer(self.use_fuzziness, 
!                                           self._cleaner)
          self._scanner.feed(handle, self._consumer)
          return self._consumer.data

***************
*** 398,409 ****
      Attributes:
      o use_fuzziness - specify whether or not to parse with fuzziness in
      feature locations.
      """
!     def __init__(self, use_fuzziness):
          _BaseGenBankConsumer.__init__(self)
          self.data = SeqRecord(None, id = None)

          self._use_fuzziness = use_fuzziness

          self._seq_type = ''
          self._seq_data = []
--- 404,418 ----
      Attributes:
      o use_fuzziness - specify whether or not to parse with fuzziness in
      feature locations.
+     o feature_cleaner - a class that will be used to provide specialized
+     cleaning-up of feature values.
      """
!     def __init__(self, use_fuzziness, feature_cleaner = None):
          _BaseGenBankConsumer.__init__(self)
          self.data = SeqRecord(None, id = None)

          self._use_fuzziness = use_fuzziness
+         self._feature_cleaner = feature_cleaner

          self._seq_type = ''
          self._seq_data = []
***************
*** 856,861 ****
--- 865,872 ----
          if self._cur_qualifier_key:
              key = self._cur_qualifier_key
              value = self._cur_qualifier_value
+             if self._feature_cleaner is not None:
+                 value = self._feature_cleaner.clean_value(key, value)
              # if the qualifier name exists, append the value
              if self._cur_feature.qualifiers.has_key(key):
                  self._cur_feature.qualifiers[key].append(value)
-------------- next part --------------
"""Useful utilities for helping in parsing GenBank files.
"""
# standard library
import string

class FeatureValueCleaner:
    """Provide specialized capabilities for cleaning up values in features.

    This class is designed to provide a mechanism to clean up and process
    values in the key/value pairs of GenBank features. This is useful 
    because in cases like:

         /translation="MED
         YDPWNLRFQSKYKSRDA"

    you'll end up with a value with \012s and spaces in it like:
        "MED\012 YDPWNL..."

    which you probably don't want. 

    This cleaning needs to be done on a case by case basis since it is
    impossible to interpret whether you should be concatenating everything
    (as in translations), or combining things with spaces (as might be
    the case with /notes).
    """
    keys_to_process = ["translation"]
    def __init__(self, to_process = keys_to_process):
        """Initialize with the keys we should deal with.
        """
        self._to_process = to_process

    def clean_value(self, key_name, value):
        """Clean the specified value and return it.

        If the value is not specified to be dealt with, the original value
        will be returned.
        """
        if key_name in self._to_process:
            try:
                cleaner = getattr(self, "_clean_%s" % key_name)
                value = cleaner(value)
            except AttributeError:
                raise AssertionError("No function to clean key: %s" 
                                     % key_name)
        return value

    def _clean_translation(self, value):
        """Concatenate a translation value to one long protein string.
        """
        translation_parts = value.split()
        return string.join(translation_parts, '')