[Biopython-dev] [Bug 2601] New: Seq find() method: proposal

Mon Sep 29 12:35:34 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2601

           Summary: Seq find() method: proposal
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: lpritc at scri.sari.ac.uk

A find() method for the Seq object was recently proposed on the mailing list. 
I have extended Seq locally to include a find method that uses the re module
and the reverse_complement function from Bio.Seq, and is described below.  In
the original implementation, the search was meant to be called from the parent
SeqRecord object, which populated itself with features describing the search
results.

I'm proposing this as a potential starting point for the implementation of a
Seq.find() method.  

Note that the loop of re.search() calls was necessary to obtain the set of
overlapping matches, as re.finditer() only returns non-overlapping matches. 
The two functions searching in forward-only and reverse-only directions could
probably be combined, and behaviour distinguished on keyword, for neater code.

####

    def find_regexes(self, pattern):
        """ find_regexes(self, pattern)

            pattern           String, regular expression to search for

            Finds all occurrences of the passed regular expression in the
            sequence, and returns a list of tuples in the format:
            (start, end, match, strand).

            If the sequence is a nucleotide sequence, the reverse strand is
            also searched
        """
        # Find forward matches
        match_locations = [(hit.start()+1, hit.end(), \
                            self.data[hit.start():hit.end()], 1) \
                           for hit in self.__find_overlapping_regexes(pattern)]
        # If the sequence is a nucleotide sequence, look on the reverse
        # strand, too
        if self.alphabet.__class__ in [Alphabet.DNAAlphabet,
                                       Alphabet.RNAAlphabet,
                                       IUPAC.ExtendedIUPACDNA,
                                       IUPAC.IUPACAmbiguousDNA,
                                       IUPAC.IUPACUnambiguousDNA,
                                       IUPAC.IUPACAmbiguousRNA,
                                       IUPAC.IUPACUnambiguousRNA]:
            rev_locations = [(hit.start()+1, hit.end(), \
                              self.data[hit.start():hit.end()], 1) \
                             for hit in \
                             self.__find_overlapping_regexes_rev(pattern)]
            match_locations += rev_locations
        match_locations.sort()
        return match_locations

    def __find_overlapping_regexes(self, pattern):
        """ Finds all overlapping regexes matching the passed pattern in the
            sequence, and returns a list of re.SRE_Match objects describing
            them.
        """
        hits = []
        pos = 0
        regex = re.compile(pattern)
        while pos < len(self.data):
            hit = regex.search(self.data, pos=pos)
            if hit is None:
                break
            hits.append(hit)
            pos = hit.start()+1
        return hits

    def __find_overlapping_regexes_rev(self, pattern):
        """ Finds all overlapping regexes matching the passed pattern in the
            sequence, and returns a list of re.SRE_Match objects describing
            them, as hits positioned in the forward direction - i.e. start and
            end read in the forward sense.
        """
        hits = []
        pos = 0
        regex = re.compile(reverse_complement(Seq(pattern, self.alphabet)))
        while pos < len(self.data):
            hit = regex.search(self.data, pos=pos)
            if hit is None:
                break
            hits.append(hit)
            pos = hit.start()+1
        return hits

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.