[Biopython-dev] [Bug 2601] Seq find() method: proposal

Mon Sep 29 16:41:10 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2601

------- Comment #3 from lpritc at scri.sari.ac.uk  2008-09-29 12:41 EST -------
Make a cup of tea, this is a long one... ;)

Peter:
>Note that any Seq.find() method should be as like the string find method as
>possible for consistency. 
Bruce:
> While I
> see the point of the reverse complement and overlapping matches, these are
> inconsistent with re module

I see your points, but I'm not /entirely/ in agreement, here.  While I think
that it is nearly always a good thing that the input arguments and returned
results match those that are expected for same-named functions in similar
classes, I think that we may still take the opportunity to implement useful
behaviour that is relevant to biological sequences where the intent doesn't
stray too far from what you'd expect for a string.  For example, the ability to
accommodate ambiguous alphabets or regular expressions - not part of
string.find() - would be useful.  I think that this approach implements
additional functionality of which the string.find() method's functionality is a
subset, and so could be implemented without breaking the apparent identical
operation of string.find() and Seq.find().  This would facilitate the use of
string-specific third-party modules that could be useful for analysis of
biological sequences, while extending functionality.

Where I begin to disagree is on whether it is always desirable to constrain the
behaviour of these functions for the sake of consistency with other modules,
while still taking time to make them behave differently at all, rather than
just implementing that exact same behaviour, and handling the
biologically-useful stuff in a different method altogether.  I like the idea of
making Seq.py more string-like, in part because when I first started using
Biopython, I missed being able to slice, and other conveniently string-y
things.

By way of contrast: 
string.find() has the behaviour of only returning a single match - that which
is closest to the string start.  This might be useful to some (in ORF-finding,
perhaps), but I expect I would use a finditer() method that returned all
matches (for which there is no equivalent string method) almost exclusively, if
available.  I expect that I could cope quite happily with find() doing
different things on pure strings and on Seq objects, but I'd be OK with a
nonstandard finditer() alongside a 100% string-compatible find() as an
alternative to this, though I'd want finditer() to return overlapping matches.

Such overlapping matches, however, do not match re.finditer() behaviour.  But,
in this case, the re method's behaviour is constrained for good reasons related
to regular expression implementation, and not reasons related to biological
good sense.  I think that there is sufficient reason not to be consistent here,
and instead to return biologically-useful overlapping matches.

The core of my argument here is that we're not just working with strings, but
with string representations of biological objects; that's exactly why we have
this specialised library, and don't just use strings in the first place.  I
think that there will be occasions when we should break some syntactic
expectations, where it is appropriate for the problem domain, and that this
*might* (note equivocation) be one of them.

Peter:
>One enhancement is that it might be worth checking
>the search string is valid against the Seq object's alphabet (see also Bug
>2597).

Good point.  In the implementation I put up here, if there are any invalid
characters then the string just won't be found, which may be overgenerous to
user error ;)  Raising a ValueError or some such to let the user know that the
search alphabet wasn't valid would be very helpful.

Peter:
> To determine if a sequence has a nucleotide alphabet, use the fact that
> any well defined nucleotide alphabet object should be a subclass of
> Bio.Alphabet.NucleotideAlphabet() rather than checking a predefined list.

Fair enough - I didn't know that NucleotideAlphabet existed... I got as far up
the hierarchy as DNAAlphabet and RNAAlphabet, and stopped at working code ;)

Peter:
> However, there is no way of knowing if the sequence is double stranded or
> single sided, so personally I don't like the way your suggested function
> automatically searches the reverse complement strand too.

It just suited my purpose at the time.  Whether or not the nucleotide sequence
is single- or double-stranded, people might still want to search for a
complementary sequence; e.g. microarray/PCR/siRNA probes, etc.  The method as
written reports the strand on which the match can be found, and the user is
free to discard results as they see fit, which again suited me at the time.  A
'strand' argument to the method of 'forward', 'reverse', or 'both', or just
assuming 'both' if not specified would be better, I agree.  

What drove my implementation above was that, while nucleotide sequence matches
may or may not be of interest in either direction, reverse matches to protein
sequences are definitely (AFAIAC) not that interesting ;)

Bruce:
>I do think that any general function involving regular expressions should
> conform to the Python re module. The reasoning follows Peter's point that a
> user should not have to convert the Seq object into a Python string. 

I don't think I understand this point.  Would you prefer an re.search() like
implementation that takes a Seq object as its query argument?  I don't think
I'd find that as useful, myself, as a method that just takes a string.  Such a
method could also maybe parse arguments so as to compile the regex from the
Seq.data attribute though, fulfilling your requirement.

I used regular expression based searching in my implementation for speed, and
strictly speaking a string is also a regular expression, even if it doesn't
have special characters - I didn't see any inconsistency there.  My docstring
is maybe a bit misleading about that but, when I wrote it, it wasn't intended
for anyone but me to use.  Sorry about that.

Also, I disagree regarding conformance to the re module, particularly as our
use of re is likely to be less general than the re module itself - see above.

> So I think it would be more valuable to implement
> specific methods from the re modules. In this case, the functions should accept
> regular expression.

I would quite like to have a 'true' regular expression search method myself,
with wildcards for nucleotide symbols, but this would have to be implemented
differently to my attempt above:  e.g., for proper reverse complement searches,
you'd have to reverse complement the wildcards as well as ambiguity codes.

> I also do not see the gain for the reverse complement because this is just
> another pattern. 

The gain was that I needed matches to my patterns of interest on the sequence
in either direction, and I only cared which strand they lay on for reasons of
locating them.  Reverse complementing the query is usually quicker than reverse
complementing the genome on which you search.  Assuming you're searching on a
genome, of course ;)

> Also it is potentially confusing because the direction is not
> immediately apparent without further computation. 

I'm not sure I understand you: in teh above code, the method returns the strand
on which the match is found, along with all the other data.  The computation
required to handle this is the same as that to find the start and end points:
parse an integer from the tuple.  I'm not intending that the return type should
be set in stone and, as I mentioned, it was just a handy step in the creation
of SeqFeatures in the parent SeqRecord.

> In this case I think that
> 'explicit is better than implicit' (The Zen of Python) so I think the decision
> to use the reverse complement must come prior to the use of this method.

In the spirit of quoted arguments from authority: "A foolish consistency is the
hobgoblin of little minds" (Python Style Guide) ;)   

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.