[EMBOSS] fuzznuc pattern expansion

Bernd Web bernd.web at gmail.com
Wed Nov 2 15:12:11 UTC 2011


Dear Peter,

Thanks! It would indeed be great to have the option to search on the
ambiguity codes directly. I would probably prefer the escape option, but
do you mean to implement both escaping and expansion to subsets?
Having both might actually be good in case a user does not know the
contents of the DNA file (i.e., which ambiguity codes are present).

It might be good to report the pattern that was actually used in the
matching. Would the (very high) speed of fuzznuc be affected by always
expanding the codes to their subsets? For example, "N" would become
"ACTGUMRWSYKVHDB". This could mean that searches with highly degenerate
patterns would include a lot of ambiguity codes.
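
To make the size of the expansion concrete, here is a minimal Python
sketch of the subset idea discussed in this thread. It is not fuzznuc's
code: the table name IUPAC and the functions subset_codes and expand are
hypothetical, U is left out for brevity, and patterns that already
contain [] or {} are not handled.

```python
# IUPAC nucleotide codes mapped to the bases they stand for (U omitted).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def subset_codes(code):
    """Return every code whose base set is contained in that of `code`."""
    target = set(IUPAC[code])
    return sorted(c for c, bases in IUPAC.items() if set(bases) <= target)


def expand(pattern):
    """Hypothetical expansion: each code becomes a class of its subset
    codes; a backslash keeps the following character literal."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: copy it through without expansion.
            out.append(pattern[i + 1])
            i += 2
            continue
        key = ch.upper()
        codes = subset_codes(key) if key in IUPAC else [ch]
        # Single-member classes need no brackets.
        out.append(codes[0] if len(codes) == 1 else "[" + "".join(codes) + "]")
        i += 1
    return "".join(out)


if __name__ == "__main__":
    for code in "SBDHVN":
        print(code, expand(code))
    print(expand("C\\S"))
```

Under these assumptions S expands to three codes, B, D, H and V to
seven each, and N to all fifteen, so a pattern made mostly of N would
carry the largest character classes.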


Kind regards,
Bernd

On Sat, Oct 29, 2011 at 7:06 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 28/10/2011 18:03, Bernd Web wrote:
>>
>> Hi
>>
>> Using fuzznuc I get illegal pattern warnings. I realize what is going on:
>>
>> "You can use ambiguity codes for nucleic acid searches but not within
>> [] or {} as they expand to bracketed counterparts. For example, "s" is
>> expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
>> illegal."
>>
>> However, what I cannot find is how to suppress this expansion. Is this
>> possible? We actually need these ambiguity codes to remain as they are
>> within [], as the input sequences can themselves contain R, Y, B, N, for
>> example. Thus, [GCS] is a pattern we actually want to be able to use.
>
> That looks a reasonable suggestion.
>
> We can replace S with [GCS] directly. For the wider ambiguity codes, we can
> replace them with the subsets:
>
> B [TGCBSYK]
> D [TGADWRK]
> H [TCAHWYM]
> V [GCAVSRM]
>
> We can also allow 'C\S' to explicitly match CS in the input sequence by
> escaping the S to skip the automatic expansion.
>
> These changes can be added to the next release.
>
> Thanks for the idea.
>
> Peter Rice
> EMBOSS Team
>
>


