[Bioperl-l] Fix for Bug #3376 broke somewhere else

Sat Mar 2 17:28:15 UTC 2013

Hi Francisco,

Nice catch. Please submit a new bug report for this and reference bug
3376. Please provide a minimal hmmer output file, a script and the
expected output. Then, I'll look into it and fix the bug.

Thank you,

Paul

Paul Cantalupo
University of Pittsburgh

On Thu, Feb 28, 2013 at 10:36 AM, Francisco J. Ossandón
<fossandonc at hotmail.com> wrote:
> Hi,
> I was re-checking Bug #3302 using the Bio::SearchIO modules of the
> repository and found that it can no longer parse a Hmmer2 file that
> previously parsed fine. After tracking the problem, I discovered that a
> change to a regular expression, made to fix another bug, broke the parsing.
>
> The fix for Bug #3376 added an extra condition to skip lines where the
> end-of-domain indicator is split across lines
> (https://redmine.open-bio.org/issues/3376):
> TEST: domain 1 of 1, from 8 to 97: score 184.7, E = 2.5e-56
>                    *->svfqqqqssksttgstvtAiAiAigYRYRYRAvtWnsGsLssGvnDn
>                       sv+qqqq+  +    +vtAiAiAigYRYRYRAv Wn GsLs G nDn
>         Test     8    SVYQQQQGGSA----MVTAIAIAIGYRYRYRAVVWNKGSLSTGTNDN 50
>
>                    DnDqqsdgLYtiYYsvtvpssslpsqtviHHHaHkasstkiiikiePr<-
>                    DnDq +d LYtiYYsvtv +ss+p q+v+HHHaH+asstkiiiki P
>         Test    51 DNDQAAD-LYTIYYSVTVSASSWPGQSVTHHHAHPASSTKIIIKIAPS   97
>
>                    *
>
>         Test     -   -
> This case is characterized by the 2 dashes in the line...
>
> This is the expression added to ‘next_result’ in hmmer2.pm
> (https://github.com/bioperl/bioperl-live/commit/142e5d79e3a6593db32bf0af99048f47d01bd3f2):
>                         elsif (CORE::length($_) == 0
>                             || ( $count != 1 && /^\s+$/o )
>                             || /^\s+\-?\*\s*$/
>                             || /^.+\-\s+\-\s*$/ ) ### <--- This regex was designed for bug 3376
>                         {
>                             next;
>                         }
>
> But the expression is too broad because of the "^.+" just before the 2
> dashes, and it broke the parsing of lines like these, which are full of
> dashes:
>                    KyACrqCdtiVQAPaPakpIErGiptaGLLArvlVSKyaEHlPLYRQsEI
>
>   lcl|gi|340     - -------------------------------------------------- -
>
>                    yaRqGVeiaRstLadWVgrtgarLaPLvdALaeyVLkeGklHADeTPVqV
>                          +i  s L   V++ + r
>   lcl|gi|340 60938 ------AIMISGLIHGVSARCLRF-------------------------- 60955
>
> I think a reasonable fix that still fixes the original bug and restores the
> function for this case is to add an extra \s+ in the regex just before the
> first dash, so the expression makes sure that the first dash is the one that
> comes AFTER the description (replacing the usual coordinate number)
> and is not the last dash of an alignment or of a series of dashes like the
> one above:
>                         elsif (CORE::length($_) == 0
>                             || ( $count != 1 && /^\s+$/o )
>                             || /^\s+\-?\*\s*$/
>                             || /^.+\s+\-\s+\-\s*$/ ) ### <--- Tweaked regex
>                         {
>                             next;
>                         }
> I tested it and it works fine, hope you find the fix acceptable.
>
> Cheers,
>
> --
> Francisco J. Ossandon
> Bioinformatician.
> Ph.D. Candidate, University Andres Bello.
> Center for Bioinformatics and Genome Biology,
> Fundacion Ciencia para la Vida.
> Santiago, Chile.
> www.cienciavida.cl/CBGB.htm
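
A minimal standalone sketch of the regex behaviour described in the quoted message. It assumes the two sample lines can be reconstructed from the HMMER2 excerpts above with the "> " quote markers stripped, and that the dash run is 50 characters wide. The variable names are illustrative and do not come from hmmer2.pm; only the two patterns are copied from the message.

```perl
use strict;
use warnings;

# Sample lines reconstructed from the quoted excerpts (assumption: quote
# markers removed, dash run of 50 characters).
my $split_end_line = "        Test     -   -";
my $dash_alignment = "  lcl|gi|340     - " . ( "-" x 50 ) . " -";

# Pattern added for bug 3376, and the proposed variant that requires
# whitespace before the first dash.
my $original = qr/^.+\-\s+\-\s*$/;
my $tweaked  = qr/^.+\s+\-\s+\-\s*$/;

# "skip" means the parser's elsif branch would call next on that line.
for my $case ( [ 'split end-of-domain line', $split_end_line ],
               [ 'all-dash alignment line',  $dash_alignment ] ) {
    my ( $label, $line ) = @$case;
    printf "%-26s original=%s tweaked=%s\n", $label,
        ( $line =~ $original ? 'skip' : 'keep' ),
        ( $line =~ $tweaked  ? 'skip' : 'keep' );
}
```

Under these assumptions, both patterns skip the split end-of-domain line, while only the original pattern skips the all-dash alignment line. This covers only the two lines shown and is not a substitute for running the Bio::SearchIO test suite.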