[Bioperl-l] Bio::AlignIO::Mase

Wed Jun 8 15:05:33 UTC 2011

Thanks all for your answers. I didn't know about [^] (always 
something new to learn with perl...).
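
To illustrate the difference between '[^...]' (a negated character class) and '^[...]' (a class anchored at the start of the string), and between m// and s///g, here is a minimal sketch. The helper names and the sample string are made up for illustration; none of this is code from Bio::AlignIO::mase.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# [^...] is a negated class: it matches any single character NOT listed.
# In list context, m//g returns every such character found.
sub rejected_chars {
    my ($s) = @_;
    my @hits = $s =~ /[^A-Za-z.\-]/g;
    return @hits;
}

# ^[...] is an anchor followed by an ordinary class: it only tests
# whether the FIRST character of the string is in the set.
sub starts_with_allowed {
    my ($s) = @_;
    return $s =~ /^[A-Za-z.\-]/ ? 1 : 0;
}

# m// only tests and leaves the string alone; s///g rewrites it,
# here deleting every character outside the allowed set.
sub strip_disallowed {
    my ($s) = @_;
    $s =~ s/[^A-Za-z.\-]//g;
    return $s;
}

my $sample = "AC-G7 t.\n";    # made-up sequence line
print scalar(rejected_chars($sample)), " characters would be removed\n";
print "cleaned: ", strip_disallowed($sample), "\n";
```

With this sample the negated class picks out the digit, the space and the newline, which is why the same substitution also removes line endings without an explicit chomp.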

Yes, the point is to keep numerical characters (such as 
frameshift codes) in the MASE parser (and in others, actually: 
the FASTA and NEXUS parsers don't try to strip anything AFAIK).

Maybe I should rephrase it this way: should any parser try 
to strip anything from the sequence string? I actually 
wonder what the original purpose is of the line 
	$entry =~ s/[^A-Za-z\.\-]//g;
in the MASE parser. Couldn't we just replace it with:
	chomp $entry;
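
A minimal sketch of how the three options treat the same line, assuming a made-up input line and hypothetical helper names (this is not the parser's actual code path):

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# Current behaviour: keep letters, '.' and '-' only.
sub strip_original {
    my ($entry) = @_;
    $entry =~ s/[^A-Za-z\.\-]//g;
    return $entry;
}

# Patched behaviour: also keep the digits 0-9.
sub strip_patched {
    my ($entry) = @_;
    $entry =~ s/[^A-Za-z0-9\.\-]//g;
    return $entry;
}

# Proposed alternative: only drop the trailing record separator.
sub chomp_only {
    my ($entry) = @_;
    chomp $entry;
    return $entry;
}

my $line = "ACG T2-A.c\n";    # made-up line with a space and a digit
printf "%-9s [%s]\n", 'original', strip_original($line);
printf "%-9s [%s]\n", 'patched',  strip_patched($line);
printf "%-9s [%s]\n", 'chomp',    chomp_only($line);
```

One difference to keep in mind: chomp removes only a trailing record separator (whatever $/ is set to, "\n" by default), so embedded spaces, or a "\r" left over from a file with Windows line endings, would stay in the sequence string, whereas both substitutions remove them.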

If I do so and run the test I get:

[tristan at picodon bioperl-live] perl t/AlignIO/mase.t
1..3
ok 1 - use Bio::AlignIO::mase;
ok 2 - The object isa Bio::Align::AlignI
ok 3 - mase input test 

(Chris, how do you run this:
	./Build test --test-files t/AlignIO/mase.t --verbose

The only thing I have managed to do is:

[tristan at picodon bioperl-live] ./Build.PL test --test-files 
t/AlignIO/mase.t --verbose
Too early to specify a build action 'test'.  Do 'Build test' 
instead.
)

--
Tristan

On Wednesday 08 June 2011 16:27:44 Jason Stajich wrote:
> Hi Tristan -
> 
> This regular expression strips everything that isn't a
> letter, "." or "-". The [^...] means match anything
> EXCEPT the characters listed inside the brackets. I guess
> if numeric values are valid in this type of alignment
> you would just add \d (instead of 0-9)
> 
> So you are asking for the parser to not strip out
> frameshift info from a MASE parser?
> 
> This doesn't have anything to do with the chunk pattern
> or size set with $/ AFAIK.
> 
> On Jun 8, 2011, at 7:45 AM, Tristan Lefebure wrote:
> > Hi there,
> > 
> > I have some weird alignments with numerical codes
> > stored within the sequence strings (e.g. genewise
> > frameshift codes). Most AlignIO modules I have tried
> > accept them without any trouble, except for
> > Bio::AlignIO::Mase.
> > 
> > The following patch seems to do the trick:
> > 
> > diff -u mase.pm mase_mod.pm
> > --- mase.pm     2011-06-08 14:08:58.558033996 +0200
> > +++ mase_mod.pm 2011-06-08 14:09:20.388066014 +0200
> > @@ -109,7 +109,7 @@
> > 
> >        while( $entry = $self->_readline) {
> >        
> >            $entry =~ /^;/ && last;
> > 
> > -           $entry =~ s/[^A-Za-z\.\-]//g;
> > +           $entry =~ s/[^A-Za-z0-9\.\-]//g;
> > 
> >            $seq .= $entry;
> >        
> >        }
> >        if( $end == -1) {
> > 
> > But I am left with the feeling that I don't really
> > understand why this works (which I don't quite like
> > before pushing a patch...)
> > 
> > Why do a s///g instead of a simple m//, and why
> > use '/[^' and not '/^['? Is that linked to the
> > fact that $/ was modified to read chunks of files? BTW
> > where is $/ set? I searched in Bio::Root::IO but
> > didn't find it...
> > 
> > Oh so many questions...
> > 
> > Thanks!
> > 
> > --
> > Tristan
> > 
> > 
> > 
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l