[Bioperl-l] Bio::AlignIO::Mase

Jason Stajich jason.stajich at gmail.com
Wed Jun 8 15:27:48 UTC 2011


It is a good question what it should keep. This could be too much nanny-state bioinformatics... =)

It is mainly to get rid of internal whitespace. Some of the alphabets supported by the Bio::Seq objects could balk at non-standard symbols.

It is perhaps also meant to sanitize the data before building the object, so that it can be written back out to other formats that cannot support non-alphabetic symbols.

It is critical to remove the whitespace, though, since the sequence object should hold only sequence data and everything has to stay aligned.
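
A rough sketch of how the options discussed in this thread differ. The input line and the variable names are made up for illustration and are not taken from mase.pm; only the two character classes come from the patch quoted below.

```perl
use strict;
use warnings;

# Hypothetical MASE-style sequence line: residues, a gap, a genewise-style
# frameshift digit, a '*', internal whitespace and a trailing newline.
my $line = "ACG T-2A*.c\n";

# Filter currently in the parser: keep only letters, '.' and '-'.
(my $strict = $line) =~ s/[^A-Za-z.\-]//g;

# Filter from the proposed patch: keep digits as well.
(my $digits = $line) =~ s/[^A-Za-z0-9.\-]//g;

# Alternative: strip whitespace only and leave every other symbol alone.
(my $ws_only = $line) =~ s/\s+//g;

# chomp alone removes just the trailing newline; the internal space stays.
chomp(my $chomped = $line);

print "strict : $strict\n";    # ACGT-A.c
print "digits : $digits\n";    # ACGT-2A.c
print "ws_only: $ws_only\n";   # ACGT-2A*.c
print "chomped: [$chomped]\n"; # [ACG T-2A*.c]
```

Note that the chomp-only variant leaves the internal space in place, which would shift the columns of the alignment.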

Jason
On Jun 8, 2011, at 10:05 AM, Tristan Lefebure wrote:

> Thanks all for your answers. I didn't know about [^] (always 
> something new to learn with Perl...).
> 
> Yes, the point is to keep numerical characters (like 
> frameshift) in the MASE parser (and others actually, the 
> FASTA and NEXUS don't try to strip anything AFAIK).
> 
> Maybe I should rephrase it this way: should any parser try 
> to strip anything from the sequence string? I actually 
> wonder what is the original purpose of the line 
> 	$entry =~ s/[^A-Za-z\.\-]//g;
> in the MASE parser. Couldn't we just replace it with:
> 	chomp $entry;
> 
> If I do so and run the test I get:
> 
> [tristan at picodon bioperl-live] perl t/AlignIO/mase.t
> 1..3
> ok 1 - use Bio::AlignIO::mase;
> ok 2 - The object isa Bio::Align::AlignI
> ok 3 - mase input test 
> 
> (Chris, how do you run this:
> 	./Build test --test-files t/AlignIO/mase.t --verbose
> 
> The only thing I manage to do is:
> 
> [tristan at picodon bioperl-live] ./Build.PL test --test-files 
> t/AlignIO/mase.t --verbose
> Too early to specify a build action 'test'.  Do 'Build test' 
> instead.
> )
> 
> --
> Tristan
> 
> On Wednesday 08 June 2011 16:27:44 Jason Stajich wrote:
>> Hi Tristan -
>> 
>> This regular expression is to strip everything that
>> isn't a letter, . or -; the [^] means match everything
>> EXCEPT what follows.  I guess if numeric values are
>> valid in these types of alignments you would just add \d
>> (instead of 0-9)
>> 
>> So you are asking for the parser to not strip out
>> frameshift info from a MASE parser?
>> 
>> This doesn't have anything to do with the chunk pattern
>> or size set with $/ AFAIK.
>> 
>> On Jun 8, 2011, at 7:45 AM, Tristan Lefebure wrote:
>>> Hi there,
>>> 
>>> I have some weird alignments with some numerical code
>>> stored within the sequence strings (eg. frameshift
>>> genewise code). Most AlignIO modules I have tried eat
>>> them without any trouble except for
>>> Bio::AlignIO::Mase.
>>> 
>>> The following patch seems to do the trick:
>>> 
>>> diff -u mase.pm mase_mod.pm
>>> --- mase.pm     2011-06-08 14:08:58.558033996 +0200
>>> +++ mase_mod.pm 2011-06-08 14:09:20.388066014 +0200
>>> @@ -109,7 +109,7 @@
>>> 
>>>       while( $entry = $self->_readline) {
>>> 
>>>           $entry =~ /^;/ && last;
>>> 
>>> -           $entry =~ s/[^A-Za-z\.\-]//g;
>>> +           $entry =~ s/[^A-Za-z0-9\.\-]//g;
>>> 
>>>           $seq .= $entry;
>>> 
>>>       }
>>>       if( $end == -1) {
>>> 
>>> But I am left with the feeling that I don't really
>>> understand why this works (which I don't quite like
>>> before pushing a patch...)
>>> 
>>> Why do a s///g instead of a simple m//, and why
>>> use '/[^' and not '/^['... Is that linked to the
>>> fact that $/ was modified to read chunks of files? BTW
>>> where is $/ set? I searched in Bio::Root::IO but
>>> didn't find it...
>>> 
>>> Oh so many questions...
>>> 
>>> Thanks!
>>> 
>>> --
>>> Tristan
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l