[Bioperl-l] Other: Script for editing alignments?
Jason Stajich
jason at bioperl.org
Thu Aug 12 18:37:11 UTC 2010
Hi Si -
This is pretty straightforward with Bioperl. Here's one solution:
#!/usr/bin/perl -w
use strict;
use Bio::AlignIO;
my $in = Bio::AlignIO->new(-format => 'fasta', -file => shift @ARGV);
my $out = Bio::AlignIO->new(-format => 'fasta');
while( my $aln = $in->next_aln ) {
for my $seq ( $aln->each_seq ) {
my $str = $seq->seq;
if( $str =~ /^(-+)/ ) {
my $rep = length($1);
# replace from the 5' end
substr($str,0,$rep,'N'x$rep);
}
if( $str =~ /(-+)$/ ) {
my $rep = length($1);
# replace from the 3' end
substr($str,-1 * $rep,length($str),'N'x$rep);
}
$seq->seq($str);
}
# don't print the /start-end info in the FASTA ID
$aln->set_displayname_flat(1);
$out->write_aln($aln);
}
-jason
evoldir at evol.biology.mcmaster.ca wrote, On 8/11/10 11:18 PM:
> Dear All
>
> Alignment programs like MUSCLE and Clustal often output alignments with
> "-" symbols indicating indels (real events) within sequence alignments,
> but also "-" symbols at the 5' and 3' ends of sequences. The latter
> however, are not real evolutionary events and really should be Ns
> (missing data), depending on the sort of analytical framework you use.
>
> If there is sufficient heterogeneity and signal within the 5' and 3'
> ends of sequences, the "-"s can be manually edited in a text editor to
> Ns with no problem, if the alignment is small. If it is large (e.g. 2000
> seqs), or there are lots of alignments, it becomes a lengthy task.
>
> I'm investigating such alignments presently and so was wondering if
> anyone had a clever way of implementing sed, or had a Perl script that
> would perform such a task. Simply put, it would require replacing the 5'
> and 3' "-" below only with Ns and leaving the within sequence "-"s
> alone. The sequences naturally may span more than one line.
>
> >Taxon 1
> -----ATGCTG--TGACTG----TGACT---
> >Taxon 2
> ---GTATGTTG--TGACTGCT--TGACCGTC
>
> to
>
> >Taxon 1
> NNNNNATGCTG--TGACTG----TGACTNNN
> >Taxon 2
> NNNGTATGTTG--TGACTGCT--TGACCGTC
>
> It's a simple task, but I haven't seen any scripts out there to do the job.
>
> If there are any scripters out there who can help, or if someone knows
> of an application that would help, it would be great to hear from you.
>
> With best wishes and thanks
>
> Si Creer
>
>
More information about the Bioperl-l
mailing list