[Bioperl-l] Other: Script for editing alignments?

Thu Aug 12 18:37:11 UTC 2010

Hi Si -

This is pretty straightforward with Bioperl. Here's one solution:

#!/usr/bin/perl -w
use strict;
use Bio::AlignIO;
my $in = Bio::AlignIO->new(-format => 'fasta', -file => shift @ARGV);
my $out = Bio::AlignIO->new(-format => 'fasta');

while( my $aln = $in->next_aln ) {
  for my $seq ( $aln->each_seq ) {
  my $str = $seq->seq;
  if( $str =~ /^(-+)/ ) {
     my $rep = length($1);
# replace from the 5' end
     substr($str,0,$rep,'N'x$rep);
  }
  if( $str =~ /(-+)$/ ) {
    my $rep = length($1);
   # replace from the 3' end
    substr($str,-1 * $rep,length($str),'N'x$rep);
  }
     $seq->seq($str);
  }
  # don't print the /start-end info in the FASTA ID
  $aln->set_displayname_flat(1);
  $out->write_aln($aln);
}

-jason

evoldir at evol.biology.mcmaster.ca wrote, On 8/11/10 11:18 PM:
> Dear All
>
> Alignment programs like MUSCLE and Clustal often output alignments with
> "-" symbols indicating indels (real events) within sequence alignments,
> but also "-" symbols at the 5' and 3' ends of sequences. The latter
> however, are not real evolutionary events and really should be Ns
> (missing data), depending on the sort of analytical framework you use.
>
> If there is sufficient heterogeneity and signal within the 5' and 3'
> ends of sequences, the "-"s can be manually edited in a text editor to
> Ns with no problem, if the alignment is small. If it is large (e.g. 2000
> seqs), or there are lots of alignments, it becomes a lengthy task.
>
> I'm investigating such alignments presently and so was wondering if
> anyone had a clever way of implementing sed, or had a Perl script that
> would perform such a task. Simply put, it would require replacing the 5'
> and 3' "-" below only with Ns and leaving the within sequence "-"s
> alone. The sequences naturally may span more than one line.
>
>   >Taxon 1
> -----ATGCTG--TGACTG----TGACT---
>   >Taxon 2
> ---GTATGTTG--TGACTGCT--TGACCGTC
>
> to
>
>   >Taxon 1
> NNNNNATGCTG--TGACTG----TGACTNNN
>   >Taxon 2
> NNNGTATGTTG--TGACTGCT--TGACCGTC
>
> It's a simple task, but I haven't seen any scripts out there to do the job.
>
> If there are any scripters out there who can help, or if someone knows
> of an application that would help, it would be great to hear from you.
>
> With best wishes and thanks
>
> Si Creer
>
>