[Bioperl-l] Parsing FASTA m10 output

Ioannis Kirmitzoglou ioanniskirmitzoglou at gmail.com
Thu Apr 19 14:06:06 UTC 2007


I have reported it as a bug on the bugzilla but due to bugzilla problems I
was not able to attach my code and/or sample m10 files.
Nevertheless here is the code that converts an m10 fasta output to an m8
BLAST output which is parseable by the vast majority of software.

<----------- CODE BEGINS HERE ------------------->

#!/usr/bin/perl -w

=head1 NAME

fastam10_to_table  - turn FASTA -m 10 output into NCBI -m 8 tabular output

=head1 SYNOPSIS

 fastam10_to_table [--header] [-o outfile] inputfile1 inputfile2 ...

=head1 DESCRIPTION

Command line options:
  --header                -- boolean flag to print column header
  -o/--out                -- optional outputfile to write data,
                             otherwise will write to STDOUT
  -h/--help               -- show this documentation

Not technically a SearchIO script as this doesn't use any Bioperl
components but is a useful and fast.  The output is tabular output
with the standard NCBI -m8 columns.

 queryname
 hit name
 percent identity
 alignment length
 number mismatches
 number gaps
 query start  (if on rev-strand start > end)
 query end
 hit start (if on rev-strand start > end)
 hit end
 evalue
 bit score

Additionally 4 more columns are provided
 percent similar
 query length
 hit length
 query gaps
 hit gaps

=head1 AUTHOR - Ioannis Kirmitzoglou

Ioannis Kirmitzoglou IoannisKirmitzoglou_at_gmail-dot-org

=head1 ACKNOWLEDGMENTS - Ioannis Kirmitzoglou

Headers as well as portions of code were taken
from fastam9_to_table.pl by Jason Stajich

=head1 DISCLAIMER

Copyright (c) <2007> <Ioannis Kirmitzolgou>

Permission to use, copy, modify, merge, publish and distribute
this software and its documentation, with or without modification,
for any purpose, and without fee or royalty to the copyright holder(s)
is hereby granted with no restictions and/or prerequisites.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

=cut

use strict;
use Getopt::Long;

my %data=();

my $outfile=''; my $header='';
GetOptions(
    'header'              => \$header,
    'o|out|outfile:s'     => \$outfile,
    'h|help'              => sub { exec('perldoc',$0); exit; }
       );

my $outfh;
if( $outfile ) {
    open($outfh, ">$outfile") || die("$outfile: $!");
} else {
    $outfh = \*STDOUT;
}


$/="\n>>>";

my @fields = qw(qname hname percid alen mmcount gapcount
        qstart qend hstart hend evalue bits percsim qlen hlen qgap hgap);

print $outfh "#",uc(join("", map{ sprintf("%-10s",$_) } @fields)), "\n" if
$header;

while (<>) {

        chomp;
        if ($_=~/^>/ || $_=~/^\#/) {next;}
        my @hits = split(/\d+>>/, $_);
        @hits= split("\n>>", $hits[0]);

        my $hit = shift @hits;

        ($data{'qname'}, $data{'qlen'} ) = ($hit=~ (/(\S+)\,\s(\d+)/));

        foreach my $align (@hits) {

            my @details= split ("\n>", $align);
           my $detail = shift @details;
            ($data{'hname'}) = ($detail =~ (/^(\S+)\s/));
            $detail=~ /\;\s(?:fa|sw)\_bits\:\s+(\S+)/;
            $data{'bits'}=$1;
            $detail=~ /\;\s(?:fa|sw)\_expect\:\s+(\S+)/;
            $data{'evalue'}=$1;

            my $term = quotemeta("; sw_score");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'score'}=$1;

            $term = quotemeta("; sw_ident:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'percid'}=$1;

            $term = quotemeta("; sw_sim:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'percsim'}=$1;

            $term = quotemeta("; sw_overlap:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'alen'}=$1;

            $detail = shift @details;

            $term = quotemeta("; al_start:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'qstart'}=$1;

            $term = quotemeta("; al_stop:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'qend'}=$1;

            $term = quotemeta("; al_display_start:");
            $term =~ s/\\ /\\s/;
            my $lakis ='';
            $detail=~ /$term.+\s\-*([\-\w\s]+)/g;

            $data{'qgap'}=($1 =~ tr/\-//);

            $detail = shift @details;

            $term = quotemeta("; sq_len:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'hlen'}=$1;

            $term = quotemeta("; al_start:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'hstart'}=$1;

            $term = quotemeta("; al_stop:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term\s+(\S+)/;
            $data{'hend'}=$1;

            $term = quotemeta("; al_display_start:");
            $term =~ s/\\ /\\s/;
            $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
            $data{'hgap'}=($1 =~ tr/-//);
            $data{'gapcount'} = $data{'qgap'} + $data{'hgap'};
            $data{'mmcount'} = $data{'alen'} - ( int($data{'percid'} *
$data{'alen'}) + $data{'gapcount'});

for ( $data{'percid'}, $data{'percsim'} ) {
    $_ = sprintf("%.2f",$_*100);
}

            print $outfh join( "\t",map { $data{$_} } @fields),"\n"
        }

}

<----------------- CODE ENDS HERE ---------------------->

-- 

*Ioannis Kirmitzoglou*, MSc
PhD. Student,
Bioinformatics Research Laboratory
Department of Biological Sciences
University of Cyprus



More information about the Bioperl-l mailing list