[Bioperl-l] Parsing FASTA m10 output

Sat Apr 21 20:09:48 UTC 2007

Ioannis's fastm10_to_table script is available in the bugzilla  
enhancement request (as an attachment) if anyone's interested:

http://bugzilla.open-bio.org/show_bug.cgi?id=2278

I haven't had a chance to really look into m10 output yet but it  
looks easy enough to parse; may not be hard to get something SearchIO- 
based up and running.

chris

On Apr 21, 2007, at 12:44 PM, Jason Stajich wrote:

> We don't have one yet. This is a new format introduced in the most
> recent release of FASTA.  Hopefully someone can make some time to add
> some code to SearchIO::fasta for it.
>
> I do find that I when I need a fast FASTA to TAB converter that the
> simple script (fastam9_to_table) is more efficient that SearchIO
> framework so Ioannis is making a parallel one for the new m10
> output.  So I think having both is useful.
>
> -jason
> On Apr 21, 2007, at 10:14 AM, Hilmar Lapp wrote:
>
>> I haven't kept track of this - did this go anywhere? Do we not have
>> an -m10 fasta output parser in SearchIO? (I.e., my first thought
>> would be that that would be the desired solution; am I misled in
>> this?)
>>
>> 	-hilmar
>>
>> On Apr 19, 2007, at 10:06 AM, Ioannis Kirmitzoglou wrote:
>>
>>> I have reported it as a bug on the bugzilla but due to bugzilla
>>> problems I
>>> was not able to attach my code and/or sample m10 files.
>>> Nevertheless here is the code that converts an m10 fasta output to
>>> an m8
>>> BLAST output which is parseable by the vast majority of software.
>>>
>>> <----------- CODE BEGINS HERE ------------------->
>>>
>>> #!/usr/bin/perl -w
>>>
>>> =head1 NAME
>>>
>>> fastam10_to_table  - turn FASTA -m 10 output into NCBI -m 8 tabular
>>> output
>>>
>>> =head1 SYNOPSIS
>>>
>>>  fastam10_to_table [--header] [-o outfile] inputfile1 inputfile2 ...
>>>
>>> =head1 DESCRIPTION
>>>
>>> Command line options:
>>>   --header                -- boolean flag to print column header
>>>   -o/--out                -- optional outputfile to write data,
>>>                              otherwise will write to STDOUT
>>>   -h/--help               -- show this documentation
>>>
>>> Not technically a SearchIO script as this doesn't use any Bioperl
>>> components but is a useful and fast.  The output is tabular output
>>> with the standard NCBI -m8 columns.
>>>
>>>  queryname
>>>  hit name
>>>  percent identity
>>>  alignment length
>>>  number mismatches
>>>  number gaps
>>>  query start  (if on rev-strand start > end)
>>>  query end
>>>  hit start (if on rev-strand start > end)
>>>  hit end
>>>  evalue
>>>  bit score
>>>
>>> Additionally 4 more columns are provided
>>>  percent similar
>>>  query length
>>>  hit length
>>>  query gaps
>>>  hit gaps
>>>
>>> =head1 AUTHOR - Ioannis Kirmitzoglou
>>>
>>> Ioannis Kirmitzoglou IoannisKirmitzoglou_at_gmail-dot-org
>>>
>>> =head1 ACKNOWLEDGMENTS - Ioannis Kirmitzoglou
>>>
>>> Headers as well as portions of code were taken
>>>> from fastam9_to_table.pl by Jason Stajich
>>>
>>> =head1 DISCLAIMER
>>>
>>> Copyright (c) <2007> <Ioannis Kirmitzolgou>
>>>
>>> Permission to use, copy, modify, merge, publish and distribute
>>> this software and its documentation, with or without modification,
>>> for any purpose, and without fee or royalty to the copyright holder
>>> (s)
>>> is hereby granted with no restictions and/or prerequisites.
>>>
>>> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>>> EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
>>> MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
>>> NONINFRINGEMENT.
>>> IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
>>> CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
>>> TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
>>> SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
>>>
>>> =cut
>>>
>>> use strict;
>>> use Getopt::Long;
>>>
>>> my %data=();
>>>
>>> my $outfile=''; my $header='';
>>> GetOptions(
>>>     'header'              => \$header,
>>>     'o|out|outfile:s'     => \$outfile,
>>>     'h|help'              => sub { exec('perldoc',$0); exit; }
>>>        );
>>>
>>> my $outfh;
>>> if( $outfile ) {
>>>     open($outfh, ">$outfile") || die("$outfile: $!");
>>> } else {
>>>     $outfh = \*STDOUT;
>>> }
>>>
>>>
>>> $/="\n>>>";
>>>
>>> my @fields = qw(qname hname percid alen mmcount gapcount
>>>         qstart qend hstart hend evalue bits percsim qlen hlen qgap
>>> hgap);
>>>
>>> print $outfh "#",uc(join("", map{ sprintf("%-10s",$_) } @fields)),
>>> "\n" if
>>> $header;
>>>
>>> while (<>) {
>>>
>>>         chomp;
>>>         if ($_=~/^>/ || $_=~/^\#/) {next;}
>>>         my @hits = split(/\d+>>/, $_);
>>>         @hits= split("\n>>", $hits[0]);
>>>
>>>         my $hit = shift @hits;
>>>
>>>         ($data{'qname'}, $data{'qlen'} ) = ($hit=~ (/(\S+)\,\s(\d
>>> +)/));
>>>
>>>         foreach my $align (@hits) {
>>>
>>>             my @details= split ("\n>", $align);
>>>            my $detail = shift @details;
>>>             ($data{'hname'}) = ($detail =~ (/^(\S+)\s/));
>>>             $detail=~ /\;\s(?:fa|sw)\_bits\:\s+(\S+)/;
>>>             $data{'bits'}=$1;
>>>             $detail=~ /\;\s(?:fa|sw)\_expect\:\s+(\S+)/;
>>>             $data{'evalue'}=$1;
>>>
>>>             my $term = quotemeta("; sw_score");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'score'}=$1;
>>>
>>>             $term = quotemeta("; sw_ident:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'percid'}=$1;
>>>
>>>             $term = quotemeta("; sw_sim:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'percsim'}=$1;
>>>
>>>             $term = quotemeta("; sw_overlap:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'alen'}=$1;
>>>
>>>             $detail = shift @details;
>>>
>>>             $term = quotemeta("; al_start:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'qstart'}=$1;
>>>
>>>             $term = quotemeta("; al_stop:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'qend'}=$1;
>>>
>>>             $term = quotemeta("; al_display_start:");
>>>             $term =~ s/\\ /\\s/;
>>>             my $lakis ='';
>>>             $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
>>>
>>>             $data{'qgap'}=($1 =~ tr/\-//);
>>>
>>>             $detail = shift @details;
>>>
>>>             $term = quotemeta("; sq_len:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'hlen'}=$1;
>>>
>>>             $term = quotemeta("; al_start:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'hstart'}=$1;
>>>
>>>             $term = quotemeta("; al_stop:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term\s+(\S+)/;
>>>             $data{'hend'}=$1;
>>>
>>>             $term = quotemeta("; al_display_start:");
>>>             $term =~ s/\\ /\\s/;
>>>             $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
>>>             $data{'hgap'}=($1 =~ tr/-//);
>>>             $data{'gapcount'} = $data{'qgap'} + $data{'hgap'};
>>>             $data{'mmcount'} = $data{'alen'} - ( int($data
>>> {'percid'} *
>>> $data{'alen'}) + $data{'gapcount'});
>>>
>>> for ( $data{'percid'}, $data{'percsim'} ) {
>>>     $_ = sprintf("%.2f",$_*100);
>>> }
>>>
>>>             print $outfh join( "\t",map { $data{$_} } @fields),"\n"
>>>         }
>>>
>>> }
>>>
>>> <----------------- CODE ENDS HERE ---------------------->
>>>
>>> -- 
>>>
>>> *Ioannis Kirmitzoglou*, MSc
>>> PhD. Student,
>>> Bioinformatics Research Laboratory
>>> Department of Biological Sciences
>>> University of Cyprus
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> -- 
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================
>>
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> --
> Jason Stajich
> jason at bioperl.org
> http://jason.open-bio.org/
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign