[Bioperl-l] Parsing FASTA m10 output
Jason Stajich
jason at bioperl.org
Sun Apr 22 20:24:23 UTC 2007
I do think that m9 is pretty compact if you don't need to see the
alignment and just want the pairwise statistics and is analogous to
BLAST m8/9 format. I typically just use that + fastam9_to_table for
input to MCL and other systems that can process tabular formats.
I cleaned up a few things in SearchIO::fasta but have not been able
to see whether we can auto-detect m10 format and insert the necessary
code just yet.
-jason
On Apr 22, 2007, at 10:11 AM, Ioannis Kirmitzoglou wrote:
> I agree with Jason. Both scripts (fastam9_to_table and
> fastam10_to_table)
> are way faster and easier to use than the searchIO. Still, there
> are a lot
> of cases where searchIO support for m10 would be useful (e.g when
> trying to
> represent the alignment in a graphical way).
> Nevertheless I do think that FASTA needs an output similar to the
> BLAST m8
> one which is really compact. Although I haven't tried it yet I do
> believe
> that both scripts can be piped, so one easy and rather fast way to
> produce
> an tabular output from FASTA would be to pipe its output directly
> to one of
> the scripts.
> --
>
> *Ioannis Kirmitzoglou*, MSc
> PhD. Student,
> Bioinformatics Research Laboratory
> Department of Biological Sciences
> University of Cyprus
>
>
>
> On 21/04/07, Chris Fields <cjfields at uiuc.edu> wrote:
>>
>> Ioannis's fastm10_to_table script is available in the bugzilla
>> enhancement request (as an attachment) if anyone's interested:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2278
>>
>> I haven't had a chance to really look into m10 output yet but it
>> looks easy enough to parse; may not be hard to get something
>> SearchIO-
>> based up and running.
>>
>> chris
>>
>> On Apr 21, 2007, at 12:44 PM, Jason Stajich wrote:
>>
>> > We don't have one yet. This is a new format introduced in the most
>> > recent release of FASTA. Hopefully someone can make some time
>> to add
>> > some code to SearchIO::fasta for it.
>> >
>> > I do find that I when I need a fast FASTA to TAB converter that the
>> > simple script (fastam9_to_table) is more efficient that SearchIO
>> > framework so Ioannis is making a parallel one for the new m10
>> > output. So I think having both is useful.
>> >
>> > -jason
>> > On Apr 21, 2007, at 10:14 AM, Hilmar Lapp wrote:
>> >
>> >> I haven't kept track of this - did this go anywhere? Do we not
>> have
>> >> an -m10 fasta output parser in SearchIO? (I.e., my first thought
>> >> would be that that would be the desired solution; am I misled in
>> >> this?)
>> >>
>> >> -hilmar
>> >>
>> >> On Apr 19, 2007, at 10:06 AM, Ioannis Kirmitzoglou wrote:
>> >>
>> >>> I have reported it as a bug on the bugzilla but due to bugzilla
>> >>> problems I
>> >>> was not able to attach my code and/or sample m10 files.
>> >>> Nevertheless here is the code that converts an m10 fasta
>> output to
>> >>> an m8
>> >>> BLAST output which is parseable by the vast majority of software.
>> >>>
>> >>> <----------- CODE BEGINS HERE ------------------->
>> >>>
>> >>> #!/usr/bin/perl -w
>> >>>
>> >>> =head1 NAME
>> >>>
>> >>> fastam10_to_table - turn FASTA -m 10 output into NCBI -m 8
>> tabular
>> >>> output
>> >>>
>> >>> =head1 SYNOPSIS
>> >>>
>> >>> fastam10_to_table [--header] [-o outfile] inputfile1
>> inputfile2 ...
>> >>>
>> >>> =head1 DESCRIPTION
>> >>>
>> >>> Command line options:
>> >>> --header -- boolean flag to print column header
>> >>> -o/--out -- optional outputfile to write data,
>> >>> otherwise will write to STDOUT
>> >>> -h/--help -- show this documentation
>> >>>
>> >>> Not technically a SearchIO script as this doesn't use any Bioperl
>> >>> components but is a useful and fast. The output is tabular
>> output
>> >>> with the standard NCBI -m8 columns.
>> >>>
>> >>> queryname
>> >>> hit name
>> >>> percent identity
>> >>> alignment length
>> >>> number mismatches
>> >>> number gaps
>> >>> query start (if on rev-strand start > end)
>> >>> query end
>> >>> hit start (if on rev-strand start > end)
>> >>> hit end
>> >>> evalue
>> >>> bit score
>> >>>
>> >>> Additionally 4 more columns are provided
>> >>> percent similar
>> >>> query length
>> >>> hit length
>> >>> query gaps
>> >>> hit gaps
>> >>>
>> >>> =head1 AUTHOR - Ioannis Kirmitzoglou
>> >>>
>> >>> Ioannis Kirmitzoglou IoannisKirmitzoglou_at_gmail-dot-org
>> >>>
>> >>> =head1 ACKNOWLEDGMENTS - Ioannis Kirmitzoglou
>> >>>
>> >>> Headers as well as portions of code were taken
>> >>>> from fastam9_to_table.pl by Jason Stajich
>> >>>
>> >>> =head1 DISCLAIMER
>> >>>
>> >>> Copyright (c) <2007> <Ioannis Kirmitzolgou>
>> >>>
>> >>> Permission to use, copy, modify, merge, publish and distribute
>> >>> this software and its documentation, with or without
>> modification,
>> >>> for any purpose, and without fee or royalty to the copyright
>> holder
>> >>> (s)
>> >>> is hereby granted with no restictions and/or prerequisites.
>> >>>
>> >>> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>> >>> EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
>> WARRANTIES OF
>> >>> MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
>> >>> NONINFRINGEMENT.
>> >>> IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE
>> FOR ANY
>> >>> CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
>> CONTRACT,
>> >>> TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
>> >>> SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
>> >>>
>> >>> =cut
>> >>>
>> >>> use strict;
>> >>> use Getopt::Long;
>> >>>
>> >>> my %data=();
>> >>>
>> >>> my $outfile=''; my $header='';
>> >>> GetOptions(
>> >>> 'header' => \$header,
>> >>> 'o|out|outfile:s' => \$outfile,
>> >>> 'h|help' => sub { exec('perldoc',$0); exit; }
>> >>> );
>> >>>
>> >>> my $outfh;
>> >>> if( $outfile ) {
>> >>> open($outfh, ">$outfile") || die("$outfile: $!");
>> >>> } else {
>> >>> $outfh = \*STDOUT;
>> >>> }
>> >>>
>> >>>
>> >>> $/="\n>>>";
>> >>>
>> >>> my @fields = qw(qname hname percid alen mmcount gapcount
>> >>> qstart qend hstart hend evalue bits percsim qlen hlen
>> qgap
>> >>> hgap);
>> >>>
>> >>> print $outfh "#",uc(join("", map{ sprintf("%-10s",$_) }
>> @fields)),
>> >>> "\n" if
>> >>> $header;
>> >>>
>> >>> while (<>) {
>> >>>
>> >>> chomp;
>> >>> if ($_=~/^>/ || $_=~/^\#/) {next;}
>> >>> my @hits = split(/\d+>>/, $_);
>> >>> @hits= split("\n>>", $hits[0]);
>> >>>
>> >>> my $hit = shift @hits;
>> >>>
>> >>> ($data{'qname'}, $data{'qlen'} ) = ($hit=~ (/(\S+)\,\s(\d
>> >>> +)/));
>> >>>
>> >>> foreach my $align (@hits) {
>> >>>
>> >>> my @details= split ("\n>", $align);
>> >>> my $detail = shift @details;
>> >>> ($data{'hname'}) = ($detail =~ (/^(\S+)\s/));
>> >>> $detail=~ /\;\s(?:fa|sw)\_bits\:\s+(\S+)/;
>> >>> $data{'bits'}=$1;
>> >>> $detail=~ /\;\s(?:fa|sw)\_expect\:\s+(\S+)/;
>> >>> $data{'evalue'}=$1;
>> >>>
>> >>> my $term = quotemeta("; sw_score");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'score'}=$1;
>> >>>
>> >>> $term = quotemeta("; sw_ident:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'percid'}=$1;
>> >>>
>> >>> $term = quotemeta("; sw_sim:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'percsim'}=$1;
>> >>>
>> >>> $term = quotemeta("; sw_overlap:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'alen'}=$1;
>> >>>
>> >>> $detail = shift @details;
>> >>>
>> >>> $term = quotemeta("; al_start:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'qstart'}=$1;
>> >>>
>> >>> $term = quotemeta("; al_stop:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'qend'}=$1;
>> >>>
>> >>> $term = quotemeta("; al_display_start:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> my $lakis ='';
>> >>> $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
>> >>>
>> >>> $data{'qgap'}=($1 =~ tr/\-//);
>> >>>
>> >>> $detail = shift @details;
>> >>>
>> >>> $term = quotemeta("; sq_len:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'hlen'}=$1;
>> >>>
>> >>> $term = quotemeta("; al_start:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'hstart'}=$1;
>> >>>
>> >>> $term = quotemeta("; al_stop:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term\s+(\S+)/;
>> >>> $data{'hend'}=$1;
>> >>>
>> >>> $term = quotemeta("; al_display_start:");
>> >>> $term =~ s/\\ /\\s/;
>> >>> $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
>> >>> $data{'hgap'}=($1 =~ tr/-//);
>> >>> $data{'gapcount'} = $data{'qgap'} + $data{'hgap'};
>> >>> $data{'mmcount'} = $data{'alen'} - ( int($data
>> >>> {'percid'} *
>> >>> $data{'alen'}) + $data{'gapcount'});
>> >>>
>> >>> for ( $data{'percid'}, $data{'percsim'} ) {
>> >>> $_ = sprintf("%.2f",$_*100);
>> >>> }
>> >>>
>> >>> print $outfh join( "\t",map { $data{$_} }
>> @fields),"\n"
>> >>> }
>> >>>
>> >>> }
>> >>>
>> >>> <----------------- CODE ENDS HERE ---------------------->
>> >>>
>> >>> --
>> >>>
>> >>> *Ioannis Kirmitzoglou*, MSc
>> >>> PhD. Student,
>> >>> Bioinformatics Research Laboratory
>> >>> Department of Biological Sciences
>> >>> University of Cyprus
>> >>> _______________________________________________
>> >>> Bioperl-l mailing list
>> >>> Bioperl-l at lists.open-bio.org
>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >>
>> >> --
>> >> ===========================================================
>> >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>> >> ===========================================================
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Bioperl-l mailing list
>> >> Bioperl-l at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >
>> > --
>> > Jason Stajich
>> > jason at bioperl.org
>> > http://jason.open-bio.org/
>> >
>> >
>> > _______________________________________________
>> > Bioperl-l mailing list
>> > Bioperl-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>>
>>
>>
--
Jason Stajich
jason at bioperl.org
http://jason.open-bio.org/
More information about the Bioperl-l
mailing list