[Bioperl-l] Parsing FASTA m10 output

Ioannis Kirmitzoglou ioanniskirmitzoglou at gmail.com
Sun Apr 22 17:11:35 UTC 2007


I agree with Jason. Both scripts (fastam9_to_table and fastam10_to_table)
are way faster and easier to use than the searchIO. Still, there are a lot
of cases where searchIO support for m10 would be useful (e.g when trying to
represent the alignment in a graphical way).
Nevertheless I do think that FASTA needs an output similar to the BLAST m8
one which is really compact. Although I haven't tried it yet I do believe
that both scripts can be piped, so one easy and rather fast way to produce
an tabular output from FASTA would be to pipe its output directly to one of
the scripts.
-- 

*Ioannis Kirmitzoglou*, MSc
PhD. Student,
Bioinformatics Research Laboratory
Department of Biological Sciences
University of Cyprus



On 21/04/07, Chris Fields <cjfields at uiuc.edu> wrote:
>
> Ioannis's fastm10_to_table script is available in the bugzilla
> enhancement request (as an attachment) if anyone's interested:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2278
>
> I haven't had a chance to really look into m10 output yet but it
> looks easy enough to parse; may not be hard to get something SearchIO-
> based up and running.
>
> chris
>
> On Apr 21, 2007, at 12:44 PM, Jason Stajich wrote:
>
> > We don't have one yet. This is a new format introduced in the most
> > recent release of FASTA.  Hopefully someone can make some time to add
> > some code to SearchIO::fasta for it.
> >
> > I do find that I when I need a fast FASTA to TAB converter that the
> > simple script (fastam9_to_table) is more efficient that SearchIO
> > framework so Ioannis is making a parallel one for the new m10
> > output.  So I think having both is useful.
> >
> > -jason
> > On Apr 21, 2007, at 10:14 AM, Hilmar Lapp wrote:
> >
> >> I haven't kept track of this - did this go anywhere? Do we not have
> >> an -m10 fasta output parser in SearchIO? (I.e., my first thought
> >> would be that that would be the desired solution; am I misled in
> >> this?)
> >>
> >>      -hilmar
> >>
> >> On Apr 19, 2007, at 10:06 AM, Ioannis Kirmitzoglou wrote:
> >>
> >>> I have reported it as a bug on the bugzilla but due to bugzilla
> >>> problems I
> >>> was not able to attach my code and/or sample m10 files.
> >>> Nevertheless here is the code that converts an m10 fasta output to
> >>> an m8
> >>> BLAST output which is parseable by the vast majority of software.
> >>>
> >>> <----------- CODE BEGINS HERE ------------------->
> >>>
> >>> #!/usr/bin/perl -w
> >>>
> >>> =head1 NAME
> >>>
> >>> fastam10_to_table  - turn FASTA -m 10 output into NCBI -m 8 tabular
> >>> output
> >>>
> >>> =head1 SYNOPSIS
> >>>
> >>>  fastam10_to_table [--header] [-o outfile] inputfile1 inputfile2 ...
> >>>
> >>> =head1 DESCRIPTION
> >>>
> >>> Command line options:
> >>>   --header                -- boolean flag to print column header
> >>>   -o/--out                -- optional outputfile to write data,
> >>>                              otherwise will write to STDOUT
> >>>   -h/--help               -- show this documentation
> >>>
> >>> Not technically a SearchIO script as this doesn't use any Bioperl
> >>> components but is a useful and fast.  The output is tabular output
> >>> with the standard NCBI -m8 columns.
> >>>
> >>>  queryname
> >>>  hit name
> >>>  percent identity
> >>>  alignment length
> >>>  number mismatches
> >>>  number gaps
> >>>  query start  (if on rev-strand start > end)
> >>>  query end
> >>>  hit start (if on rev-strand start > end)
> >>>  hit end
> >>>  evalue
> >>>  bit score
> >>>
> >>> Additionally 4 more columns are provided
> >>>  percent similar
> >>>  query length
> >>>  hit length
> >>>  query gaps
> >>>  hit gaps
> >>>
> >>> =head1 AUTHOR - Ioannis Kirmitzoglou
> >>>
> >>> Ioannis Kirmitzoglou IoannisKirmitzoglou_at_gmail-dot-org
> >>>
> >>> =head1 ACKNOWLEDGMENTS - Ioannis Kirmitzoglou
> >>>
> >>> Headers as well as portions of code were taken
> >>>> from fastam9_to_table.pl by Jason Stajich
> >>>
> >>> =head1 DISCLAIMER
> >>>
> >>> Copyright (c) <2007> <Ioannis Kirmitzolgou>
> >>>
> >>> Permission to use, copy, modify, merge, publish and distribute
> >>> this software and its documentation, with or without modification,
> >>> for any purpose, and without fee or royalty to the copyright holder
> >>> (s)
> >>> is hereby granted with no restictions and/or prerequisites.
> >>>
> >>> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> >>> EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> >>> MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> >>> NONINFRINGEMENT.
> >>> IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
> >>> CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
> >>> TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
> >>> SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
> >>>
> >>> =cut
> >>>
> >>> use strict;
> >>> use Getopt::Long;
> >>>
> >>> my %data=();
> >>>
> >>> my $outfile=''; my $header='';
> >>> GetOptions(
> >>>     'header'              => \$header,
> >>>     'o|out|outfile:s'     => \$outfile,
> >>>     'h|help'              => sub { exec('perldoc',$0); exit; }
> >>>        );
> >>>
> >>> my $outfh;
> >>> if( $outfile ) {
> >>>     open($outfh, ">$outfile") || die("$outfile: $!");
> >>> } else {
> >>>     $outfh = \*STDOUT;
> >>> }
> >>>
> >>>
> >>> $/="\n>>>";
> >>>
> >>> my @fields = qw(qname hname percid alen mmcount gapcount
> >>>         qstart qend hstart hend evalue bits percsim qlen hlen qgap
> >>> hgap);
> >>>
> >>> print $outfh "#",uc(join("", map{ sprintf("%-10s",$_) } @fields)),
> >>> "\n" if
> >>> $header;
> >>>
> >>> while (<>) {
> >>>
> >>>         chomp;
> >>>         if ($_=~/^>/ || $_=~/^\#/) {next;}
> >>>         my @hits = split(/\d+>>/, $_);
> >>>         @hits= split("\n>>", $hits[0]);
> >>>
> >>>         my $hit = shift @hits;
> >>>
> >>>         ($data{'qname'}, $data{'qlen'} ) = ($hit=~ (/(\S+)\,\s(\d
> >>> +)/));
> >>>
> >>>         foreach my $align (@hits) {
> >>>
> >>>             my @details= split ("\n>", $align);
> >>>            my $detail = shift @details;
> >>>             ($data{'hname'}) = ($detail =~ (/^(\S+)\s/));
> >>>             $detail=~ /\;\s(?:fa|sw)\_bits\:\s+(\S+)/;
> >>>             $data{'bits'}=$1;
> >>>             $detail=~ /\;\s(?:fa|sw)\_expect\:\s+(\S+)/;
> >>>             $data{'evalue'}=$1;
> >>>
> >>>             my $term = quotemeta("; sw_score");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'score'}=$1;
> >>>
> >>>             $term = quotemeta("; sw_ident:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'percid'}=$1;
> >>>
> >>>             $term = quotemeta("; sw_sim:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'percsim'}=$1;
> >>>
> >>>             $term = quotemeta("; sw_overlap:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'alen'}=$1;
> >>>
> >>>             $detail = shift @details;
> >>>
> >>>             $term = quotemeta("; al_start:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'qstart'}=$1;
> >>>
> >>>             $term = quotemeta("; al_stop:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'qend'}=$1;
> >>>
> >>>             $term = quotemeta("; al_display_start:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             my $lakis ='';
> >>>             $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
> >>>
> >>>             $data{'qgap'}=($1 =~ tr/\-//);
> >>>
> >>>             $detail = shift @details;
> >>>
> >>>             $term = quotemeta("; sq_len:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'hlen'}=$1;
> >>>
> >>>             $term = quotemeta("; al_start:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'hstart'}=$1;
> >>>
> >>>             $term = quotemeta("; al_stop:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term\s+(\S+)/;
> >>>             $data{'hend'}=$1;
> >>>
> >>>             $term = quotemeta("; al_display_start:");
> >>>             $term =~ s/\\ /\\s/;
> >>>             $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
> >>>             $data{'hgap'}=($1 =~ tr/-//);
> >>>             $data{'gapcount'} = $data{'qgap'} + $data{'hgap'};
> >>>             $data{'mmcount'} = $data{'alen'} - ( int($data
> >>> {'percid'} *
> >>> $data{'alen'}) + $data{'gapcount'});
> >>>
> >>> for ( $data{'percid'}, $data{'percsim'} ) {
> >>>     $_ = sprintf("%.2f",$_*100);
> >>> }
> >>>
> >>>             print $outfh join( "\t",map { $data{$_} } @fields),"\n"
> >>>         }
> >>>
> >>> }
> >>>
> >>> <----------------- CODE ENDS HERE ---------------------->
> >>>
> >>> --
> >>>
> >>> *Ioannis Kirmitzoglou*, MSc
> >>> PhD. Student,
> >>> Bioinformatics Research Laboratory
> >>> Department of Biological Sciences
> >>> University of Cyprus
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >> --
> >> ===========================================================
> >> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> >> ===========================================================
> >>
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > --
> > Jason Stajich
> > jason at bioperl.org
> > http://jason.open-bio.org/
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
>



More information about the Bioperl-l mailing list