[Bioperl-l] Parsing FASTA m10 output
Chris Fields
cjfields at uiuc.edu
Sat Apr 21 20:09:48 UTC 2007
Ioannis's fastm10_to_table script is available in the bugzilla
enhancement request (as an attachment) if anyone's interested:
http://bugzilla.open-bio.org/show_bug.cgi?id=2278
I haven't had a chance to really look into m10 output yet but it
looks easy enough to parse; may not be hard to get something SearchIO-
based up and running.
chris
On Apr 21, 2007, at 12:44 PM, Jason Stajich wrote:
> We don't have one yet. This is a new format introduced in the most
> recent release of FASTA. Hopefully someone can make some time to add
> some code to SearchIO::fasta for it.
>
> I do find that I when I need a fast FASTA to TAB converter that the
> simple script (fastam9_to_table) is more efficient that SearchIO
> framework so Ioannis is making a parallel one for the new m10
> output. So I think having both is useful.
>
> -jason
> On Apr 21, 2007, at 10:14 AM, Hilmar Lapp wrote:
>
>> I haven't kept track of this - did this go anywhere? Do we not have
>> an -m10 fasta output parser in SearchIO? (I.e., my first thought
>> would be that that would be the desired solution; am I misled in
>> this?)
>>
>> -hilmar
>>
>> On Apr 19, 2007, at 10:06 AM, Ioannis Kirmitzoglou wrote:
>>
>>> I have reported it as a bug on the bugzilla but due to bugzilla
>>> problems I
>>> was not able to attach my code and/or sample m10 files.
>>> Nevertheless here is the code that converts an m10 fasta output to
>>> an m8
>>> BLAST output which is parseable by the vast majority of software.
>>>
>>> <----------- CODE BEGINS HERE ------------------->
>>>
>>> #!/usr/bin/perl -w
>>>
>>> =head1 NAME
>>>
>>> fastam10_to_table - turn FASTA -m 10 output into NCBI -m 8 tabular
>>> output
>>>
>>> =head1 SYNOPSIS
>>>
>>> fastam10_to_table [--header] [-o outfile] inputfile1 inputfile2 ...
>>>
>>> =head1 DESCRIPTION
>>>
>>> Command line options:
>>> --header -- boolean flag to print column header
>>> -o/--out -- optional outputfile to write data,
>>> otherwise will write to STDOUT
>>> -h/--help -- show this documentation
>>>
>>> Not technically a SearchIO script as this doesn't use any Bioperl
>>> components but is a useful and fast. The output is tabular output
>>> with the standard NCBI -m8 columns.
>>>
>>> queryname
>>> hit name
>>> percent identity
>>> alignment length
>>> number mismatches
>>> number gaps
>>> query start (if on rev-strand start > end)
>>> query end
>>> hit start (if on rev-strand start > end)
>>> hit end
>>> evalue
>>> bit score
>>>
>>> Additionally 4 more columns are provided
>>> percent similar
>>> query length
>>> hit length
>>> query gaps
>>> hit gaps
>>>
>>> =head1 AUTHOR - Ioannis Kirmitzoglou
>>>
>>> Ioannis Kirmitzoglou IoannisKirmitzoglou_at_gmail-dot-org
>>>
>>> =head1 ACKNOWLEDGMENTS - Ioannis Kirmitzoglou
>>>
>>> Headers as well as portions of code were taken
>>>> from fastam9_to_table.pl by Jason Stajich
>>>
>>> =head1 DISCLAIMER
>>>
>>> Copyright (c) <2007> <Ioannis Kirmitzolgou>
>>>
>>> Permission to use, copy, modify, merge, publish and distribute
>>> this software and its documentation, with or without modification,
>>> for any purpose, and without fee or royalty to the copyright holder
>>> (s)
>>> is hereby granted with no restictions and/or prerequisites.
>>>
>>> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>>> EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
>>> MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
>>> NONINFRINGEMENT.
>>> IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
>>> CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
>>> TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
>>> SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
>>>
>>> =cut
>>>
>>> use strict;
>>> use Getopt::Long;
>>>
>>> my %data=();
>>>
>>> my $outfile=''; my $header='';
>>> GetOptions(
>>> 'header' => \$header,
>>> 'o|out|outfile:s' => \$outfile,
>>> 'h|help' => sub { exec('perldoc',$0); exit; }
>>> );
>>>
>>> my $outfh;
>>> if( $outfile ) {
>>> open($outfh, ">$outfile") || die("$outfile: $!");
>>> } else {
>>> $outfh = \*STDOUT;
>>> }
>>>
>>>
>>> $/="\n>>>";
>>>
>>> my @fields = qw(qname hname percid alen mmcount gapcount
>>> qstart qend hstart hend evalue bits percsim qlen hlen qgap
>>> hgap);
>>>
>>> print $outfh "#",uc(join("", map{ sprintf("%-10s",$_) } @fields)),
>>> "\n" if
>>> $header;
>>>
>>> while (<>) {
>>>
>>> chomp;
>>> if ($_=~/^>/ || $_=~/^\#/) {next;}
>>> my @hits = split(/\d+>>/, $_);
>>> @hits= split("\n>>", $hits[0]);
>>>
>>> my $hit = shift @hits;
>>>
>>> ($data{'qname'}, $data{'qlen'} ) = ($hit=~ (/(\S+)\,\s(\d
>>> +)/));
>>>
>>> foreach my $align (@hits) {
>>>
>>> my @details= split ("\n>", $align);
>>> my $detail = shift @details;
>>> ($data{'hname'}) = ($detail =~ (/^(\S+)\s/));
>>> $detail=~ /\;\s(?:fa|sw)\_bits\:\s+(\S+)/;
>>> $data{'bits'}=$1;
>>> $detail=~ /\;\s(?:fa|sw)\_expect\:\s+(\S+)/;
>>> $data{'evalue'}=$1;
>>>
>>> my $term = quotemeta("; sw_score");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'score'}=$1;
>>>
>>> $term = quotemeta("; sw_ident:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'percid'}=$1;
>>>
>>> $term = quotemeta("; sw_sim:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'percsim'}=$1;
>>>
>>> $term = quotemeta("; sw_overlap:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'alen'}=$1;
>>>
>>> $detail = shift @details;
>>>
>>> $term = quotemeta("; al_start:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'qstart'}=$1;
>>>
>>> $term = quotemeta("; al_stop:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'qend'}=$1;
>>>
>>> $term = quotemeta("; al_display_start:");
>>> $term =~ s/\\ /\\s/;
>>> my $lakis ='';
>>> $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
>>>
>>> $data{'qgap'}=($1 =~ tr/\-//);
>>>
>>> $detail = shift @details;
>>>
>>> $term = quotemeta("; sq_len:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'hlen'}=$1;
>>>
>>> $term = quotemeta("; al_start:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'hstart'}=$1;
>>>
>>> $term = quotemeta("; al_stop:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term\s+(\S+)/;
>>> $data{'hend'}=$1;
>>>
>>> $term = quotemeta("; al_display_start:");
>>> $term =~ s/\\ /\\s/;
>>> $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
>>> $data{'hgap'}=($1 =~ tr/-//);
>>> $data{'gapcount'} = $data{'qgap'} + $data{'hgap'};
>>> $data{'mmcount'} = $data{'alen'} - ( int($data
>>> {'percid'} *
>>> $data{'alen'}) + $data{'gapcount'});
>>>
>>> for ( $data{'percid'}, $data{'percsim'} ) {
>>> $_ = sprintf("%.2f",$_*100);
>>> }
>>>
>>> print $outfh join( "\t",map { $data{$_} } @fields),"\n"
>>> }
>>>
>>> }
>>>
>>> <----------------- CODE ENDS HERE ---------------------->
>>>
>>> --
>>>
>>> *Ioannis Kirmitzoglou*, MSc
>>> PhD. Student,
>>> Bioinformatics Research Laboratory
>>> Department of Biological Sciences
>>> University of Cyprus
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> --
>> ===========================================================
>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>> ===========================================================
>>
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> --
> Jason Stajich
> jason at bioperl.org
> http://jason.open-bio.org/
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
More information about the Bioperl-l
mailing list