[Bioperl-l] Converting blast+ output to gff (with gaps)

Jim Hu jimhu at tamu.edu
Fri Jan 4 21:57:38 UTC 2013


Malcolm,

Thanks, I should have reread the GFF3 spec before posting!

In the section on the Gap attribute, and in the later section on alignments, the spec discusses two ways to represent an alignment. I was originally thinking of something like the later example shown for cDNA vs genome, but the Gap attribute representation would be fine too. So I can see how the final output could be done in different ways, but I'm still stuck on how to get there.
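
To make the Gap attribute option concrete, here is a minimal sketch in Python (not BioPerl, and not an existing script) of how a Gap string could be built from the two aligned strings of one HSP. It assumes the GFF3 convention that the feature line sits on the reference sequence and the Target attribute names the other sequence, that M covers both matches and mismatches, and that this is a nucleotide vs nucleotide alignment. The function name gap_attribute is made up for illustration. The aligned strings themselves would have to come from an output format that keeps them; as far as I know the tabular formats can be asked for qseq and sseq columns, but I have not checked every blast+ version.

```python
from itertools import groupby


def gap_attribute(ref_aln: str, target_aln: str) -> str:
    """Build a GFF3 Gap string from two equal-length aligned strings.

    ref_aln is the aligned reference (the sequence named in GFF3 column 1),
    target_aln is the aligned target (the sequence named in Target=...).
    """
    if len(ref_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")

    ops = []
    for r, t in zip(ref_aln, target_aln):
        if r == "-" and t == "-":
            raise ValueError("alignment column with a gap in both sequences")
        if r == "-":
            ops.append("I")  # base only in target: gap inserted into reference
        elif t == "-":
            ops.append("D")  # base only in reference: gap inserted into target
        else:
            ops.append("M")  # aligned column, match or mismatch

    # Collapse runs of the same operation into <op><run length> tokens.
    return " ".join(f"{op}{len(list(run))}" for op, run in groupby(ops))
```

For example, gap_attribute("ACGT--ACGT", "ACGTTTAC-T") gives "M4 I2 M2 D1 M1".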

I don't have a specific application in mind; I'm mostly just trying to understand how to get from standalone blast+ output to things that look like the examples in the gff spec and the gbrowse documentation, i.e. a really basic display of gapped alignments. For my teaching, we do EST vs genomic blast and want gapped cDNA alignments to show where the introns go. My other work is with bacteria, where introns are rare, but there are times when I'd like to show an alignment that is interrupted by a transposable element, for example.

Excerpting from blastp -help

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1) 

Several of these are "lossy" in terms of where the actual gaps occur (e.g. 6). Others seem to me to be more human readable than suited for parsing. So I was hoping to get pointed to an existing script that would generate either the single feature with gap attribute OR the multi-line match features OR a combination from one of these output formats. 
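
To illustrate the multi-line option, here is a rough Python sketch (again not BioPerl, and not an existing script) that turns default tabular output (format 6 or 7) into one GFF3 line per HSP, with all HSPs of the same query/subject pair sharing an ID. It assumes the default 12-column order (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), a nucleotide search where a minus-strand hit is reported with sstart greater than send, and sequence names that need no GFF3 escaping. The function name hsps_to_gff3, the source label, and the match1, match2 IDs are invented for illustration. This only captures gaps between HSPs (e.g. introns), not the small gaps inside an HSP, so it is lossy in exactly the way described above.

```python
def hsps_to_gff3(tabular_lines, source="blastn", feature_type="cDNA_match"):
    """Yield one GFF3 line per HSP from default BLAST+ tabular output.

    HSPs belonging to the same (query, subject) pair get the same ID,
    so together they read as one discontinuous match feature.
    """
    ids = {}  # (qseqid, sseqid) -> shared feature ID
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the comment lines of format 7
        f = line.rstrip("\n").split("\t")
        qseqid, sseqid = f[0], f[1]
        qstart, qend, sstart, send = (int(x) for x in f[6:10])
        evalue = f[10]

        # Assumption: minus-strand subject hits have sstart > send.
        strand = "+" if sstart <= send else "-"
        start, end = sorted((sstart, send))

        match_id = ids.setdefault((qseqid, sseqid), f"match{len(ids) + 1}")
        attrs = f"ID={match_id};Target={qseqid} {qstart} {qend}"
        yield "\t".join(
            [sseqid, source, feature_type, str(start), str(end),
             evalue, strand, ".", attrs]
        )
```

Something like `print("\n".join(hsps_to_gff3(open("hits.tsv"))))` would then write the features, where hits.tsv is a placeholder file name.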

I'm probably missing something very, very obvious.

Best,

Jim


On Jan 4, 2013, at 2:20 PM, Cook, Malcolm wrote:

> Jim,
> 
> Getting to your original question:
> 
> >> I'm looking for a script that will take one of the blast+ outformats that includes the positions of gaps and mismatches, and create gff with appropriate subfeatures.
> 
> Exactly what/how do you want/expect to encode the blast output as GFF{1,2,2.5,3}??
> 
> If GFF3 per http://www.sequenceontology.org/gff3.shtml then are you hoping to get GFF3 marked up as described in section 'THE GAP ATTRIBUTE' or as in 'ALIGNMENTS'?
> 
> I would guess not because neither of them have 'subfeatures'.
> 
> If you could explain more fully with examples (hand cobbled or borrowed from someone else) of what you expect then I might have a better idea of what options might suit your needs.
> 
> 
> ~Malcolm
> 
> 
> .-----Original Message-----
> .From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Jim Hu
> .Sent: Friday, January 04, 2013 1:50 PM
> .To: Brian Osborne
> .Cc: Fields, Christopher J; Scott Cain; bioperl-l at bioperl.org
> .Subject: Re: [Bioperl-l] Converting blast+ output to gff (with gaps)
> .
> .Thanks for the replies, but...
> .
> .I can't tell what input formats for the blast results file are supported.  Format 11 and format 6 give no output and no feedback. Putting
> .some diagnostic print statements in the code suggests that I'm not getting any result objects from Bio::SearchIO.
> .
> .The script uses Bio::SearchIO, but does not seem to call the submodules for blast.  Documentation links on the wiki seem to be
> .broken, at least on this page:
> .
> .	http://www.bioperl.org/wiki/Module:Bio::SearchIO
> .
> .Jim
> .
> .
> .On Jan 2, 2013, at 4:53 PM, Brian Osborne wrote:
> .
> .> Scott and Chris,
> .>
> .> I'll test it and see...
> .>
> .> Brian O.
> .>
> .>
> .> On Jan 2, 2013, at 5:26 PM, "Fields, Christopher J" <cjfields at illinois.edu> wrote:
> .>
> .>> It should (I recall using it at one point).  If it doesn't we should fix it so it does.
> .>>
> .>> How does MAKER deal with this?  IIRC it uses (a modified) SearchIO-based method...
> .>>
> .>> chris
> .>>
> .>> On Jan 2, 2013, at 3:32 PM, Scott Cain <scott at scottcain.net> wrote:
> .>>
> .>>> Hi Brian,
> .>>>
> .>>> I was going to suggest the same thing--though that script is fairly
> .>>> old, it's not as old as the blast2gff script in the GBrowse
> .>>> distribution (which probably should be retired).  I believe it
> .>>> supports GFF3, though I don't have any sample data with which to test
> .>>> it to be sure.  I also don't know if it supports BLAST+ input--I
> .>>> haven't kept up with SearchIO (on which search2gff.pl depends); will
> .>>> it accept it?
> .>>>
> .>>> Scott
> .>>>
> .>>>
> .>>> On Wed, Jan 2, 2013 at 3:26 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
> .>>>> Here's one:
> .>>>>
> .>>>> https://github.com/GMOD/GBrowse/blob/master/contrib/blast2gff.pl
> .>>>>
> .>>>> Another one:
> .>>>>
> .>>>> ~/git/bioperl-live>head scripts/utilities/bp_search2gff.pl
> .>>>> #!perl
> .>>>>
> .>>>> # Author:      Jason Stajich <jason-at-bioperl-dot-org>
> .>>>> # Description: Turn SearchIO parseable report(s) into a GFF report
> .>>>> #
> .>>>> =head1 NAME
> .>>>>
> .>>>> bp_search2gff - Turn SearchIO parseable reports(s) into a GFF report
> .>>>>
> .>>>>
> .>>>>
> .>>>> Brian O.
> .>>>>
> .>>>> On Jan 2, 2013, at 2:44 PM, Jim Hu <jimhu at tamu.edu> wrote:
> .>>>>
> .>>>>> I assume this has already been done many times, but I can't seem to find it on bioperl.org or via google.
> .>>>>>
> .>>>>> I'm looking for a script that will take one of the blast+ outformats that includes the positions of gaps and mismatches, and
> .create gff with appropriate subfeatures.
> .>>>>>
> .>>>>> Thanks,
> .>>>>>
> .>>>>> Jim
> .>>>>> =====================================
> .>>>>> Jim Hu
> .>>>>> Professor
> .>>>>> Dept. of Biochemistry and Biophysics
> .>>>>> 2128 TAMU
> .>>>>> Texas A&M Univ.
> .>>>>> College Station, TX 77843-2128
> .>>>>> 979-862-4054
> .>>>>>
> .>>>>>
> .>>>>>
> .>>>>> _______________________________________________
> .>>>>> Bioperl-l mailing list
> .>>>>> Bioperl-l at lists.open-bio.org
> .>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> .>>>>
> .>>>>
> .>>>
> .>>>
> .>>>
> .>>> --
> .>>> ------------------------------------------------------------------------
> .>>> Scott Cain, Ph. D.                                   scott at scottcain dot net
> .>>> GMOD Coordinator (http://gmod.org/)                     216-392-3087
> .>>> Ontario Institute for Cancer Research
> .>>
> .>
> .
> .=====================================
> .Jim Hu
> .Professor
> .Dept. of Biochemistry and Biophysics
> .2128 TAMU
> .Texas A&M Univ.
> .College Station, TX 77843-2128
> .979-862-4054
> .
> .
> .

=====================================
Jim Hu
Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054





