[Bioperl-l] Converting blast+ output to gff (with gaps)
Cook, Malcolm
MEC at stowers.org
Fri Jan 4 23:33:04 UTC 2013
Jim,
To get a working example for further discussion....
Assuming you have bioperl and blast+ installed....
This command gets a known transcript from ncbi and blasts it back at ncbi into fly genome:
> bp_download_query_genbank.pl --query 'NM_001259364' | blastn -remote -db refseq_genomic -entrez_query 'melanogaster[Taxid]' -outfmt 6 > test.blast.tab
Take a look at the results (see below, or run it yourself).
First line output looks like this:
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 2127 0 0 1869 3995 9330023 9327897 0.0 3928
Old GFF did not have parent-child. Depending on where you are going with this, you MIGHT want to use this strategy: http://gmod.org/wiki/GFF#Alignments
Where the ID in column 9 serves to group the features. There is an 'implicit' parent.
I think this would be respected by both GBrowse and UCSC genome browser (but I'm not positive).
If so, converting blast output would yield one line output per each line input.
You would need to assign a strand, swap sstart and send where sstart>send, and some slight re-formatting
First line of GFF might look like:
NT_033778.3 est EST_match 9330023 9327897 . + . ID=Match1;Name= NM_001259364;Target= NM_001259364 1869 3995
Is this along the lines of what you need?
~Malcolm
Complete example blast+ output looks like this:
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 2127 0 0 1869 3995 9330023 9327897 0.0 3928
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 1019 0 0 3995 5013 9327834 9326816 0.0 1882
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 374 0 0 345 718 9332643 9332270 0.0 691
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 300 0 0 1 300 9337978 9337679 1e-154 555
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 270 0 0 717 986 9332201 9331932 5e-138 499
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 243 0 0 1434 1676 9331200 9330958 5e-123 449
NM_001259364 gi|116010442|ref|NT_033778.3| 99.55 223 0 1 1154 1376 9331611 9331390 1e-109 405
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 196 0 0 1675 1870 9330634 9330439 7e-97 363
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 169 0 0 986 1154 9331847 9331679 7e-82 313
NM_001259364 gi|116010442|ref|NT_033778.3| 97.26 73 2 0 1372 1444 9331322 9331250 3e-25 124
NM_001259364 gi|116010442|ref|NT_033778.3| 100.00 44 0 0 301 344 9333095 9333052 2e-12 82.4
NM_001259364 gi|195586644|ref|NT_167067.1| 96.52 2127 74 0 1869 3995 7777827 7775701 0.0 3518
NM_001259364 gi|195586644|ref|NT_167067.1| 97.03 404 7 3 4004 4402 7775629 7775226 0.0 675
NM_001259364 gi|195586644|ref|NT_167067.1| 93.58 374 24 0 345 718 7780392 7780019 8e-156 558
NM_001259364 gi|195586644|ref|NT_167067.1| 94.37 302 15 1 1 300 7786262 7785961 6e-127 462
NM_001259364 gi|195586644|ref|NT_167067.1| 96.25 267 10 0 720 986 7779949 7779683 1e-119 438
NM_001259364 gi|195586644|ref|NT_167067.1| 95.00 240 12 0 1434 1673 7778959 7778720 2e-101 377
NM_001259364 gi|195586644|ref|NT_167067.1| 96.86 223 6 1 1154 1376 7779370 7779149 1e-99 372
NM_001259364 gi|195586644|ref|NT_167067.1| 93.88 196 12 0 1675 1870 7778384 7778189 7e-77 296
NM_001259364 gi|195586644|ref|NT_167067.1| 95.19 187 5 4 4827 5013 7774635 7774453 9e-76 292
NM_001259364 gi|195586644|ref|NT_167067.1| 92.86 168 12 0 987 1154 7779598 7779431 3e-61 244
NM_001259364 gi|195586644|ref|NT_167067.1| 92.86 70 5 0 1374 1443 7779079 7779010 2e-18 102
NM_001259364 gi|195586644|ref|NT_167067.1| 100.00 44 0 0 301 344 7780833 7780790 2e-12 82.4
NM_001259364 gi|195489961|ref|NT_167063.1| 92.43 2127 161 0 1869 3995 7628910 7631036 0.0 3037
NM_001259364 gi|195489961|ref|NT_167063.1| 91.94 1030 38 20 3995 5013 7631102 7632097 0.0 1400
NM_001259364 gi|195489961|ref|NT_167063.1| 88.56 376 39 4 345 718 7626215 7626588 4e-124 453
NM_001259364 gi|195489961|ref|NT_167063.1| 93.67 300 18 1 1 300 7620777 7621075 2e-122 448
NM_001259364 gi|195489961|ref|NT_167063.1| 95.41 196 9 0 1675 1870 7628263 7628458 7e-82 313
NM_001259364 gi|195489961|ref|NT_167063.1| 91.93 223 17 1 1154 1376 7627224 7627445 2e-81 311
NM_001259364 gi|195489961|ref|NT_167063.1| 86.42 265 31 3 720 983 7626653 7626913 1e-73 285
NM_001259364 gi|195489961|ref|NT_167063.1| 86.18 246 25 4 1434 1673 7627635 7627877 3e-65 257
NM_001259364 gi|195489961|ref|NT_167063.1| 88.27 162 19 0 993 1154 7626999 7627160 3e-46 195
NM_001259364 gi|195489961|ref|NT_167063.1| 93.06 72 5 0 1372 1443 7627513 7627584 1e-19 106
NM_001259364 gi|195489961|ref|NT_167063.1| 100.00 44 0 0 301 344 7625770 7625813 2e-12 82.4
~Malcolm
.-----Original Message-----
.From: Jim Hu [mailto:jimhu at tamu.edu]
.Sent: Friday, January 04, 2013 3:58 PM
.To: Cook, Malcolm
.Cc: 'Brian Osborne'; 'Fields, Christopher J'; 'Scott Cain'; 'bioperl-l at bioperl.org'
.Subject: Re: [Bioperl-l] Converting blast+ output to gff (with gaps)
.
.Malcolm,
.
.Thanks, I should have reread the GFF3 spec before posting!
.
.In the section on the Gap attrribute and below on alignments it discusses two ways to represent an alignment. I was originally thinking
.of something like the later example shown for cDNA vs genome. But the gap attribute representation would be fine too. So, I can see
.how the final output could be done in different ways, but I'm still stuck on how to get there.
.
.I don't have a specific application in mind; I'm mostly just trying to understand how to get from having standalone blast+ output to get
.to things that look like the examples in the gff spec and the gbrowse documentation - really basic display of alignments that are
.gapped. For my teaching, we do EST vs genomic blast and want gapped cDNA alignments to show where the introns go. My other
.work is with bacteria where introns are rare, but there are times when I'd like to show an alignment that is interrupted by a
.transposable element, for example.
.
.Excerpting from blastp -help
.
. *** Formatting options
. -outfmt <String>
. alignment view options:
. 0 = pairwise,
. 1 = query-anchored showing identities,
. 2 = query-anchored no identities,
. 3 = flat query-anchored, show identities,
. 4 = flat query-anchored, no identities,
. 5 = XML Blast output,
. 6 = tabular,
. 7 = tabular with comment lines,
. 8 = Text ASN.1,
. 9 = Binary ASN.1,
. 10 = Comma-separated values,
. 11 = BLAST archive format (ASN.1)
.
.Several of these are "lossy" in terms of where the actual gaps occur (e.g. 6). Others seem to me to be more human readable than
.suited for parsing. So I was hoping to get pointed to an existing script that would generate either the single feature with gap attribute
.OR the multi-line match features OR a combination from one of these output formats.
.
.I'm probably missing something very, very obvious.
.
.Best,
.
.Jim
.
.
.On Jan 4, 2013, at 2:20 PM, Cook, Malcolm wrote:
.
.> Jim,
.>
.> Getting to your original question:
.>
.>> I'm looking for a script that will take one of the blast+ outformats that includes the positions of gaps and mismatches, and .create
.gff with appropriate subfeatures.
.>
.> Exactly what/how do you want/expect to encode the blast output as GFF{1,2,2.5,3}??
.>
.> If GFF3 pe http://www.sequenceontology.org/gff3.shtml then are you hoping to get GFF3 marked up as described in section 'THE
.GAP ATTRIBUTE' or as in 'ALIGNMENTS'
.>
.> I would guess not because neither of them have 'subfeatures'.
.>
.> If you could explain more fully with examples (hand cobbled or borrowed from someone else) of what you expect then I might have
.a better idea of what options might suit your needs.
.>
.>
.> ~Malcolm
.>
.>
.> .-----Original Message-----
.> .From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Jim Hu
.> .Sent: Friday, January 04, 2013 1:50 PM
.> .To: Brian Osborne
.> .Cc: Fields, Christopher J; Scott Cain; bioperl-l at bioperl.org
.> .Subject: Re: [Bioperl-l] Converting blast+ output to gff (with gaps)
.> .
.> .Thanks for the replies, but...
.> .
.> .I can't tell what input formats for the blast results file are supported. Format 11 and format 6 give no output and no feedback.
.Putting
.> .some diagnostic print statements in the code suggests that I'm not getting any result objects from Bio::SearchIO.
.> .
.> .The script uses Bio::SearchIO, but does not seem to call the submodules for blast. Documentation links on the wiki seem to be
.> .broken, at least on this page:
.> .
.> . http://www.bioperl.org/wiki/Module:Bio::SearchIO
.> .
.> .Jim
.> .
.> .
.> .On Jan 2, 2013, at 4:53 PM, Brian Osborne wrote:
.> .
.> .> Scott and Chris,
.> .>
.> .> I'll test it and see...
.> .>
.> .> Brian O.
.> .>
.> .>
.> .> On Jan 2, 2013, at 5:26 PM, "Fields, Christopher J" <cjfields at illinois.edu> wrote:
.> .>
.> .>> It should (I recall using it at one point). If it doesn't we should fix it so it does.
.> .>>
.> .>> How does MAKER deal with this? IIRC it uses (a modified) SearchIO-based method...
.> .>>
.> .>> chris
.> .>>
.> .>> On Jan 2, 2013, at 3:32 PM, Scott Cain <scott at scottcain.net> wrote:
.> .>>
.> .>>> Hi Brian,
.> .>>>
.> .>>> I was going to suggest the same thing--though that script is fairly
.> .>>> old, it's not as old as the blast2gff script in the GBrowse
.> .>>> distribution (which probably should be retired). I believe it
.> .>>> supports GFF3, though I don't have any sample data with which to test
.> .>>> it to be sure. I also don't know if it supports BLAST+ input--I
.> .>>> haven't kept up with SearchIO (on which search2gff.pl depends); will
.> .>>> it accept it?
.> .>>>
.> .>>> Scott
.> .>>>
.> .>>>
.> .>>> On Wed, Jan 2, 2013 at 3:26 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
.> .>>>> Here's one:
.> .>>>>
.> .>>>> https://github.com/GMOD/GBrowse/blob/master/contrib/blast2gff.pl
.> .>>>>
.> .>>>> Another one:
.> .>>>>
.> .>>>> ~/git/bioperl-live>head scripts/utilities/bp_search2gff.pl
.> .>>>> #!perl
.> .>>>>
.> .>>>> # Author: Jason Stajich <jason-at-bioperl-dot-org>
.> .>>>> # Description: Turn SearchIO parseable report(s) into a GFF report
.> .>>>> #
.> .>>>> =head1 NAME
.> .>>>>
.> .>>>> bp_search2gff - Turn SearchIO parseable reports(s) into a GFF report
.> .>>>>
.> .>>>>
.> .>>>>
.> .>>>> Brian O.
.> .>>>>
.> .>>>> On Jan 2, 2013, at 2:44 PM, Jim Hu <jimhu at tamu.edu> wrote:
.> .>>>>
.> .>>>>> I assume this has already been done many times, but I can't seem to find it on bioperl.org or via google.
.> .>>>>>
.> .>>>>> I'm looking for a script that will take one of the blast+ outformats that includes the positions of gaps and mismatches, and
.> .create gff with appropriate subfeatures.
.> .>>>>>
.> .>>>>> Thanks,
.> .>>>>>
.> .>>>>> Jim
.> .>>>>> =====================================
.> .>>>>> Jim Hu
.> .>>>>> Professor
.> .>>>>> Dept. of Biochemistry and Biophysics
.> .>>>>> 2128 TAMU
.> .>>>>> Texas A&M Univ.
.> .>>>>> College Station, TX 77843-2128
.> .>>>>> 979-862-4054
.> .>>>>>
.> .>>>>>
.> .>>>>>
.> .>>>>> _______________________________________________
.> .>>>>> Bioperl-l mailing list
.> .>>>>> Bioperl-l at lists.open-bio.org
.> .>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
.> .>>>>
.> .>>>>
.> .>>>> _______________________________________________
.> .>>>> Bioperl-l mailing list
.> .>>>> Bioperl-l at lists.open-bio.org
.> .>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
.> .>>>
.> .>>>
.> .>>>
.> .>>> --
.> .>>> ------------------------------------------------------------------------
.> .>>> Scott Cain, Ph. D. scott at scottcain dot net
.> .>>> GMOD Coordinator (http://gmod.org/) 216-392-3087
.> .>>> Ontario Institute for Cancer Research
.> .>>> _______________________________________________
.> .>>> Bioperl-l mailing list
.> .>>> Bioperl-l at lists.open-bio.org
.> .>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
.> .>>
.> .>
.> .
.> .=====================================
.> .Jim Hu
.> .Professor
.> .Dept. of Biochemistry and Biophysics
.> .2128 TAMU
.> .Texas A&M Univ.
.> .College Station, TX 77843-2128
.> .979-862-4054
.> .
.> .
.> .
.> ._______________________________________________
.> .Bioperl-l mailing list
.> .Bioperl-l at lists.open-bio.org
.> .http://lists.open-bio.org/mailman/listinfo/bioperl-l
.
.=====================================
.Jim Hu
.Professor
.Dept. of Biochemistry and Biophysics
.2128 TAMU
.Texas A&M Univ.
.College Station, TX 77843-2128
.979-862-4054
.
More information about the Bioperl-l
mailing list