[Bioperl-l] Converting blast+ output to gff (with gaps)

Cook, Malcolm MEC at stowers.org
Fri Jan 4 23:33:04 UTC 2013


Jim,

To get a working example for further discussion....

Assuming you have bioperl and blast+ installed....

This command gets a known transcript from ncbi and blasts it back at ncbi into fly genome:

> bp_download_query_genbank.pl --query 'NM_001259364' | blastn -remote -db refseq_genomic  -entrez_query 'melanogaster[Taxid]'  -outfmt 6  > test.blast.tab

Take a look at the results (see below, or run it yourself).

First line output looks like this:

	NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	2127	0	0	1869	3995	9330023	9327897	0.0	3928

Old GFF did not have parent-child.  Depending on where you are going with this, you MIGHT want to use this strategy: http://gmod.org/wiki/GFF#Alignments

Where the ID in column 9 serves to group the features.  There is an 'implicit' parent.

I think this would be respected by both GBrowse and UCSC genome browser (but I'm not positive).

If so, converting blast output would yield one line output per each line input.

You would need to assign a strand, swap sstart and send where sstart>send, and some slight re-formatting

First line of GFF might look like:

	NT_033778.3 est EST_match 9330023	9327897  . + . ID=Match1;Name= NM_001259364;Target= NM_001259364 1869	3995

Is this along the lines of what you need?

~Malcolm

Complete example blast+ output looks like this:

NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	2127	0	0	1869	3995	9330023	9327897	0.0	3928
NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	1019	0	0	3995	5013	9327834	9326816	0.0	1882
NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	374	0	0	345	718	9332643	9332270	0.0	 691
NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	300	0	0	1	300	9337978	9337679	1e-154	 555
NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	270	0	0	717	986	9332201	9331932	5e-138	 499
NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	243	0	0	1434	1676	9331200	9330958	5e-123	 449
NM_001259364	gi|116010442|ref|NT_033778.3|	99.55	223	0	1	1154	1376	9331611	9331390	1e-109	 405
NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	196	0	0	1675	1870	9330634	9330439	7e-97	 363
NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	169	0	0	986	1154	9331847	9331679	7e-82	 313
NM_001259364	gi|116010442|ref|NT_033778.3|	97.26	73	2	0	1372	1444	9331322	9331250	3e-25	 124
NM_001259364	gi|116010442|ref|NT_033778.3|	100.00	44	0	0	301	344	9333095	9333052	2e-12	82.4
NM_001259364	gi|195586644|ref|NT_167067.1|	96.52	2127	74	0	1869	3995	7777827	7775701	0.0	3518
NM_001259364	gi|195586644|ref|NT_167067.1|	97.03	404	7	3	4004	4402	7775629	7775226	0.0	 675
NM_001259364	gi|195586644|ref|NT_167067.1|	93.58	374	24	0	345	718	7780392	7780019	8e-156	 558
NM_001259364	gi|195586644|ref|NT_167067.1|	94.37	302	15	1	1	300	7786262	7785961	6e-127	 462
NM_001259364	gi|195586644|ref|NT_167067.1|	96.25	267	10	0	720	986	7779949	7779683	1e-119	 438
NM_001259364	gi|195586644|ref|NT_167067.1|	95.00	240	12	0	1434	1673	7778959	7778720	2e-101	 377
NM_001259364	gi|195586644|ref|NT_167067.1|	96.86	223	6	1	1154	1376	7779370	7779149	1e-99	 372
NM_001259364	gi|195586644|ref|NT_167067.1|	93.88	196	12	0	1675	1870	7778384	7778189	7e-77	 296
NM_001259364	gi|195586644|ref|NT_167067.1|	95.19	187	5	4	4827	5013	7774635	7774453	9e-76	 292
NM_001259364	gi|195586644|ref|NT_167067.1|	92.86	168	12	0	987	1154	7779598	7779431	3e-61	 244
NM_001259364	gi|195586644|ref|NT_167067.1|	92.86	70	5	0	1374	1443	7779079	7779010	2e-18	 102
NM_001259364	gi|195586644|ref|NT_167067.1|	100.00	44	0	0	301	344	7780833	7780790	2e-12	82.4
NM_001259364	gi|195489961|ref|NT_167063.1|	92.43	2127	161	0	1869	3995	7628910	7631036	0.0	3037
NM_001259364	gi|195489961|ref|NT_167063.1|	91.94	1030	38	20	3995	5013	7631102	7632097	0.0	1400
NM_001259364	gi|195489961|ref|NT_167063.1|	88.56	376	39	4	345	718	7626215	7626588	4e-124	 453
NM_001259364	gi|195489961|ref|NT_167063.1|	93.67	300	18	1	1	300	7620777	7621075	2e-122	 448
NM_001259364	gi|195489961|ref|NT_167063.1|	95.41	196	9	0	1675	1870	7628263	7628458	7e-82	 313
NM_001259364	gi|195489961|ref|NT_167063.1|	91.93	223	17	1	1154	1376	7627224	7627445	2e-81	 311
NM_001259364	gi|195489961|ref|NT_167063.1|	86.42	265	31	3	720	983	7626653	7626913	1e-73	 285
NM_001259364	gi|195489961|ref|NT_167063.1|	86.18	246	25	4	1434	1673	7627635	7627877	3e-65	 257
NM_001259364	gi|195489961|ref|NT_167063.1|	88.27	162	19	0	993	1154	7626999	7627160	3e-46	 195
NM_001259364	gi|195489961|ref|NT_167063.1|	93.06	72	5	0	1372	1443	7627513	7627584	1e-19	 106
NM_001259364	gi|195489961|ref|NT_167063.1|	100.00	44	0	0	301	344	7625770	7625813	2e-12	82.4


~Malcolm


 .-----Original Message-----
 .From: Jim Hu [mailto:jimhu at tamu.edu]
 .Sent: Friday, January 04, 2013 3:58 PM
 .To: Cook, Malcolm
 .Cc: 'Brian Osborne'; 'Fields, Christopher J'; 'Scott Cain'; 'bioperl-l at bioperl.org'
 .Subject: Re: [Bioperl-l] Converting blast+ output to gff (with gaps)
 .
 .Malcolm,
 .
 .Thanks, I should have reread the GFF3 spec before posting!
 .
 .In the section on the Gap attrribute and below on alignments it discusses two ways to represent an alignment. I was originally thinking
 .of something like the later example shown for cDNA vs genome. But the gap attribute representation would be fine too. So, I can see
 .how the final output could be done in different ways, but I'm still stuck on how to get there.
 .
 .I don't have a specific application in mind; I'm mostly just trying to understand how to get from having standalone blast+ output to get
 .to things that look like the examples in the gff spec and the gbrowse documentation - really basic display of alignments that are
 .gapped. For my teaching, we do EST vs genomic blast and want gapped cDNA alignments to show where the introns go. My other
 .work is with bacteria where introns are rare, but there are times when I'd like to show an alignment that is interrupted by a
 .transposable element, for example.
 .
 .Excerpting from blastp -help
 .
 . *** Formatting options
 . -outfmt <String>
 .   alignment view options:
 .     0 = pairwise,
 .     1 = query-anchored showing identities,
 .     2 = query-anchored no identities,
 .     3 = flat query-anchored, show identities,
 .     4 = flat query-anchored, no identities,
 .     5 = XML Blast output,
 .     6 = tabular,
 .     7 = tabular with comment lines,
 .     8 = Text ASN.1,
 .     9 = Binary ASN.1,
 .    10 = Comma-separated values,
 .    11 = BLAST archive format (ASN.1)
 .
 .Several of these are "lossy" in terms of where the actual gaps occur (e.g. 6). Others seem to me to be more human readable than
 .suited for parsing. So I was hoping to get pointed to an existing script that would generate either the single feature with gap attribute
 .OR the multi-line match features OR a combination from one of these output formats.
 .
 .I'm probably missing something very, very obvious.
 .
 .Best,
 .
 .Jim
 .
 .
 .On Jan 4, 2013, at 2:20 PM, Cook, Malcolm wrote:
 .
 .> Jim,
 .>
 .> Getting to your original question:
 .>
 .>> I'm looking for a script that will take one of the blast+ outformats that includes the positions of gaps and mismatches, and .create
 .gff with appropriate subfeatures.
 .>
 .> Exactly what/how do you want/expect to encode the blast output as GFF{1,2,2.5,3}??
 .>
 .> If GFF3 pe http://www.sequenceontology.org/gff3.shtml then are you hoping to get GFF3 marked up as described in section 'THE
 .GAP ATTRIBUTE' or as in 'ALIGNMENTS'
 .>
 .> I would guess not because neither of them have 'subfeatures'.
 .>
 .> If you could explain more fully with examples (hand cobbled or borrowed from someone else) of what you expect then I might have
 .a better idea of what options might suit your needs.
 .>
 .>
 .> ~Malcolm
 .>
 .>
 .> .-----Original Message-----
 .> .From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Jim Hu
 .> .Sent: Friday, January 04, 2013 1:50 PM
 .> .To: Brian Osborne
 .> .Cc: Fields, Christopher J; Scott Cain; bioperl-l at bioperl.org
 .> .Subject: Re: [Bioperl-l] Converting blast+ output to gff (with gaps)
 .> .
 .> .Thanks for the replies, but...
 .> .
 .> .I can't tell what input formats for the blast results file are supported.  Format 11 and format 6 give no output and no feedback.
 .Putting
 .> .some diagnostic print statements in the code suggests that I'm not getting any result objects from Bio::SearchIO.
 .> .
 .> .The script uses Bio::SearchIO, but does not seem to call the submodules for blast.  Documentation links on the wiki seem to be
 .> .broken, at least on this page:
 .> .
 .> .	http://www.bioperl.org/wiki/Module:Bio::SearchIO
 .> .
 .> .Jim
 .> .
 .> .
 .> .On Jan 2, 2013, at 4:53 PM, Brian Osborne wrote:
 .> .
 .> .> Scott and Chris,
 .> .>
 .> .> I'll test it and see...
 .> .>
 .> .> Brian O.
 .> .>
 .> .>
 .> .> On Jan 2, 2013, at 5:26 PM, "Fields, Christopher J" <cjfields at illinois.edu> wrote:
 .> .>
 .> .>> It should (I recall using it at one point).  If it doesn't we should fix it so it does.
 .> .>>
 .> .>> How does MAKER deal with this?  IIRC it uses (a modified) SearchIO-based method...
 .> .>>
 .> .>> chris
 .> .>>
 .> .>> On Jan 2, 2013, at 3:32 PM, Scott Cain <scott at scottcain.net> wrote:
 .> .>>
 .> .>>> Hi Brian,
 .> .>>>
 .> .>>> I was going to suggest the same thing--though that script is fairly
 .> .>>> old, it's not as old as the blast2gff script in the GBrowse
 .> .>>> distribution (which probably should be retired).  I believe it
 .> .>>> supports GFF3, though I don't have any sample data with which to test
 .> .>>> it to be sure.  I also don't know if it supports BLAST+ input--I
 .> .>>> haven't kept up with SearchIO (on which search2gff.pl depends); will
 .> .>>> it accept it?
 .> .>>>
 .> .>>> Scott
 .> .>>>
 .> .>>>
 .> .>>> On Wed, Jan 2, 2013 at 3:26 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
 .> .>>>> Here's one:
 .> .>>>>
 .> .>>>> https://github.com/GMOD/GBrowse/blob/master/contrib/blast2gff.pl
 .> .>>>>
 .> .>>>> Another one:
 .> .>>>>
 .> .>>>> ~/git/bioperl-live>head scripts/utilities/bp_search2gff.pl
 .> .>>>> #!perl
 .> .>>>>
 .> .>>>> # Author:      Jason Stajich <jason-at-bioperl-dot-org>
 .> .>>>> # Description: Turn SearchIO parseable report(s) into a GFF report
 .> .>>>> #
 .> .>>>> =head1 NAME
 .> .>>>>
 .> .>>>> bp_search2gff - Turn SearchIO parseable reports(s) into a GFF report
 .> .>>>>
 .> .>>>>
 .> .>>>>
 .> .>>>> Brian O.
 .> .>>>>
 .> .>>>> On Jan 2, 2013, at 2:44 PM, Jim Hu <jimhu at tamu.edu> wrote:
 .> .>>>>
 .> .>>>>> I assume this has already been done many times, but I can't seem to find it on bioperl.org or via google.
 .> .>>>>>
 .> .>>>>> I'm looking for a script that will take one of the blast+ outformats that includes the positions of gaps and mismatches, and
 .> .create gff with appropriate subfeatures.
 .> .>>>>>
 .> .>>>>> Thanks,
 .> .>>>>>
 .> .>>>>> Jim
 .> .>>>>> =====================================
 .> .>>>>> Jim Hu
 .> .>>>>> Professor
 .> .>>>>> Dept. of Biochemistry and Biophysics
 .> .>>>>> 2128 TAMU
 .> .>>>>> Texas A&M Univ.
 .> .>>>>> College Station, TX 77843-2128
 .> .>>>>> 979-862-4054
 .> .>>>>>
 .> .>>>>>
 .> .>>>>>
 .> .>>>>> _______________________________________________
 .> .>>>>> Bioperl-l mailing list
 .> .>>>>> Bioperl-l at lists.open-bio.org
 .> .>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
 .> .>>>>
 .> .>>>>
 .> .>>>> _______________________________________________
 .> .>>>> Bioperl-l mailing list
 .> .>>>> Bioperl-l at lists.open-bio.org
 .> .>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
 .> .>>>
 .> .>>>
 .> .>>>
 .> .>>> --
 .> .>>> ------------------------------------------------------------------------
 .> .>>> Scott Cain, Ph. D.                                   scott at scottcain dot net
 .> .>>> GMOD Coordinator (http://gmod.org/)                     216-392-3087
 .> .>>> Ontario Institute for Cancer Research
 .> .>>> _______________________________________________
 .> .>>> Bioperl-l mailing list
 .> .>>> Bioperl-l at lists.open-bio.org
 .> .>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
 .> .>>
 .> .>
 .> .
 .> .=====================================
 .> .Jim Hu
 .> .Professor
 .> .Dept. of Biochemistry and Biophysics
 .> .2128 TAMU
 .> .Texas A&M Univ.
 .> .College Station, TX 77843-2128
 .> .979-862-4054
 .> .
 .> .
 .> .
 .> ._______________________________________________
 .> .Bioperl-l mailing list
 .> .Bioperl-l at lists.open-bio.org
 .> .http://lists.open-bio.org/mailman/listinfo/bioperl-l
 .
 .=====================================
 .Jim Hu
 .Professor
 .Dept. of Biochemistry and Biophysics
 .2128 TAMU
 .Texas A&M Univ.
 .College Station, TX 77843-2128
 .979-862-4054
 .





More information about the Bioperl-l mailing list