[Bioperl-l] HMMER3 to GFF3

Doug Hoen douglas.hoen at gmail.com
Thu Aug 12 05:59:37 UTC 2010


Hi,

 I am trying to convert HMMER3 (hmmscan) output files into GFF3 files.
Based on previous advice (see the thread, "How to store results of
searches of translated DNA in SeqFeature::Store database of the
original DNA?"), I have installed bioperl-live for its new HMMER3
parsing capabilities (in SearchIO) and am trying to use
bp_search2gff.pl to do the file conversion.

The hmmscan was done on translated chromosome sequences with conserved
domain models. I want to get the GFF 'start' and 'end' columns to be
based on these coordinates, not those of the models. To do this (with
my files), it seems I need to use the option "--type hit". However,
this changes the "Target" sequence name from the model name to
chromosome name, and the model name does not appear anywhere in the
output (see below).

Could someone please confirm whether the results are incorrect and, if
so, perhaps suggest a fix? It may well be that this problem is due to
the unusual way I am using hmmscan, rather than a problem with HMMER3
parsing...?

Many thanks,
-- Doug


========================================================


Here's what it looks like if I do *not* use the "--type hit" option.
(RVT_2 is a conserved domain name. I need this in the output.)


COMMAND:
------------------
bp_search2gff.pl -i ../chr1-tesigsv2.hmmscan -o chr1-tesigsv2-hmmscan-
original-locations-v2.gff3 --format hmmer3 --source HMMER3 --version 3
--component


OUTPUT:
------------------
==> chr1-tesigsv2-hmmscan-original-locations-v2.gff3 <==
##gff-version 3
Chr1_1	chromosome	Component	1	10142557	.	.	1	sequence=Chr1_1
Chr1_1	HMMER3	similarity	1	245	307.3	.	0	Target=Sequence:RVT_2 1898330
1898579
Chr1_1	HMMER3	similarity	1	244	329.5	.	0	Target=Sequence:RVT_2 2573551
2573796
Chr1_1	HMMER3	similarity	1	245	308.8	.	0	Target=Sequence:RVT_2 3159685
3159930
Chr1_1	HMMER3	similarity	1	102	108.2	.	0	Target=Sequence:RVT_2 3438684
3438791
Chr1_1	HMMER3	similarity	2	245	277.2	.	0	Target=Sequence:RVT_2 3566642
3566891
Chr1_1	HMMER3	similarity	13	213	251.4	.	0	Target=Sequence:RVT_2
4251160 4251373
Chr1_1	HMMER3	similarity	1	244	310.6	.	0	Target=Sequence:RVT_2 4252791
4253036
Chr1_1	HMMER3	similarity	6	99	94.2	.	0	Target=Sequence:RVT_2 4271555
4271653


========================================================


And here's what it looks like if I *do* use the "--type hit" option.
The coordinates look good but the model name has disappeared (and the
Target=Sequence seems wrong).


COMMAND:
------------------
bp_search2gff.pl -i ../chr1-tesigsv2.hmmscan -o chr1-tesigsv2-hmmscan-
original-locations-v3.gff3 --format hmmer3 --type hit --source HMMER3
--version 3 --component


OUTPUT:
------------------
==> chr1-tesigsv2-hmmscan-original-locations-v3.gff3 <==
##gff-version 3
RVT_2	HMMER3	similarity	1898330	1898579	307.3	.	0
Target=Sequence:Chr1_1 1 245
RVT_2	HMMER3	similarity	2573551	2573796	329.5	.	0
Target=Sequence:Chr1_1 1 244
RVT_2	HMMER3	similarity	3159685	3159930	308.8	.	0
Target=Sequence:Chr1_1 1 245
RVT_2	HMMER3	similarity	3438684	3438791	108.2	.	0
Target=Sequence:Chr1_1 1 102
RVT_2	HMMER3	similarity	3566642	3566891	277.2	.	0
Target=Sequence:Chr1_1 2 245
RVT_2	HMMER3	similarity	4251160	4251373	251.4	.	0
Target=Sequence:Chr1_1 13 213
RVT_2	HMMER3	similarity	4252791	4253036	310.6	.	0
Target=Sequence:Chr1_1 1 244
RVT_2	HMMER3	similarity	4271555	4271653	94.2	.	0
Target=Sequence:Chr1_1 6 99
RVT_2	HMMER3	similarity	4481232	4481477	281.5	.	0
Target=Sequence:Chr1_1 2 245


========================================================


And here's what the input HMMER3 result file looks like:


==> ../chr1-tesigsv2.hmmscan <==
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.0rc1 (February 2010); http://hmmer.org/
# Copyright (C) 2010 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- -
# query sequence file:             [...]/whole_chromosomes/translated/
chr1.pep
# target HMM database:             [...]/signatures/Pfam-A.hmm
# output directed to file:         chr1-tesigsv2.hmmscan
# model-specific thresholding:     TC cutoffs
# Max sensitivity mode:            on [all heuristic filters off]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- -

Query:       Chr1_1  [L=10142557]
Description: CHROMOSOME dumped from ADB: Jun/20/09 14:53; last
updated: 2009-02-02
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N
Model           Description
    ------- ------ -----    ------- ------ -----   ---- --
--------        -----------
          0 3971.3  17.7   2.6e-101  329.5   0.6   19.4 17
RVT_2           Reverse transcriptase (RNA-dependent DNA pol
          0 3040.7  23.0     1e-206  678.6   0.1   12.2 10
ATHILA          ATHILA ORF-1 family
          0 1681.9  79.1    1.9e-46  149.9   0.4   28.0 21
RVT_1           Reverse transcriptase (RNA-dependent DNA pol
          0 1446.9  27.4    3.6e-95  309.1   0.2    7.6  5
Transposase_21  Transposase family tnp2
          0 1168.4  50.3    1.4e-29   94.4   0.3   21.5 18
rve             Integrase core domain
   9.1e-300  960.0  69.0    3.1e-20   64.0   0.0   28.8 20
Retrotrans_gag  Retrotransposon gag protein
   1.5e-180  577.0  31.6    1.6e-29   93.1   1.5    9.5  8
Transposase_23  TNP1/EN/SPM transposase
   4.4e-143  456.9  82.8    4.8e-18   56.4   0.1   12.9 11
MuDR            MuDR family transposase
   3.8e-116  371.4  19.6    1.2e-18   58.9   0.0   13.7  7
MULE            MULE transposase domain
   7.1e-106  344.1   5.6    2.7e-97  316.0   0.0    3.6  1
Plant_tran      Plant transposon protein
    9.2e-85  275.4  22.9    5.4e-60  194.4   0.3    6.4  3
Peptidase_C48   Ulp1 protease family, C-terminal catalytic d
    1.8e-77  249.8  24.8    4.4e-28   89.8   0.1   10.8  3
Transposase_24  Plant transposase (Ptta/En/Spm family)
    2.8e-47  150.1   1.2    5.5e-23   72.3   0.2    3.7  2
hATC            hAT family dimerisation domain
    5.7e-28   89.4   3.6    4.7e-13   41.1   0.0    6.5  1
RVP_2           Retroviral aspartyl protease
      1e-16   53.3   0.0    4.4e-07   22.1   0.0    6.8  1
RnaseH          RNase H
    1.5e-08   25.3   2.4    0.00016   12.1   0.0    4.9  0
Transposase_mut Transposase, Mutator family


Domain annotation for each model (and alignments):
>> RVT_2  Reverse transcriptase (RNA-dependent DNA polymerase)
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom
ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    -------
-------    ------- -------    ----
   1 !  307.3   0.0   5.3e-95   1.5e-94       1     245 [. 1898330
1898578 .. 1898330 1898579 .. 0.99
   2 !  329.5   0.6  8.9e-102  2.6e-101       1     244 [. 2573551
2573794 .. 2573551 2573796 .. 0.99
   3 !  308.8   0.0   1.8e-95   5.2e-95       1     245 [. 3159685
3159929 .. 3159685 3159930 .. 0.99
   4 !  108.2   0.1   3.4e-34   9.7e-34       1     102 [. 3438684
3438785 .. 3438684 3438791 .. 0.96
   5 !  277.2   0.0   8.1e-86   2.3e-85       2     245 .. 3566643
3566890 .. 3566642 3566891 .. 0.99
   6 !  251.4   0.0   6.2e-78   1.8e-77      13     213 .. 4251164
4251364 .. 4251160 4251373 .. 0.97
   7 !  310.6   0.0   5.1e-96   1.5e-95       1     244 [. 4252791
4253034 .. 4252791 4253036 .. 0.99
   8 !   94.2   0.1   6.1e-30   1.8e-29       6      99 .. 4271560
4271653 .. 4271555 4271653 .. 0.97
   9 !  281.5   0.9   3.9e-87   1.1e-86       2     245 .. 4481233
4481476 .. 4481232 4481477 .. 0.98
  10 !  248.2   0.0   5.9e-77   1.7e-76       1     190 [. 4521040
4521233 .. 4521040 4521237 .. 0.97
  11 !  314.6   0.1   3.2e-97   9.2e-97       1     244 [. 4652456
4652702 .. 4652456 4652704 .. 0.98
  12 !   40.7   0.0   1.3e-13   3.7e-13       2      92 .. 5219607
5219697 .. 5219606 5219701 .. 0.90
  13 !  221.0   0.0   1.2e-68   3.4e-68       2     245 .. 5241015
5241258 .. 5241014 5241259 .. 0.95
  14 !   81.2   0.0   5.6e-26   1.6e-25       2     115 .. 5501957
5502070 .. 5501956 5502080 .. 0.92
  15 !  272.4   0.0   2.3e-84   6.7e-84      30     245 .. 6483057
6483271 .. 6483050 6483272 .. 0.98
  16 !  178.5   0.0   1.2e-55   3.3e-55      81     244 .. 7250563
7250726 .. 7250552 7250728 .. 0.96
  17 !  313.7   0.0   5.9e-97   1.7e-96       2     245 .. 7707124
7707367 .. 7707123 7707368 .. 0.99

  Alignments for each domain:
  == domain 1    score: 307.3 bits;  conditional E-value: 5.3e-95
   RVT_2       1
nktwelvelpkgkkviglkWvfklKlnedgeierykARlVakGftqkegidyeetfspvvklesirlllalaaekkleleqlDvktaFLngelee
95
                 n tw +++lp gkk++g+kWv+k+Kln+dg++erykARlVakG+tq+eg+dy
+tfspv+kl++++ll+a+aa+k+++l+qlD+++aFLng+l+e
  Chr1_1 1898330
NGTWVVCSLPVGKKAVGCKWVYKIKLNADGSLERYKARLVAKGYTQTEGLDYVDTFSPVAKLTTVKLLIAVAAAKGWSLSQLDISNAFLNGSLDE
1898424
 
68*********************************************************************************************
PP

   RVT_2      96
evYvkqpeGfedkkk....enkvckLkkslYgLkqapraWyeklsevllklgfkkseadkclfvkkkeeeliivllYVDDlliagsskelieelk
186
                 e+Y++ p+G++ ++     +n vc+LkkslYgLkqa+r+Wy k+se l++lgf+
+s+ d++lf++k++++ ++vl+YVDD++ia+s +++ e l
  Chr1_1 1898425
EIYMTLPPGYSPRQGdsfpPNAVCRLKKSLYGLKQASRQWYLKFSESLKALGFTQSSGDHTLFTRKSKNSYMAVLVYVDDIIIASSCDRETELLR
1898519
 
***********998889999***************************************************************************
PP

   RVT_2     187
eeLkkefemkdlgelkyfLgleierkeegillsqekyvkkllkkfkmedakpvstplea 245
                 ++L+++ +++dlg+l+yfLglei+r+++gi+++q+ky+ +ll+++++  +k++s
+p+e+
  Chr1_1 1898520
DALQRSSKLRDLGTLRYFLGLEIARNTDGISICQRKYTLELLAETGLLGCKSSSVPMEP 1898578
 
*********************************************************97 PP

  == domain 2    score: 329.5 bits;  conditional E-value: 8.9e-102
   RVT_2       1
nktwelvelpkgkkviglkWvfklKlnedgeierykARlVakGftqkegidyeetfspvvklesirlllalaaekkleleqlDvktaFLngelee
95
                 n+twel++lp+g+k+ig+kWv+k K+n++ge+erykARlVakG++q++gidy+e
+f+pv++le++rl+++laa++k++++q+D k aFLng++ee
  Chr1_1 2573551
NDTWELTSLPNGHKAIGVKWVYKAKKNSKGEVERYKARLVAKGYSQRAGIDYDEVFAPVARLETVRLIISLAAQNKWKIHQMDFKLAFLNGDFEE
2573645
 
79*********************************************************************************************
PP

   RVT_2      96
evYvkqpeGfedkkkenkvckLkkslYgLkqapraWyeklsevllklgfkkseadkclfvkkkeeeliivllYVDDlliagsskelieelkeeLk
190
                 evY++qp+G+ +k++e+kv++Lkk+lYgLkqapraW++++++++++++f k+ +
+++l++k ++e+++i +lYVDDl+++g++ ++ ee+k+e++
  Chr1_1 2573646
EVYIEQPQGYIVKGEEDKVLRLKKALYGLKQAPRAWNTRIDKYFKEKDFIKCPYEHALYIKIQKEDILIACLYVDDLIFTGNNPSMFEEFKKEMT
2573740
 
***********************************************************************************************
PP

   RVT_2     191
kefemkdlgelkyfLgleierkeegillsqekyvkkllkkfkmedakpvstple 244
                 kefem+d+g ++y+Lg+e+++++++i+++qe y+k++lkkfkm+d++pv tp
+e
  Chr1_1 2573741
KEFEMTDIGLMSYYLGIEVKQEDNRIFITQEGYAKEVLKKFKMDDSNPVCTPME 2573794
 
****************************************************97 PP



More information about the Bioperl-l mailing list