[Bioperl-l] COG software?

Mon, 21 Jan 2002 17:10:50 -0400 (AST)

> You can download the COG proteins from ncbi in the dir /pub/COG/COGs.  So
> you can get all the protein sequences that make up the COG - just
> blastx/fastxy your unfinished genomic sequence against these.  Looks like
> there are about 3700 COGs so one would need to combine these into a single
> db to search against - if you want to retain which COG a protein is from
> you should probably append/prepend the name of the COG to it.
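
The header-tagging step suggested above could be sketched roughly like
this in Python.  This is only an illustration under assumptions: the
function names, the "|" separator and the toy records are all made up
here, and the actual layout of the files under /pub/COG/COGs is not
assumed (the sketch just takes FASTA lines grouped per COG).

```python
def tag_fasta_headers(fasta_lines, cog_id):
    """Yield FASTA lines with cog_id prepended to each record header."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # ">protA desc" becomes ">COG0001|protA desc"
            # (the "|" separator is an arbitrary choice for this sketch)
            yield f">{cog_id}|{line[1:].lstrip()}"
        else:
            yield line


def combine(cog_to_lines):
    """Merge per-COG FASTA records into one list of tagged lines."""
    combined = []
    for cog_id, lines in cog_to_lines.items():
        combined.extend(tag_fasta_headers(lines, cog_id))
    return combined


# Placeholder records, not real COG members or sequences.
example = {
    "COG0001": [">protA hypothetical", "MKT", ">protB", "MAL"],
    "COG0002": [">protC", "MSS"],
}
combined = combine(example)
print("\n".join(combined))
```

With the COG ID leading each header, the COG of any hit can be read
straight off the subject name in the search report.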

Provided you have the computational power...
What I've found very useful is taking the multiple sequence alignments
(MSAs) provided by NCBI and creating an HMM library from them.  The COG
mapping file looks like:

[H] COG0001 Glutamate-1-semialdehyde aminotransferase
  A   AF1241
  B   BS_gsaB BS_hemL BH0943 BH2941 BH3043
  C   sll0017

So assigning the COG (with its functional category and name) to an MSA/HMM
is easy.
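
As a rough illustration of that assignment step, here is a minimal Python
sketch that parses records in the format shown above and builds a
description line for each COG.  The function names and the dictionary
layout are hypothetical, and it assumes every record follows the sample
exactly (a bracketed category, the COG ID, the name, then indented lines
of a genome code followed by protein names).

```python
import re

# Header lines look like "[H] COG0001 Glutamate-1-semialdehyde aminotransferase".
HEADER = re.compile(r"^\[(\w+)\]\s+(COG\d+)\s+(.*)$")


def parse_cog_mapping(lines):
    """Return {cog_id: {"category", "name", "members"}} from mapping lines."""
    cogs = {}
    current = None
    for line in lines:
        line = line.rstrip()
        if not line.strip():
            continue
        match = HEADER.match(line)
        if match:
            category, cog_id, name = match.groups()
            current = {"category": category, "name": name.strip(), "members": {}}
            cogs[cog_id] = current
        elif current is not None:
            # Member line: a genome code followed by that genome's proteins.
            genome, *proteins = line.split()
            current["members"].setdefault(genome, []).extend(proteins)
    return cogs


def hmm_description(cog_id, cogs):
    """Build a label of the form 'COG0001 [H] <name>' for an MSA/HMM."""
    entry = cogs[cog_id]
    return f"{cog_id} [{entry['category']}] {entry['name']}"


SAMPLE = """\
[H] COG0001 Glutamate-1-semialdehyde aminotransferase
  A   AF1241
  B   BS_gsaB BS_hemL BH0943 BH2941 BH3043
  C   sll0017
"""

cogs = parse_cog_mapping(SAMPLE.splitlines())
label = hmm_description("COG0001", cogs)
print(label)
```

The label puts the COG ID first, then the functional category and name,
which is the same layout as the subject lines in the search output.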

Below is the output for the COGnitor example protein, run against the HMM
library instead (on the Canadian Bioinformatics Resource GeneMatcher).
Nice and clean, I think.  The hit is more significant than the one from
COGnitor, of course, because the similarity is against the COG as a whole,
not an individual member of it.  If these HMMs are useful to people, they
could probably be put on an FTP site for download, once I get the latest
version of the COGs...

 BTK 4.1.0-79/79 2001-08-23 (Fdf Client 1.442)

Copyright 2001 Paracel, Inc

Query=  |4884278|hypothetical protein [Homo sapiens] 
        (323 letters)

Database:  
           2,885 sequences; 1,038,652 total letters.
Searching.......................................................done.

                                                                        E
Sequences producing significant alignments:                    Score Value

COG1262 [S] Uncharacterized BCR                                   40 4e-10

>COG1262 [S] Uncharacterized BCR
           Length = 524

 Score =  0.0 bits (40), Expect = 4e-10
 Identities = 27/72 (37%), Positives = 46/72 (63%), Gaps = 10/72 (13%)

Query: 28  ATSMVQLQGGR-FLMGTNSPD--------SRDGEGP-VREATVKPFAIDIFPVTNKDFRD 77
           AT MV++ GG  F MG++  +        S D E+P +   +V++FA+D+ PVTN++F++
Sbjct: 198 ATEMVLIPGGSGFVMGSTEAEIGFAARGGSQDDERPLEHVVFVRAFALDKYPVTNAQFAE 257

Query: 78  FVREKKYRTEAE 89
           FV   +Y+T A+
Sbjct: 258 FVEATGYTTKAA 269

 Score =  0.0 bits (26), Expect = 3e-06
 Identities = 18/43 (41%), Positives = 25/43 (57%), Gaps = 15/43 (34%)

Query: 151 PVNAFPA---------------QNNYGLYDLLGNVWEWTASPY 178
           PV ++P+                N +GLYD+LGNVWEWTA++Y
Sbjct: 415 PVGSYPPEAANIQSTAPVAEFGANALGLYDMLGNVWEWTADEY 457

==========================================
Paul Gordon
Research Associate
Bioinformatics Lab, University of Calgary