[Bioperl-l] COG software?
Fernan Aguero
fernan@iib.unsam.edu.ar
Mon, 21 Jan 2002 17:27:29 -0300
+----[ Rick Westerman (westerman@purdue.edu) wrote about "[Bioperl-l] COG software?":
|
| I asked about COG searching a week or so ago in bionet.software but did
| not receive a good reply. Since it looks like I will end up writing the
| search unless I find something pre-done, I will repeat my question in this
| forum.
|
| Does anyone know of a way to send a bunch of potentially incomplete
| sequences (i.e., those from a partially completed genome) through NCBI's
| COG database? The NCBI 'cognitor' page seems to allow only a single
| sequence to be analyzed at a time. There are standalone programs
| (dignitor and xugnitor) available via FTP in /pub/tatusov/dignitor. These
| may do what I want, although I am afraid that the readme is only a cryptic
| 26-line document and the 1500+ lines of C code do not contain
| a single comment section. Not even a "written by" or "copyright"
| section. Makes me shake my head in disbelief. :-(
|
Rick,

I have been through this too, though it was some time ago, and I'm afraid
things have changed since. I will share my experience and
also suggest some workarounds.

One alternative, clearly, is to somehow 'hack' the cognitor page
at NCBI and run your requests remotely from a script. Don't be
too aggressive with the NCBI servers or they may ban you. This can be
done using Perl. I guess this is how the remote BLASTs
work under BioPerl (am I right?). If this is your choice, I'm sure
other people on this list can help you.
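
To make the "don't be too aggressive" point concrete, here is a minimal Python sketch of a throttled batch loop. It deliberately contains no NCBI URL or form details: `submit` is a hypothetical function you would have to write yourself, and `delay_seconds` is an arbitrary value, not a documented NCBI limit.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def run_batch(
    sequences: Iterable[Tuple[str, str]],
    submit: Callable[[str], object],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Submit sequences one at a time, pausing between requests.

    `submit` is a caller-supplied (hypothetical) function that sends one
    sequence to the remote service and returns whatever it answers.
    `sleep` is injectable so the loop can be tried without real waiting.
    """
    results: Dict[str, object] = {}
    for index, (name, sequence) in enumerate(sequences):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results[name] = submit(sequence)
    return results
```

You would call it with a list of (name, sequence) tuples and your own submit function, then collect the answers by sequence name.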

Now on to dignitor, and my own experience with it (yes, I got it
working).

As some obscure page at NCBI says, you can use dignitor to run
cognitor in batch mode. So yes, you can download dignitor and use it,
but again, documentation is scarce, if not lacking altogether.
You also have to download the database of COGs, and a file named
cogan, if I remember well (don't rush to download them yet; read on).
Since I was then totally impaired in C (and haven't improved since), I had
the C code read by a colleague, and we finally understood what dignitor
needed to be passed on the command line: a file of pairs and the
location of the cogan file.

The file of pairs is derived from a comparison of your sequences
against the protein sequences of the genomes that are included in the
COG database. The format of the file is simple, something like:

    accession score expect qstart..qend sstart..send

where qstart and qend refer to the query, and sstart and send to the subject.
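
As a rough illustration, a file in that layout could be read with something like the Python sketch below. It assumes whitespace-separated fields in exactly the order given above and one HSP per line; the real dignitor input may differ in details I no longer remember. The names `Pair` and `parse_pairs` are made up for this example.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Pair(NamedTuple):
    accession: str   # accession of the subject (COG database) protein
    score: float
    expect: float
    qstart: int
    qend: int
    sstart: int
    send: int


def parse_range(text: str) -> Tuple[int, int]:
    """Turn 'start..end' into a pair of integers."""
    start, end = text.split("..")
    return int(start), int(end)


def parse_pairs(lines: Iterable[str]) -> Iterator[Pair]:
    """Yield one Pair per non-empty line (assumed layout, see above)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        accession, score, expect, qrange, srange = line.split()
        # BLAST reports sometimes print expect values as 'e-100';
        # prepend the missing mantissa so float() accepts them.
        if expect.startswith("e"):
            expect = "1" + expect
        qstart, qend = parse_range(qrange)
        sstart, send = parse_range(srange)
        yield Pair(accession, float(score), float(expect),
                   qstart, qend, sstart, send)
```

Since it accepts any iterable of lines, an open file handle can be passed in directly.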

Once you had all this, dignitor worked quite well. All it does is look
through the file of pairs and check whether the sequences are really
homologous by comparing the positions of the HSPs. If they are, bingo!
The subject sequence already points to a COG, so you can assume that
your query also belongs to that COG. I don't know whether dignitor also
checks your query against other members of the COG; maybe not.
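
I don't know the exact rule dignitor applies to the HSP positions. One plausible stand-in is to ask whether the HSPs against a given subject cover enough of the query. The Python sketch below does only that; the 0.5 threshold and the function names are assumptions of mine, and you need the query length from somewhere else since the file of pairs does not carry it.

```python
from typing import Iterable, Optional, Tuple


def merged_length(intervals: Iterable[Tuple[int, int]]) -> int:
    """Number of positions covered by 1-based, inclusive intervals.

    Overlapping or directly adjacent intervals are merged first, so a
    position hit by several HSPs is counted only once.
    """
    total = 0
    cur_start: Optional[int] = None
    cur_end: Optional[int] = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the previous run.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def covers_query(
    query_hsps: Iterable[Tuple[int, int]],
    query_length: int,
    min_fraction: float = 0.5,  # arbitrary threshold, not dignitor's
) -> bool:
    """True if the HSPs cover at least min_fraction of the query."""
    return merged_length(query_hsps) / query_length >= min_fraction
```

A stricter version could apply the same test to the subject coordinates as well, so that a short domain hit inside a long protein is not mistaken for full-length similarity.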

The bad news is that after a while, when I came back to download a
more recent version of the COG database, things had changed. The
database and other files were in a new format, and the old cogan file
had not been updated. I guess dignitor was not updated to recognize the new
formats. At least it didn't work for me, though I didn't put much
work into it.

I haven't looked into it again for some time, so I don't know if this
has been fixed. (I don't know what xugnitor does ...)

However, it is not so difficult to recreate this functionality. You
only need a 'GenBank accession' -> 'COG number' mapping
to start with (I think the COG database still provides this).
Then you'll need to compare your sequences to the sequences in the COG
database using blastp, filter the results somehow to get a cleaner
list of HSPs (perhaps as a file of pairs too), and then write a
decision-making routine that decides on putative 'homology' (or should I
say orthology?) based on the score and the start-end positions of the
HSPs.
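
The decision step could start out as simple as the Python sketch below. It is not the rule dignitor or cognitor uses, just one way to combine the accession -> COG mapping with filtered blastp hits: drop weak hits, tally the rest per COG, and pick the COG with the highest summed score. `assign_cog`, `max_expect` and `min_hits` are hypothetical names and cutoffs chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def assign_cog(
    hits: Iterable[Tuple[str, float, float]],
    accession_to_cog: Dict[str, str],
    max_expect: float = 1e-5,  # arbitrary cutoff
    min_hits: int = 2,         # arbitrary: supporting hits required
) -> Optional[str]:
    """Pick a COG for one query from its (accession, score, expect) hits.

    Hits to proteins without a COG, or with expect above max_expect,
    are ignored. Returns None when no COG has enough supporting hits.
    """
    score_by_cog: Dict[str, float] = defaultdict(float)
    count_by_cog: Dict[str, int] = defaultdict(int)
    for accession, score, expect in hits:
        cog = accession_to_cog.get(accession)
        if cog is None or expect > max_expect:
            continue
        score_by_cog[cog] += score
        count_by_cog[cog] += 1
    candidates = [c for c, n in count_by_cog.items() if n >= min_hits]
    if not candidates:
        return None
    # Among sufficiently supported COGs, prefer the highest total score.
    return max(candidates, key=lambda c: score_by_cog[c])
```

The position-based check on the HSPs would go before this step, so that only hits you already trust are counted.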

Perhaps you can get someone versed in C to help you port a
modified dignitor to Perl or some other language familiar to you?

The authors of the software (and the maintainers of all the COG-related
stuff) are not very responsive. I have emailed them in the
past without success. I would not count on them for help.

Good luck,

Fernan
| Any further help would be appreciated.
|
| Thanks,
|
| -- Rick
|
| Rick Westerman
| westerman@purdue.edu
|
+----]
--
| F e r n a n A g u e r o | B i o i n f o r m a t i c s |
| fernan@iib.unsam.edu.ar | genoma.unsam.edu.ar |