[EMBOSS] Indexing databases and their updates/new releases

Wed Oct 8 11:22:48 UTC 2003

I am currently tackling similar problems. The solution I've worked out for
now is to use SRS (www.lionbio.co.uk) as a database - it's free for academic
institutions, has very strong indexing tools that can use parallel
processing, and EMBOSS can be configured to work through it. It also has a
very professional web interface that can launch EMBOSS tools. The downside
is that it's a bit complicated to install and administer. I'll be glad to
hear alternative suggestions, tho.

The server I'm running SRS and EMBOSS on is a 4-processor Origin200, 270 MHz,
with 1GB RAM. Indexing (using SRS) the latest GenBank release, with NO
ESTs, GSS, or HTG, took approximately 24 hours, running in parallel on 4
processors.
-- 
Ran Rubinstein 
Dept. of Molecular Biology 
Faculty of Medicine, Hebrew University, Ein Karem 
Tel +972-2-6757906 Fax +972-2-6758992

-----Original Message-----
From: owner-emboss at hgmp.mrc.ac.uk [mailto:owner-emboss at hgmp.mrc.ac.uk] On
Behalf Of Nancy Yu
Sent: Wednesday, October 08, 2003 12:14 PM
To: emboss at embnet.org
Subject: [EMBOSS] Indexing databases and their updates/new releases

Hello,

I have a bunch of questions about the indexing of the databases.  First
of all, what kind of computers are people using to run EMBOSS?  I am
running on an Athlon MP2000+ dual processor with 1GB RAM (on Linux Redhat
9.0).  Running dbiflat for EMBL est*.dat has taken forever (about 5 days
and still not done yet).  Are people using 64-bit systems, cluster
systems, or other high-end computing systems?  Is EMBOSS designed to run
on these technologies?

I'm still confused about the dbiflat indexing process.  I know it
produces 4 files: acnum.hit, acnum.trg, division.lkp, entrynam.idx.  As
I read somewhere in the mail archive, division.lkp stores the location
of the database directory.  Doesn't this mean that if we move our *.dat
file to a different directory, we would have to re-index again?  Hence,
every time we download a new database, a new release, or an update, we
will have to re-index everything?  Also, if dbiflat was interrupted half
way through indexing, is it possible to continue where it left off?
From my experience, it seems like the whole process starts over again.
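
Whatever the answer on resuming turns out to be, one way to avoid
re-indexing when nothing has changed is to record the size and modification
time of each flatfile after a successful index run and compare against that
record before starting the next one. The following is only a minimal sketch
in Python: the function names, the manifest file, and the "*.dat" pattern
are hypothetical, nothing here is part of EMBOSS, and it does not resume an
interrupted dbiflat run.

```python
import json
from pathlib import Path


def snapshot(data_dir, pattern="*.dat"):
    """Map each flatfile name to its [size, mtime] pair."""
    result = {}
    for path in sorted(Path(data_dir).glob(pattern)):
        st = path.stat()
        result[path.name] = [st.st_size, int(st.st_mtime)]
    return result


def needs_reindex(data_dir, manifest_path, pattern="*.dat"):
    """Return (changed, current_snapshot).

    changed is True when no manifest exists yet or when any flatfile
    was added, removed, resized, or touched since the manifest was saved.
    """
    current = snapshot(data_dir, pattern)
    try:
        with open(manifest_path) as fh:
            previous = json.load(fh)
    except FileNotFoundError:
        return True, current
    return previous != current, current


def save_manifest(manifest_path, current):
    """Record the snapshot; call this only after indexing has succeeded."""
    with open(manifest_path, "w") as fh:
        json.dump(current, fh, indent=2)
```

A nightly job could call needs_reindex() first, run the indexer only when it
returns True, and call save_manifest() once the indexer has finished without
error.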

Just wondering, are the included index files for databases like EMBL
(e.g. division.ndx and other *.ndx files) useful at all for EMBOSS, or
are they more for other programs?  Can I somehow use these index files,
i.e. is there a fast way of indexing a database that I missed, or am I
too clueless to know what I'm talking about?

My main concern is that, given how long it takes to index a new release of
large databases like EMBL or GenBank, it would be difficult for me to
keep my local databases up-to-date.

Thanx in advance for any comments and explanations :)

Best Regards,
Nancy Yu