[EMBOSS] Indexing databases and their updates/new releases

Alan Bleasby ableasby at hgmp.mrc.ac.uk
Wed Oct 8 10:46:26 UTC 2003


The current indexing system is a little long in the tooth. I am
currently testing/configuring its replacement on the various
platforms. The new indexing system will not be available in the
next release but should be out within the next few months.

It is intended that the next release of EMBOSS will, by default,
compile with large file support on all platforms (although the
old indexing system cannot support files larger than 2 GB). So,
2.8.0 will have 64-bit (aj)long integers.
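
As a rough illustration of why integer width matters here (this is not
EMBOSS code, and the helper names pack_offset and fits_32bit are made up
for the example): a byte offset stored as a signed 32-bit integer tops out
at 2,147,483,647, just under 2 GiB, whereas a signed 64-bit integer has
ample room. The sketch below uses Python's struct module to show that
limit, assuming offsets are stored as fixed-width signed integers.

```python
import struct

INT32_MAX = 2**31 - 1  # largest signed 32-bit value: 2,147,483,647

def pack_offset(offset, wide):
    """Pack a byte offset as a signed 64-bit ("q") or 32-bit ("i") integer.

    Hypothetical helper for illustration only; little-endian, standard sizes.
    """
    return struct.pack("<q" if wide else "<i", offset)

def fits_32bit(offset):
    """Return True if the offset can be stored in a signed 32-bit field."""
    try:
        pack_offset(offset, wide=False)
        return True
    except struct.error:
        # struct refuses values outside the 32-bit signed range
        return False

three_gib = 3 * 1024**3  # an offset inside a 3 GiB flat file

print(fits_32bit(INT32_MAX))   # the last offset a 32-bit field can hold
print(fits_32bit(three_gib))   # beyond the 32-bit range
# A 64-bit field round-trips the same offset without loss.
print(struct.unpack("<q", pack_offset(three_gib, wide=True))[0] == three_gib)
```

In C, the analogous switch is usually made at compile time (for example
with large file support enabled so that file offsets become 64-bit); the
Python version above only demonstrates the arithmetic limit.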

The new indexing system is a B+ tree structure. It can therefore
be dynamically updated (e.g. with EMBL updates) and can handle
both large files and duplicate IDs. It also does not need any
sorting operations (the major cause of slowness in the old
indexing system).
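
To make those properties concrete, here is a minimal in-memory B+ tree
sketch. It is not the EMBOSS implementation: the class names, the
max_keys parameter, the made-up IDs and the choice to keep a list of
offsets per ID are all assumptions for illustration. It shows entries
being inserted in arbitrary order with no sort pass, duplicate IDs being
kept, and the linked leaves yielding IDs in order.

```python
import bisect

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # internal nodes: child Node objects
        self.values = []    # leaves: one list of offsets per key
        self.next = None    # leaves: link to the right-hand sibling

class BPlusTree:
    def __init__(self, max_keys=4):
        self.max_keys = max_keys
        self.root = Node(leaf=True)

    def lookup(self, key):
        """Return every offset stored under key (empty list if absent)."""
        node = self.root
        while not node.leaf:
            node = node.children[bisect.bisect_right(node.keys, key)]
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return list(node.values[i])
        return []

    def insert(self, key, offset):
        split = self._insert(self.root, key, offset)
        if split is not None:
            # The root itself split: grow the tree by one level.
            sep, right = split
            new_root = Node(leaf=False)
            new_root.keys = [sep]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key, offset):
        if node.leaf:
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                node.values[i].append(offset)  # duplicate ID: keep both
                return None
            node.keys.insert(i, key)
            node.values.insert(i, [offset])
        else:
            i = bisect.bisect_right(node.keys, key)
            split = self._insert(node.children[i], key, offset)
            if split is None:
                return None
            sep, right = split
            node.keys.insert(i, sep)
            node.children.insert(i + 1, right)
        if len(node.keys) <= self.max_keys:
            return None
        return self._split(node)

    def _split(self, node):
        """Split an over-full node; return (separator key, new right node)."""
        mid = len(node.keys) // 2
        right = Node(leaf=node.leaf)
        if node.leaf:
            right.keys, node.keys = node.keys[mid:], node.keys[:mid]
            right.values, node.values = node.values[mid:], node.values[:mid]
            right.next, node.next = node.next, right
            return right.keys[0], right  # separator is copied up
        sep = node.keys[mid]             # separator is moved up
        right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
        right.children = node.children[mid + 1:]
        node.children = node.children[:mid + 1]
        return sep, right

    def items(self):
        """Walk the linked leaves, yielding (key, offsets) in key order."""
        node = self.root
        while not node.leaf:
            node = node.children[0]
        while node is not None:
            yield from zip(node.keys, node.values)
            node = node.next

tree = BPlusTree(max_keys=4)
for i in range(101):                 # made-up IDs, deliberately out of order
    n = (i * 37) % 101
    tree.insert(f"ID{n:04d}", n * 10)
tree.insert("ID0005", 999999)        # an update adding a duplicate ID
```

A real on-disk index would store nodes in fixed-size pages and cache
them, but the insertion and splitting logic has the same shape.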

Timing for the new system to index the latest EMBL release on a
Linux system with the same spec as yours is around 9 hours
(8 hours of which is just reading the database over NFS).

EMBOSS is not designed for parallel processing although we have
identified the areas of code that would need attention should
we go down that route in the future.

It is not (in general) possible to restart the old indexing system
where it left off. That is a feature that could be implemented in the
new indexing system (but isn't yet).

I'm sure others may answer anything I've missed.

HTH

Alan Bleasby
HGMP


