[Biojava-l] Added SequenceDB support for EMBL CD-ROM indices

Keith James kdj@sanger.ac.uk
02 Aug 2001 15:47:04 +0100


I've added classes and tests to support EMBL CD-ROM format
indices. For those unfamiliar with them, they are simple binary
indices used by EMBOSS and Staden software to index sequence files in
Fasta, EMBL and GCG formats.

The package org.biojava.bio.seq.db.emblcd contains readers for the four file
types in the index and a random access class for the entrynam.idx
(sequence ID) file.

EmblCDROMIndexStore is an implementation of IndexStore which allows an
IndexedSequenceDB to be created directly from an EMBOSS-indexed
database (e.g. the whole of EMBL). Unlike TabIndexStore, which reads all
the IDs into memory, EmblCDROMIndexStore uses a binary search via a
pointer into the file to find IDs. Only if the interface's getIDs()
method is called is the whole index scanned.
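
To illustrate the general idea of searching a sorted, fixed-width index
on disk without loading it into memory, here is a minimal sketch. It is
not the EmblCDROMIndexStore code: the class name FixedRecordSearch, the
record layout (a 10-byte space-padded ID followed by an 8-byte offset)
and the header length are all assumptions made up for the example, and
the real entrynam.idx layout differs.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of BioJava.
class FixedRecordSearch {
    static final int ID_LEN = 10;             // assumed ID field width
    static final int RECORD_LEN = ID_LEN + 8; // ID plus a long file offset

    // Writes a toy index: headerLen zero bytes, then one record per ID.
    // The IDs must already be sorted in String.compareTo order.
    static void writeIndex(RandomAccessFile raf, int headerLen,
                           String[] sortedIds, long[] offsets) throws IOException {
        raf.setLength(0);
        raf.seek(0);
        raf.write(new byte[headerLen]);
        for (int i = 0; i < sortedIds.length; i++) {
            byte[] id = new byte[ID_LEN];
            Arrays.fill(id, (byte) ' ');
            byte[] src = sortedIds[i].getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(src, 0, id, 0, src.length);
            raf.write(id);
            raf.writeLong(offsets[i]);
        }
    }

    // Binary search by seeking to the middle record each time.
    // Returns the stored offset, or -1 if the ID is not in the index.
    static long find(RandomAccessFile raf, int headerLen, String id) throws IOException {
        long lo = 0;
        long hi = (raf.length() - headerLen) / RECORD_LEN - 1;
        byte[] buf = new byte[ID_LEN];
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            raf.seek(headerLen + mid * RECORD_LEN);
            raf.readFully(buf);
            int cmp = new String(buf, StandardCharsets.US_ASCII).trim().compareTo(id);
            if (cmp == 0) {
                return raf.readLong(); // offset field follows the ID
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".idx");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            writeIndex(raf, 16,
                       new String[] {"AB000001", "AB000002", "X12345"},
                       new long[] {0L, 512L, 1024L});
            System.out.println(find(raf, 16, "AB000002")); // prints 512
            System.out.println(find(raf, 16, "NOSUCHID")); // prints -1
        }
    }
}
```

Each lookup touches only about log2(N) records, which is why this kind
of search stays cheap even for an index covering a very large database.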

I've tried it by getting Sequences from Fasta-formatted databases and
it seems fine. There are unit tests, but no integration tests for
Sequence fetching yet (i.e. for IndexedSequenceDB).

In build.xml I added the build.src.tests directory to the JUnit
classpath to allow the test data files to be found using
getResource(). If the test data should be somewhere else (other than
in with the test source) I'm open to moving it.
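
For anyone unfamiliar with the getResource() approach, here is a rough
sketch of how a test can locate a data file that sits on the classpath
beside the test class. The class name TestDataLocator and the helper
locate() are hypothetical and are not taken from the actual tests;
Class.getResource() itself is the standard JDK call and returns null
when the file is not on the classpath.

```java
import java.io.FileNotFoundException;
import java.net.URL;

// Hypothetical helper; not part of the BioJava test suite.
class TestDataLocator {
    // Resolves name relative to the package of the anchor class and
    // fails loudly if the file is not visible on the classpath.
    static URL locate(Class<?> anchor, String name) throws FileNotFoundException {
        URL url = anchor.getResource(name);
        if (url == null) {
            throw new FileNotFoundException(
                name + " not found on the classpath beside " + anchor.getName());
        }
        return url;
    }

    public static void main(String[] args) {
        try {
            // Only succeeds if an entrynam.idx file is deployed next to this class.
            System.out.println(locate(TestDataLocator.class, "entrynam.idx"));
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is why the directory holding the data files has to be on the JUnit
classpath: without it the lookup simply returns null.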

Hopefully this will allow access to the big databases with reasonable
performance.

cheers,

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA