[Biojava-l] Added EMBL CDROM format index readers

Keith James kdj@sanger.ac.uk
09 Jul 2001 17:20:30 +0100


I found that the current indexing scheme didn't scale well to
massive databases, so I've been looking into making IndexedSequenceDB
instances using the EMBL CD-ROM format binary indices.

These are pre-alpha at the moment, but I'm hoping to get them tested
and working soon.


org.biojava.bio.seq.db.emblcd.EmblCDROMIndexReader

/**
 * <p><code>EmblCDROMIndexReader</code> reads EMBL CD-ROM format
 * indices from an underlying <code>InputStream</code>. This format is
 * used by the EMBOSS package for database indexing (see programs
 * dbiblast, dbifasta, dbiflat and dbigcg). Indexing produces four
 * binary files with a simple format:</p>
 * 
 * <ul>
 *   <li>division.lkp : master index</li>
 *   <li>entrynam.idx : sequence ID index</li>
 *   <li>   acnum.trg : accession number index</li>
 *   <li>   acnum.hit : accession number auxiliary index</li>
 * </ul>
 *
 * <p>Internally, EMBOSS checks for big-endian architectures and
 * switches the byte order to little-endian. This means trouble if you
 * try to read the file using <code>DataInputStream</code>, which
 * assumes big-endian order, but at least the binaries are consistent
 * across architectures. This class carries out the necessary
 * conversion.</p>
 */
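
To illustrate the byte-order point above, here is a minimal sketch of
reading a little-endian 32-bit integer from an InputStream. It is not
the actual EmblCDROMIndexReader code; the class name LittleEndianSketch
and the method readIntLE are hypothetical and exist only for this
example.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of BioJava or EMBOSS.
class LittleEndianSketch {

    // Reads four bytes, least significant byte first, and returns an int.
    static int readIntLE(InputStream in) throws IOException {
        int b0 = in.read();
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        // read() returns -1 at end of stream, which makes the OR negative.
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException("stream ended inside a 4-byte integer");
        }
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    public static void main(String[] args) throws IOException {
        // The value 0x12345678 as it would be laid out little-endian on disk.
        byte[] raw = {0x78, 0x56, 0x34, 0x12};
        int value = readIntLE(new ByteArrayInputStream(raw));
        System.out.println(Integer.toHexString(value)); // prints 12345678
    }
}
```

DataInputStream.readInt() on the same four bytes would return
0x78563412, because it treats the first byte as the most significant.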

Subclasses:

org.biojava.bio.seq.db.emblcd.EntryNamIdxReader
org.biojava.bio.seq.db.emblcd.DivisionLkpReader
org.biojava.bio.seq.db.emblcd.AcnumTrgReader
org.biojava.bio.seq.db.emblcd.AcnumHitReader


-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA