[Biojava-l] Sequence retrieval

Fri Jul 25 10:44:10 EDT 2003

>>>>> "Jeffrey" == Jeffrey Rosenfeld <jeffr at amnh.org> writes:

    Jeffrey> I am new to this list, so my question might have already
    Jeffrey> been discussed, but I cannot find any reference to it in
    Jeffrey> the archive, so here goes: I am trying to find a quick
    Jeffrey> java-only way to retrieve sequences from a blast
    Jeffrey> database.  I am writing a program that needs to obtain
    Jeffrey> large amounts of sequences from a fairly large database.
    Jeffrey> I have tried using fastacmd, but there is a great
    Jeffrey> slowdown because of the need to start up an external
    Jeffrey> process for each sequence query.  (I cannot execute one
    Jeffrey> large fastacmd job because of the large amounts of
    Jeffrey> sequence that I am querying.) I know that biojava has
    Jeffrey> many different formats for storing sequences, but I don't
    Jeffrey> want to have to keep two databases of my sequences
    Jeffrey> updated.  I am already using the blast database for
    Jeffrey> blast, so I don't want another database.  Is there a
    Jeffrey> simple way to implement fastacmd or something similar in
    Jeffrey> java?  It should not be too hard to do either using JNI
    Jeffrey> or reverse engineering the fastacmd code.

Hi Jeffrey,

This is possible, but you would at least need to make a new
(additional) index of the Blast database. Biojava does not have a
reader for Blast indices because their format differs between the
NCBI and WU flavours and is also apt to change.

Brief background on the available indices - we started with our own
system (see interfaces org.biojava.bio.seq.db.Index,
org.biojava.bio.seq.db.IndexStore and the TabIndexStore implementation
of IndexStore).

Later an indexing system common to all the Bio* projects was proposed
and implemented (e.g. you can index with Bioperl and read in Biopython
etc). See the obf-common cvs package for a full spec and other docs
via webcvs at http://cvs.open-bio.org. This is quite heavily
integrated with a system-wide registry for local and distributed
databases (also described in the obf-common docs), which you won't need
to worry about as you just want a simple lookup.

To use this system, there is an end-user indexing program,
org.biojava.app.BioFlatIndex, which creates the index (actually a
directory containing metadata and offsets into sequence
files). Alternatively you can programmatically index using the
org.biojava.bio.program.indexdb.IndexTools class. See the unit tests
(in cvs, org.biojava.bio.program.indexdb.IndexToolsTest) for examples
such as:

    public void testIndexFastaDNA() throws Exception
    {
        File [] files = getDBFiles(new String [] { "dna1.fasta",
                                                   "dna2.fasta" });
        IndexTools.indexFasta("test", new File(location),
                              files, SeqIOConstants.DNA);

        SequenceDBLite db = new FlatSequenceDB(location, "dna");

        Sequence seq1 = db.getSequence("id1");
        assertEquals("gatatcgatt", seq1.seqString());
        Sequence seq2 = db.getSequence("id2");
        assertEquals("ggcgcgcgcg", seq2.seqString());
        Sequence seq3 = db.getSequence("id3");
        assertEquals("ccccccccta", seq3.seqString());
        Sequence seq4 = db.getSequence("id4");
        assertEquals("tttttcgatt", seq4.seqString());
        Sequence seq5 = db.getSequence("id5");
        assertEquals("ggttcgcgcg", seq5.seqString());
        Sequence seq6 = db.getSequence("id6");
        assertEquals("nnnnnnttna", seq6.seqString());
    }
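
To make the "offsets into sequence files" idea concrete, here is a
minimal sketch in plain Java. It is not Biojava code and it does not
read or write the OBDA flat index format; the class name
FastaOffsetIndex and its fetch method are hypothetical, chosen only
for illustration. It assumes an ASCII FASTA file, takes the id to be
the first whitespace-delimited word after '>', and keeps the index in
memory rather than on disk.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Biojava. */
final class FastaOffsetIndex {
    // Maps each sequence id to the byte offset of its '>' header line.
    private final Map<String, Long> offsets = new HashMap<>();
    private final Path fasta;

    /** Scans the file once and records where every record starts. */
    FastaOffsetIndex(Path fasta) throws IOException {
        this.fasta = fasta;
        try (RandomAccessFile in = new RandomAccessFile(fasta.toFile(), "r")) {
            long pos = in.getFilePointer();
            String line;
            // RandomAccessFile.readLine is unbuffered and slow, but it keeps
            // the file pointer exact, which is what the index needs.
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Assumption: id is the first word after '>'.
                    String id = line.substring(1).trim().split("\\s+")[0];
                    offsets.put(id, pos);
                }
                pos = in.getFilePointer();
            }
        }
    }

    /** Returns the sequence for id, or null if the id is not indexed. */
    String fetch(String id) throws IOException {
        Long offset = offsets.get(id);
        if (offset == null) {
            return null;
        }
        StringBuilder seq = new StringBuilder();
        try (RandomAccessFile in = new RandomAccessFile(fasta.toFile(), "r")) {
            in.seek(offset);   // jump straight to the record
            in.readLine();     // skip the header line
            String line;
            while ((line = in.readLine()) != null && !line.startsWith(">")) {
                seq.append(line.trim());
            }
        }
        return seq.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FastaOffsetIndex <fasta file> <sequence id>
        FastaOffsetIndex index = new FastaOffsetIndex(Paths.get(args[0]));
        System.out.println(index.fetch(args[1]));
    }
}
```

Each fetch seeks directly to the stored offset, so no external process
is started per query. For a large database you would want to persist
the offsets instead of rebuilding them at startup, which is what the
on-disk indices described above do.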

Finally, the binary indices created by the Staden package and EMBOSS
(EMBL CD-ROM format) are also supported. If you index your flatfiles
with dbifasta/dbiblast you can read the EMBOSS indices from Biojava
with a little effort. This uses an EmblCDROM implementation of our
old IndexStore interface. The unit tests
(org.biojava.bio.seq.db.EmblCDROMIndexStoreTest) should prove useful:

    // Locate the two EMBL CD-ROM index files (test resources here).
    URL divURL =
        EmblCDROMIndexStoreTest.class.getResource("emblcd/division.lkp");
    URL entURL =
        EmblCDROMIndexStoreTest.class.getResource("emblcd/entrynam.idx");

    File divisionLkp = new File(divURL.getFile());
    File entryNamIdx = new File(entURL.getFile());

    // These are fields in the unit test; declared as locals here so the
    // excerpt shows their types.
    SequenceFormat format = new FastaFormat();
    Alphabet alpha = ProteinTools.getAlphabet();
    SymbolTokenization parser = alpha.getTokenization("token");
    SequenceBuilderFactory factory =
        new FastaDescriptionLineParser.Factory(SimpleSequenceBuilder.FACTORY);

    EmblCDROMIndexStore emblCDIndexStore =
        new EmblCDROMIndexStore(divisionLkp,
                                entryNamIdx,
                                format,
                                factory,
                                parser);

    // Sequence file names in the index are resolved relative to this
    // directory.
    emblCDIndexStore.setPathPrefix(entryNamIdx.getParentFile().getAbsoluteFile());

    SequenceDB sequenceDB = new IndexedSequenceDB(emblCDIndexStore);

and later...

        // Test actual sequence fetches
        Sequence seq = sequenceDB.getSequence("NMA0007");
        assertEquals("NMA0007", seq.getName());
        assertEquals(235, seq.length());

        seq = sequenceDB.getSequence("NMA0020");
        assertEquals("NMA0020", seq.getName());
        assertEquals(494, seq.length());

        seq = sequenceDB.getSequence("NMA0030");
        assertEquals("NMA0030", seq.getName());
        assertEquals(245, seq.length());

Hope this is useful,

Keith

-- 

- Keith James <kdj at sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -