[Biojava-l] extra seqDB things to add

Matthew Pocock mrp@sanger.ac.uk
Thu, 13 Jul 2000 16:59:19 +0100


Hi.

Gerald Loeffler wrote:

> Matthew Pocock wrote:
> > FileIndexerSequenceDB
> > - indexes a list of files
> > - uses a normal seq.io object to specify the format
> > - creates a file bla.index with the indexing info
> > - possibly auto-manages updates using file dates/times
>
> could use Berkeley DB (which has a Java API) for indexing so as not to
> reinvent the wheel...

In practice it's a fairly simple-stupid wheel.

> On the other hand I'm not sure whether it's wise to introduce yet
> another indexing mechanism - we already have NCBI-BLAST, WU-BLAST, SRS
> which all index the (huge) sequence databases in incompatible ways.
> Wouldn't it be better to write a SRSSequenceDB which would be a
> SequenceDB that
>         o either knows how to decipher the SRS index files and create Sequence
> objects from that
>         o or (alternatively) knows how to load a sequence file (in e.g. EMBL
> format) from the command-line (getz) or web-version of SRS and construct
> a Sequence object based on that,
>         o or (alternatively) knows how to load a sequence file (in GenBank
> format) from Entrez and construct a Sequence object based on that.
>
>         cheers,
>         gerald
>

SRSSequenceDB would be great. I like the idea of reading the SRS index files. Are
they intelligible? A FetcherSequenceDB that you parameterize with a little fetch
method and a sequence format would also be good to have around (we could provide
getz & wgetz, efetch etc. implementations).
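
The fetcher idea could look roughly like the sketch below. Every name in it
(RecordFetcher, RecordParser, FetcherSequenceDb, CommandFetcher) is hypothetical
and not part of BioJava, and RecordParser is only a stand-in for whichever
seq.io format object would do the real parsing. The command line is supplied
entirely by the caller, so no getz or efetch options are assumed here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: retrieves the raw record for one id from an external source. */
interface RecordFetcher {
    InputStream fetch(String id) throws IOException;
}

/** Hypothetical stand-in for a sequence-format parser. */
interface RecordParser<S> {
    S parse(InputStream in) throws IOException;
}

/** Hypothetical database that fetches a record on demand and parses it. */
class FetcherSequenceDb<S> {
    private final RecordFetcher fetcher;
    private final RecordParser<S> parser;

    FetcherSequenceDb(RecordFetcher fetcher, RecordParser<S> parser) {
        this.fetcher = fetcher;
        this.parser = parser;
    }

    /** Fetches the record for the given id and hands the stream to the parser. */
    S getSequence(String id) throws IOException {
        try (InputStream in = fetcher.fetch(id)) {
            return parser.parse(in);
        }
    }
}

/** Fetcher that runs an external program and appends the id as the last argument. */
class CommandFetcher implements RecordFetcher {
    private final List<String> commandPrefix;

    /** The prefix is the program name plus any options, chosen by the caller. */
    CommandFetcher(List<String> commandPrefix) {
        this.commandPrefix = new ArrayList<>(commandPrefix);
    }

    @Override
    public InputStream fetch(String id) throws IOException {
        List<String> command = new ArrayList<>(commandPrefix);
        command.add(id);
        // Keep stderr out of the record stream; the exit status is not checked here.
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        return process.getInputStream();
    }
}
```

A getz, wgetz or efetch implementation would then differ only in the
RecordFetcher it plugs in, while the format stays an independent parameter.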

The indexer is really aimed at the relatively common case where you have 3
fasta-files with your interesting sequences spread among them (exons between 150
and 230 nt long from sachDB), and need random access to them. The files are not
integrated into SRS, as only you think that they are interesting, and SRS is scary.
It then allows you to do a getSequence(id), and efficiently pull out the
appropriate chunk of the file. Next week, you blow these files away, and forget all
about them (you are now interested in introns containing repeat elements from
mouse).
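
The getSequence(id) lookup described above can be sketched as follows. This is
a minimal illustration, assuming plain FASTA input with the id as the first
word of each header line; the class name FastaOffsetIndex and its methods are
hypothetical, not BioJava API. It keeps the index in memory only, so writing it
out to a bla.index file and checking file dates are left out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical sketch: maps sequence ids to byte offsets in FASTA files. */
class FastaOffsetIndex {
    /** Where a record starts: the file and the byte offset of its '>' line. */
    private static final class Location {
        final Path file;
        final long offset;

        Location(Path file, long offset) {
            this.file = file;
            this.offset = offset;
        }
    }

    private final Map<String, Location> locations = new HashMap<>();

    /** Scans each file once, recording the offset of every header line. */
    static FastaOffsetIndex build(List<Path> files) throws IOException {
        FastaOffsetIndex index = new FastaOffsetIndex();
        for (Path file : files) {
            try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
                long offset = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith(">")) {
                        index.locations.put(idOf(line), new Location(file, offset));
                    }
                    // Offset of the line that the next readLine() will return.
                    offset = in.getFilePointer();
                }
            }
        }
        return index;
    }

    /** Seeks straight to the record and reads residues up to the next header. */
    String getSequence(String id) throws IOException {
        Location loc = locations.get(id);
        if (loc == null) {
            throw new NoSuchElementException("No sequence with id " + id);
        }
        try (RandomAccessFile in = new RandomAccessFile(loc.file.toFile(), "r")) {
            in.seek(loc.offset);
            in.readLine(); // skip the header line itself
            StringBuilder residues = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null && !line.startsWith(">")) {
                residues.append(line.trim());
            }
            return residues.toString();
        }
    }

    /** The id is taken to be the first whitespace-delimited word after '>'. */
    private static String idOf(String header) {
        return header.substring(1).trim().split("\\s+", 2)[0];
    }
}
```

Building the index costs one pass over each file; after that a lookup reads
only the bytes of the requested record. RandomAccessFile.readLine() is
unbuffered, which is acceptable for a handful of small files but would want a
buffered scan for anything large.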

Am I trying to create a solution for which there is no problem?

Matthew