[Open-bio-l] Status of OBDA and indexed flatfiles?

Mon Aug 31 14:01:46 UTC 2009

Hi Peter,

On Mon, 31 Aug 2009 13:07:45 +0100
Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi all,
> 
> I'm looking at indexing next generation sequence files for Biopython
> (e.g. FASTQ short read files with 10s of millions of entries), where
> even just holding the record names and their file offsets in memory
> is beginning to be a bottleneck.
>
> What is the current status of Open Biological Database Access (OBDA),
> and in particular the index files for sequence "flat files" like FASTA or
> GenBank (or FASTQ)?
> 
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> http://www.bioperl.org/wiki/HOWTO:OBDA
> http://obda.open-bio.org/
>
> The spec files are still in CVS (and ViewCVS is still broken since
> the recent server move), rather than having been migrated to SVN
> which may suggest things are obsolete (or on the bright side, stable).
> 
> Presumably BioPerl still uses these index files? What about the
> other projects? I know EMBOSS has some indexing system for
> example but I have no idea how it works internally.

BioRuby still uses them. To improve performance, names and offsets are
written to temporary files and then sorted with an external sort program
(/usr/bin/sort by default).
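
As a rough illustration of that approach, here is a minimal sketch in
Python (not BioRuby's actual code). The function names, the
tab-separated "name<TAB>offset" layout, and the assumption of strict
four-line FASTQ records are mine for the example and are not the OBDA
index format. It writes the pairs to a temporary file, lets the external
sort program order them, and then binary-searches the sorted file so
that the names never have to be held in memory.

```python
import os
import subprocess


def write_unsorted_index(fastq_path, raw_path):
    """Write one "name<TAB>offset" line per record, in file order."""
    with open(fastq_path, "rb") as fq, open(raw_path, "wb") as out:
        while True:
            offset = fq.tell()
            header = fq.readline()
            if not header:
                break
            fields = header[1:].split(None, 1)
            if not header.startswith(b"@") or not fields:
                raise ValueError("unexpected header at byte %d" % offset)
            out.write(fields[0] + b"\t" + str(offset).encode() + b"\n")
            for _ in range(3):  # skip the sequence, "+" and quality lines
                fq.readline()


def sort_index(raw_path, sorted_path, sort_cmd="sort"):
    """Sort the index by name using an external sort program."""
    env = dict(os.environ, LC_ALL="C")  # plain byte order, matches lookup()
    subprocess.run(
        [sort_cmd, "-t", "\t", "-k1,1", "-o", sorted_path, raw_path],
        check=True,
        env=env,
    )


def lookup(sorted_path, name):
    """Binary-search the sorted index; return the offset or None."""
    key = name.encode()
    with open(sorted_path, "rb") as idx:
        lo, hi = 0, os.path.getsize(sorted_path)
        # Invariant: a matching line, if any, starts at a byte in [lo, hi).
        while lo < hi:
            mid = (lo + hi) // 2
            if mid > 0:
                idx.seek(mid - 1)
                idx.readline()  # advance to the first line start >= mid
            else:
                idx.seek(0)
            start = idx.tell()
            line = idx.readline()
            if start >= hi or not line:
                hi = mid
                continue
            found, _, offset = line.rstrip(b"\n").partition(b"\t")
            if found == key:
                return int(offset)
            if found < key:
                lo = start + len(line)
            else:
                hi = mid
    return None
```

With this, write_unsorted_index("reads.fastq", "reads.raw") followed by
sort_index("reads.raw", "reads.idx") builds the index, and
lookup("reads.idx", "some_read") returns the byte offset to seek to in
the FASTQ file ("reads.fastq" and "some_read" are placeholder names).
Memory use stays small because only one index line is read at a time.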

In BioRuby, the flat-file-only solution works fine, but the BerkeleyDB
indexes would be incompatible with other projects because of confusion
in the spec, as discussed in BioPerl Bugzilla bug #2337.
http://bugzilla.open-bio.org/show_bug.cgi?id=2337

Thanks,

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org