[Open-bio-l] Status of OBDA and indexed flatfiles?

Wed Sep 2 06:45:08 UTC 2009

Hi,

On Mon, 31 Aug 2009 16:07:28 +0100
Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Mon, Aug 31, 2009 at 3:01 PM, Naohisa
> GOTO<ngoto at gen-info.osaka-u.ac.jp> wrote:
> > Hi Peter,
> >
> >> Presumably BioPerl still uses these index files? What about the
> >> other projects? I know EMBOSS has some indexing system for
> >> example but I have no idea how it works internally.
> >
> > BioRuby still uses them. To gain performance, names and offsets are
> > written to temporary files and using external sort program (default
> > /usr/bin/sort).
> 
> That makes sense. Have you tried this on very large files? e.g.
> FASTA with 10 million short reads?

Using BioRuby's br_bioflat.rb on a Linux server
(CPU: Pentium D 3.4GHz, memory: 4GB, HDD: SATA 300GB),
it takes about 43 minutes to create a flat-file index of
10,000,000 randomly generated FASTA sequences (each 100-500 bp
long, total file size about 3 GB).
Retrieving 10,000 sequences from the index takes 133 seconds
(about 13 ms per sequence) on the same server.
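
For readers unfamiliar with the name/offset approach mentioned above,
here is a minimal sketch of the general idea in Python. It is not
BioRuby's implementation and does not write the OBDA on-disk format:
the function names build_index and fetch are hypothetical, the index
is kept in memory, and Python's built-in sort stands in for the
external sort program. It only illustrates recording the byte offset
of each header line, sorting by name, and seeking directly to a record.

```python
import bisect
import io


def build_index(handle):
    """Scan a FASTA file opened in binary mode.

    Returns a list of (name, byte_offset) pairs sorted by name.
    Assumption: the name is the first whitespace-delimited token
    after '>' on the header line.
    """
    entries = []
    while True:
        offset = handle.tell()          # position of the line about to be read
        line = handle.readline()
        if not line:                    # end of file
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            name = fields[0].decode("ascii") if fields else ""
            entries.append((name, offset))
    entries.sort()                      # in-memory stand-in for an external sort
    return entries


def fetch(handle, index, name):
    """Binary-search the sorted index, seek to the offset, return one record."""
    i = bisect.bisect_left(index, (name,))
    if i == len(index) or index[i][0] != name:
        raise KeyError(name)
    handle.seek(index[i][1])
    record = [handle.readline()]        # the header line itself
    for line in handle:
        if line.startswith(b">"):       # next record starts here
            break
        record.append(line)
    return b"".join(record).decode("ascii")


if __name__ == "__main__":
    # Tiny in-memory example; a real file would be opened with open(path, "rb").
    demo = io.BytesIO(b">read2 sample\nACGT\nGG\n>read1\nTTTT\n")
    idx = build_index(demo)
    print(fetch(demo, idx, "read1"), end="")
```

With an index like this a lookup costs one binary search plus one
seek, instead of rescanning the whole file. For files with tens of
millions of entries the in-memory list would itself become large,
which is where writing the pairs to disk and sorting them externally
comes in.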

Naohisa Goto
ng at bioruby.org / ngoto at gen-info.osaka-u.ac.jp