[BioRuby] using Bio::FlatFileIndex

Tue Dec 11 14:59:52 UTC 2007

Hi,

Indexes can be generated with the command-line application br_bioflat.rb
or from within a Ruby script.

Example: create an index from the command line:

% br_bioflat.rb --create --type flat --location /home/xx/dbidx \
  --dbname test --files /home/xx/test01.fst /home/xx/test02.fst

equivalent ruby script:

  require 'bio'
  is_bdb = nil # is_bdb = Bio::FlatFileIndex::MAGIC_BDB for BDB index
  dbname = '/home/xx/dbidx/test'
  format = nil # file format is automatically determined
  options = {}
  files = ['/home/xx/test01.fst', '/home/xx/test02.fst' ]
  Bio::FlatFileIndex.makeindex(is_bdb, dbname, format, options, *files)

Bio::FlatFileIndex was first written in 2002, so the API is old
and ugly, and its internal structure is overly complicated.
It may be rewritten, and the API might change in the future.

Add files to the index:

% br_bioflat.rb --update --location /home/xx/dbidx \
  --dbname test --files /home/xx/test03.fst /home/xx/test04.fst

equivalent ruby script:

  require 'bio'
  dbname = '/home/xx/dbidx/test'
  options = {}
  files = ['/home/xx/test03.fst', '/home/xx/test04.fst' ]
  Bio::FlatFileIndex::update_index(dbname, nil, options, *files)

Re-read all files and re-generate the index:

% br_bioflat.rb --update --location /home/xx/dbidx \
  --dbname test --renew

equivalent ruby script:

  require 'bio'
  dbname = '/home/xx/dbidx/test'
  options = {}
  options['renew'] = true
  Bio::FlatFileIndex::update_index(dbname, nil, options)

Note that adding files to, or updating, a flat index (without BDB)
is very slow because it actually rebuilds the indexes.

Retrieve sequences from the index:

% br_bioflat.rb --location /home/xx/dbidx --dbname test M12963

equivalent ruby script:

  require 'bio'
  dbname = '/home/xx/dbidx/test'
  key = 'M12963'
  idx = Bio::FlatFileIndex.open(dbname)
  results = idx.search(key)
  results.each do |str|
    print str
  end
  idx.close

'results' is a Bio::FlatFileIndex::Results object.
Each search result is a string.

(For more information, please see RDoc
http://bioruby.org/rdoc/classes/Bio/FlatFileIndex/Results.html )

If you want a subsequence of FASTA-formatted data,
for example:

  require 'bio'
  dbname = '/home/xx/dbidx/test'
  key = 'M12963'
  idx = Bio::FlatFileIndex.open(dbname) # open the index before searching
  result = idx.search(key)
  result.each do |str|
    ent = Bio::FastaFormat.new(str)
    # for nucleic acid sequence
    puts ent.naseq[0..100]
    # for amino acid sequence
    puts ent.aaseq[0..100]
    # nucleic or amino acid sequence
    puts ent.seq[0..100]
  end
  idx.close
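
A note on coordinates: as far as I know, the BioPerl call in the
question, seq($accession, $start, $end), takes 1-based inclusive
positions, while Ruby ranges such as [0..100] are 0-based, so 0..100
returns the first 101 residues. Below is a minimal sketch of the
conversion in plain Ruby. The helper names subseq_1based and
fasta_sequence are hypothetical (they are not part of BioRuby), and
the entry is made-up toy data.

```ruby
# Hypothetical helper (not part of BioRuby): returns the residues from
# position `first` to `last`, both 1-based and inclusive.
def subseq_1based(seq, first, last)
  raise ArgumentError, 'first must be >= 1' if first < 1
  raise ArgumentError, 'last must be >= first' if last < first
  # Ruby strings are 0-based, so shift both ends down by one.
  # Positions past the end of the sequence are silently truncated.
  seq[(first - 1)..(last - 1)].to_s
end

# Hypothetical helper: extracts the bare sequence from one FASTA entry
# string, such as a string yielded by the search results.
def fasta_sequence(entry)
  lines = entry.lines.map(&:chomp)
  # Drop the definition line if there is one.
  lines.shift if lines.first && lines.first.start_with?('>')
  lines.join
end

# Toy entry, made up for this example.
entry = ">test01 toy entry\nACGTACGT\nTTGGCCAA\n"
seq = fasta_sequence(entry)
puts subseq_1based(seq, 3, 10)
```

With Bio::FastaFormat, the same slicing could presumably be applied
to the value of ent.seq instead of the plain string used here.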

Please see the OBDA flat file indexing specifications
for the philosophy and internal structure of the index.

http://code.open-bio.org/cgi/viewcvs.cgi/obda-specs/flatfile/?cvsroot=obf-common
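
As a rough illustration of the general idea behind a flat file index:
the index records, for each entry ID, where the entry starts in the
file and how long it is, so an entry can be fetched with a seek
instead of rescanning the file. The sketch below is only conceptual,
in plain Ruby. It is not the OBDA on-disk format and not how
Bio::FlatFileIndex is implemented. The names build_offsets and
fetch_entry are hypothetical, and the data is made up.

```ruby
require 'tempfile'

# Hypothetical helper: scans a FASTA file once and records
# id => [byte offset, byte length] for every entry.
def build_offsets(path)
  offsets = {}
  current = nil
  File.open(path, 'rb') do |f|
    until f.eof?
      pos = f.pos
      line = f.readline
      if line.start_with?('>')
        current = line[1..-1].split.first # ID = first word after '>'
        offsets[current] = [pos, 0]
      end
      # Every line, including the header, counts toward the entry length.
      offsets[current][1] += line.bytesize if current
    end
  end
  offsets
end

# Hypothetical helper: jumps straight to one entry and reads only its bytes.
def fetch_entry(path, offsets, id)
  pos, len = offsets.fetch(id)
  File.open(path, 'rb') do |f|
    f.seek(pos)
    f.read(len)
  end
end

# Toy data, made up for this example.
Tempfile.create('toy.fst') do |tmp|
  tmp.write(">seq1 first\nACGT\nACGT\n>seq2 second\nTTTT\n")
  tmp.flush
  offsets = build_offsets(tmp.path)
  print fetch_entry(tmp.path, offsets, 'seq2')
end
```

Here the offsets live in an in-memory Hash that is rebuilt on each
run. A persistent index stores that table on disk, which is why the
index has to be updated when the underlying files change.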

Thanks,

Naohisa Goto
ng at bioruby.org / ngoto at gen-info.osaka-u.ac.jp

On Mon, 10 Dec 2007 17:21:31 -0000
"Schwach Frank Dr (CMP)" <F.Schwach at uea.ac.uk> wrote:

> 
> Hi,
> 
> I need to retrieve sequences from fasta files. In Perl I used to do this with Bio::DB::Fasta but at first I couldn't find an equivalent in Bioruby and was almost about to give up and use Perl for this purpose when I found Bio::FlatFileIndex.
> Unfortunately, this class is not very well documented (unless I missed something). I think I can more or less figure out most of it from the code and the comments in the rdoc (http://bioruby.org/rdoc/classes/Bio/FlatFileIndex.html) but it would really be great to have some examples from people who are more familiar with this class, especially since I am relatively new to Ruby still.
> 
> What I want to do is simply:
> 
> 1) Build an index for a directory containing a few fasta files
> 2) In a Rails App (or any other Ruby script): retrieve sequences by their accessions and update the index if the fasta db is updated by the user.
> 
> Some of the questions I have are:
> What are the options that I can pass to the makeindex method?
> In Bioperl it is possible to retrieve a subsequence straight away like this:
> 
>  my $seq_db_obj = Bio::DB::Fasta->new($path_to_db); 
>  my $seq = $seq_db_obj->seq($accession, $start, $end) ; # retrieve (sub)sequence from the database
> 
> Can I do this in Ruby too or would I retrieve the entire sequence and then get the subsequence from that?
> 
> Any help and examples welcome!
> Thanks a lot!