[BioRuby] BioRuby's Bio::FlatFileIndex compatibility with BioPerl's Bio::DB::Flat

Sun Jul 22 10:25:00 UTC 2007

Hello,

I'm a maintainer of Bio::FlatFileIndex in BioRuby.

On Fri, 20 Jul 2007 14:54:43 -0400
"Aidan Findlater" <aidanfindlater at gmail.com> wrote:

> *Summary:* Attached is a diff that allows Bio::FlatFileIndex to access BDB
> flatfile databases created by BioPerl. I have not changed the way BioRuby
> creates its databases, so this likely breaks access to BioRuby-created
> flatfiles.
> 
> 
> *Description:* I have some flatfile databases that were created with
> BioPerl, but it seems that BioRuby does things a little differently.
> Specifically, BioRuby tries to get config and fileid information from BDB
> databases; BioPerl stores this information in config.dat.

The OBDA flat-file indexing specification (*1) says that
configuration data is stored in the BDB database, not in config.dat.

(excerpted from indexing.txt (*1))
| 2) The subdirectory contains a file named "config.dat" containing tab
| separated key/value pairs.  The first line contains the key "index"
| and value "index\tBerkeleyDB/1".  This means the first few characters
| of the config.dat file is "index\tBerkeleyDB/1\n".
| 
| There is no other data in this file.
| 
| 3) Global configuration data is stored in the database named "config".

The specification text was last modified 5 years ago,
and it may have been changed somewhere I don't know of.
Does anyone know of changes to the specification,
or where to get the current specification text?

*1 http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/obda-specs/flatfile/indexing.txt?rev=1.3&cvsroot=obf-common&content-type=text/vnd.viewcvs-markup
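
As a rough illustration of the excerpted rule, the following Ruby sketch
checks what config.dat declares. The helper names obda_index_type and
config_dat_minimal? are hypothetical (they are not part of BioRuby), and
the check assumes the revision 1.3 wording quoted above still applies.

```ruby
require "tmpdir"

# Hypothetical helper (not part of BioRuby): returns the index type
# declared on the first line of config.dat, e.g. "BerkeleyDB/1".
def obda_index_type(dbdir)
  first_line = File.open(File.join(dbdir, "config.dat"), "rb") { |f| f.gets }
  key, value = first_line.to_s.chomp.split("\t", 2)
  key == "index" ? value : nil
end

# Hypothetical helper: true when config.dat holds only the single
# "index" line, as the quoted excerpt requires for BerkeleyDB indexes.
def config_dat_minimal?(dbdir)
  File.binread(File.join(dbdir, "config.dat")) == "index\tBerkeleyDB/1\n"
end

# Demonstration against a throwaway directory, not a real database.
Dir.mktmpdir do |dir|
  File.binwrite(File.join(dir, "config.dat"), "index\tBerkeleyDB/1\n")
  puts obda_index_type(dir)
  puts config_dat_minimal?(dir)
end
```

Running config_dat_minimal? against a BioPerl-created database directory
would show whether extra configuration lines were written to config.dat.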

> As well, it returns sequences shifted one character to the right (the '>'
> from my FASTA file was at the end of the returned sequence, and none was at
> the beginning).

I suspect this is an issue in BioPerl's indexer.

I prepared the file /tmp/flat/tmp.fst as below.
-----------------------------------------------------------
>TEST00001                                    EOL
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>TEST00002                                    EOL
ccccccccccccccccccccccccccccccccccccccccccccccccc
>TEST00003                                    EOL
ggggggggggggggggggggggggggggggggggggggggggggggggg
>TEST00004                                    EOL
ttttttttttttttttttttttttttttttttttttttttttttttttt
-----------------------------------------------------------
(Each line of the above file is 50 bytes on UNIX.)

% bp_bioflat_index.pl --create --format fasta \
  --location /tmp/flat --dbname testbdb --indextype bdb \
  /tmp/flat/tmp.fst

Then I checked the contents of the generated BDB index.

% ruby -r bdb -e 'BDB::Btree.open("/tmp/flat/testbdb/key_ACC").to_a.sort.each { |x| puts x.join("\t") }'
TEST00001       0       0       101
TEST00002       0       101     100
TEST00003       0       201     100
TEST00004       0       301     99

(Each column shows ID, FileID, start position, and size.)

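The start positions of TEST00002, TEST00003, and TEST00004
are wrong, and the sizes of TEST00001 and TEST00004 are wrong.
Each record is two 50-byte lines, so the expected start positions
are 0, 100, 200, and 300, and every size should be 100.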
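
The expected offsets can be computed independently of either indexer.
The sketch below is only an illustration: fasta_record_offsets and
SAMPLE_FASTA are hypothetical names, and the sample data is rebuilt in
code on the assumption that every line is 49 characters plus a newline.

```ruby
require "tempfile"

# Hypothetical helper (neither BioRuby nor BioPerl code): scans a FASTA
# file and returns [id, start_offset, size_in_bytes] for each record.
def fasta_record_offsets(path)
  records = []
  File.open(path, "rb") do |f|
    until f.eof?
      pos = f.pos
      line = f.gets
      if line.start_with?(">")
        id = line[1..-1].split(/\s+/, 2).first
        records << [id, pos, 0]
      end
      # Every line after the first header belongs to the latest record.
      records.last[2] += line.bytesize unless records.empty?
    end
  end
  records
end

# Assumed reconstruction of the test file: four records, each made of
# a 50-byte header line and a 50-byte sequence line.
SAMPLE_FASTA = %w[a c g t].each_with_index.map { |base, i|
  format(">TEST%05d", i + 1).ljust(46) + "EOL\n" + base * 49 + "\n"
}.join

Tempfile.create("tmp.fst") do |f|
  f.binmode
  f.write(SAMPLE_FASTA)
  f.flush
  fasta_record_offsets(f.path).each do |id, start, size|
    puts [id, 0, start, size].join("\t")  # ID, FileID, start, size
  end
end
```

For this sample the sketch reports start positions 0, 100, 200, and 300,
each with size 100.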

I'm using BioPerl 1.5.2_102.

% perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.005002102

In addition, I also tried the flat index type.

% bp_bioflat_index.pl --create --format fasta \
  --location /tmp/flat --dbname testflat --indextype flat \
  /tmp/flat/tmp.fst

% cat testflat2/key_ACC.key
  19TEST00001   0       0       100  TEST00002  0       100     100TEST00003 0       200     100TEST00004    0       300     50 

It seems that the index is created correctly.
However, according to the specification (*1),
the first 4 bytes of the key_ACC.key file should be "0019",
but they were "  19" in the above index created with BioPerl.

(excerpted from indexing.txt (*1))
| Each record of this file is in a fixed width format.  There is no
| special termination character.  Instead, the first four bytes of the
| file contain the mapping record size, in bytes, represented as text
| string.  The string is left padded with zeros to fit in four bytes, so
| the allowed text strings are "0000", "0001", "0002", ..., "9999".
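
To make the padding rule concrete, here is a small Ruby sketch. Both
helpers are hypothetical and are not taken from Bio::FlatFileIndex; the
lenient mode is only an assumption about how a reader could tolerate the
space-padded header seen above.

```ruby
# Hypothetical helper: formats the record size as the excerpt describes,
# left padded with zeros to exactly four bytes.
def format_record_size(size)
  unless size.is_a?(Integer) && (0..9999).cover?(size)
    raise ArgumentError, "record size out of range: #{size.inspect}"
  end
  format("%04d", size)
end

# Hypothetical helper: parses the four-byte header of a key file.
# strict: true accepts only "0000".."9999"; strict: false also accepts
# leading spaces, as in the "  19" header shown above.
def parse_record_size(header, strict: true)
  pattern = strict ? /\A\d{4}\z/ : /\A *\d+\z/
  unless header.bytesize == 4 && header.match?(pattern)
    raise ArgumentError, "bad record size header: #{header.inspect}"
  end
  header.to_i
end

puts format_record_size(19)
puts parse_record_size("0019")
puts parse_record_size("  19", strict: false)
```

In strict mode, parse_record_size("  19") raises ArgumentError, which
matches the letter of the excerpt.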

Regards,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ngoto at bioruby.org