[EMBOSS] EMBOSS 5.0

Fri Jul 20 22:21:55 UTC 2007

Thank you for v5.0. (It compiles fine for me on Mandriva Linux 2005
and the Ubuntu Linux 6.06 server version.)

Below are my observations from trying to install EMBL locally; I did
it my own way in the end, so this may not be very useful for this
list. I do not need very fast entry access, just a local cache that
avoids request-flooding EBI/NCBI for entries and reduces network
traffic.

First I tried dbxflat. It works fine, but indexing takes time; I
estimated a week's run time to index the release, and close to a
day for the daily updates on average (the planned incremental
indexing will help). This means a machine has to be dedicated to
keeping EMBL up to date, because the indexing is CPU-bound. Not
unreasonable, but I wanted it to work on a cheapo external hard
drive, say, and to be ready sooner.

Then I tried BioPerl. It looked like 3 weeks for that run to finish,
so not workable. I have not looked at how many entries change between
releases, but having the new release built soon after it is available
must be good.

Then I tried putting each record in a directory derived from its
accession number: AACI02000001 would be put in AACI/0200, AX101010
in AX1/010, and so on. Each directory has a two-column table
(LOOKUP_LIST) with lines like these,

AACI02000001.1  1
AACI02000002.1  1
AACI02000003.1  1
AACI02000004.1  2
AACI02000005.1  2
AACI02000006.1  2
AACI02000007.1  2

where column 1 is the versioned accession number and column 2 is
the name of the file that contains the corresponding entry. The
entry files are gzip-compressed and named 1.gz, 2.gz, etc. The
release files stay compressed and are deleted after splitting, so
the total extra space required does not exceed 20% or so of the
distribution size. Creation time is about 26 hours, and a typical
daily file takes 2-4 minutes; download and import can run in
parallel (not done by threads of course, but by launching the script
twice). I tried to strike a balance between disk and RAM by caching
file handles etc., and RAM use does not exceed 330 MB at any time
(and stays much lower most of the time). To access a record, I do
"zcat $file | seqret ..... " and the output is then parsed by
BioPerl. The access time varies between 0.03 and 0.3 seconds
depending on size, time since last access, speed of the disk,
compression ratio and logic, humidity outside etc.
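
For anyone who wants to try the same layout, here is a minimal
Python sketch of the lookup step described above. It is not the
script used for the numbers in this post. The sharding rule in
shard_dir() is only a guess that happens to fit the two examples
given (12-character accessions split 4+4, everything else 3+3), and
the names shard_dir, find_entry_file and open_entry_file are made up
for illustration.

```python
import gzip
import os


def shard_dir(accession):
    """Return the relative directory for an accession, e.g. 'AACI/0200'.

    ASSUMPTION: the rule below is inferred from two examples only
    (AACI02000001 -> AACI/0200, AX101010 -> AX1/010); the real rule
    may differ for other accession formats.
    """
    acc = accession.split(".")[0]            # drop a version suffix such as ".1"
    width = 4 if len(acc) == 12 else 3       # hypothetical: 4+4 or 3+3 split
    return f"{acc[:width]}/{acc[width:2 * width]}"


def find_entry_file(root, versioned_accession):
    """Look up a versioned accession in its directory's LOOKUP_LIST.

    Returns the path of the .gz file that holds the entry, or None
    if the accession is not listed.
    """
    directory = os.path.join(root, shard_dir(versioned_accession))
    with open(os.path.join(directory, "LOOKUP_LIST")) as table:
        for line in table:
            fields = line.split()            # column 1: accession, column 2: file name
            if len(fields) == 2 and fields[0] == versioned_accession:
                return os.path.join(directory, fields[1] + ".gz")
    return None


def open_entry_file(root, versioned_accession):
    """Open the compressed file holding the entry as a text stream."""
    path = find_entry_file(root, versioned_accession)
    if path is None:
        raise KeyError(versioned_accession)
    return gzip.open(path, "rt")
```

Note that one .gz file holds several entries (in the table above,
AACI02000004.1 to AACI02000007.1 all live in file 2), so the stream
returned by open_entry_file() still has to be handed to something
that picks out the wanted record.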

Well, I may again have redone something that already exists, but at
least it filled my little need. I don't know if others have had the
same need, or if it is a feature that EMBOSS should have.

Niels L

PS - one of my mistakes was to get a big slow USB 2 drive under
Linux. The drive is ok, but the ext3 file system broke completely.
I was advised to use FireWire or ATA/SATA instead, which also allow
health monitoring with smartctl et al. (USB does not).