EMBOSS - Indexing breaks on large databases

Len F. Zaifman leonardz at bioinfo.sickkids.on.ca
Wed Feb 7 21:32:35 UTC 2001


I have installed EMBOSS 1.9.1 on an O2000. It installed nicely once I
gave up on building it with shared libraries.

The issue came up when indexing GenBank files. Most divisions indexed
fine with dbiflat. However, when I try to index the est division, or all
of GenBank, the indexing breaks because sort runs out of memory:

Explicitly, I run

dbiflat -idformat GB -directory /data/genbank \
    -indexdirectory /tools/emboss1.9.1/data/indices/est \
    -dbname GenBankEst -filenames gbest*.seq -date 06/02/01 \
    -sortoptions '-T /tmp_disk/scratch4/applicat/est -k1,1'

and

dbiflat -idformat GB -directory /data/genbank \
    -indexdirectory /tools/emboss1.9.1/data/indices/genbank \
    -dbname GenBank -filenames *.seq -date 06/02/01 \
    -sortoptions '-T /tmp_disk/scratch4/applicat/genbank -k1,1'

and get

	UX:sort: ERROR: Out of memory before merge: Not enough space


sort is run with -T /scratch4 -k1,1, where scratch4 has a 10 GB quota.
I checked the environment, and it is using the system sort (/bin/sort).
There were no syslog errors.

All the other, smaller divisions seemed to work. I have a scheduled
reboot at which I am going to raise the maximum resident set size to
1 GB (it is currently 1/2 GB). However, is there a more clever way of
doing this? (If I did this on my workstation, I would be limited to
1/8 GB or would swap like crazy.)
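
As an illustration of the kind of "more clever way" asked about above:
a large file can be sorted in bounded memory by splitting it into
chunks, sorting each chunk separately, and merging the sorted chunks
with sort -m. The sketch below is only that, a sketch. The function name
chunked_sort, its arguments, and the default chunk size are
hypothetical, and nothing in this message establishes that dbiflat can
be pointed at such a wrapper instead of its own sort call.

```shell
#!/bin/sh
# Hypothetical helper (not part of EMBOSS): sort INPUT on its first field
# without ever asking sort to hold more than LINES_PER_CHUNK lines at once.
# Usage: chunked_sort INPUT OUTPUT WORKDIR [LINES_PER_CHUNK]
chunked_sort() {
    input=$1
    output=$2
    workdir=$3
    lines=${4:-1000000}   # assumed default; tune to the memory available

    mkdir -p "$workdir" || return 1

    # Cut the input into pieces small enough to sort in memory.
    # -a 4 allows many more chunk files than the default 2-letter suffix.
    split -a 4 -l "$lines" "$input" "$workdir/chunk." || return 1

    # Sort each piece in place on the first field.
    for piece in "$workdir"/chunk.*; do
        sort -k1,1 -T "$workdir" -o "$piece" "$piece" || return 1
    done

    # Merge the already-sorted pieces; -m merges without re-sorting.
    sort -m -k1,1 -T "$workdir" -o "$output" "$workdir"/chunk.* || return 1

    rm -f "$workdir"/chunk.*
}
```

A call such as chunked_sort big.txt big.sorted /tmp_disk/scratch4/work
500000 would keep each individual sort small, at the cost of extra disk
space in the work directory. The file names in that call are made up for
the example.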

Details:

I configured using:
	./configure --prefix=/tools/emboss1.9.1 --disable-shared --with-x \
	    --with-pngdriver

	on an O2K running IRIX 6.5.10 with the MIPSpro 7.3.1.2 compilers

Any ideas??



As a side note: when I tried indexing all of GenBank, almost 60,000
sequences generated the following warning:



   This is a warning: Duplicate ID skipped: 'XXXXXXXX'

Is this an indication that the initial data needs to be cleaned up
first, or a non-issue?
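
One way to look into the duplicate warnings independently of dbiflat is
to list the entry names that occur more than once across the flat files.
The sketch below assumes that the ID in the warning corresponds to the
second field of each GenBank LOCUS line. That assumption is not
confirmed by anything in this message, and the function name
list_duplicate_ids is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of EMBOSS): print every entry name that
# appears on more than one LOCUS line in the given GenBank flat files.
# Assumes the entry name is the second whitespace-separated field.
# Usage: list_duplicate_ids FILE...
list_duplicate_ids() {
    # Only the names are passed to sort, so its input is far smaller
    # than the flat files themselves.
    awk '$1 == "LOCUS" { print $2 }' "$@" | sort | uniq -d
}
```

Running list_duplicate_ids /data/genbank/*.seq | wc -l would then give
a count to compare against the number of warnings reported.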

Thanks.





