[EMBOSS] error with dbxflat

Hamish McWilliam hpm at ebi.ac.uk
Thu Jul 17 10:58:51 UTC 2014


Hi Jean,
> I ran into problem while processing the most recent refseq nt database
> with dbxflat. The script has been running for years without problem
> until this release 65.
>
> dbxflat -dbname=refseqnt -dbresource=embl -idformat=gb -directory='.'
> -fields='id,acc' -filenames='complete*.*.gbff' -indexoutdir='.'
> -release=65 -date=05/19/14
> -outfile=/usr/local/emboss/logs/dbxflat_refseqnt_log
> Index a flat file database using b+tree indices
> Warning: id 'NZ_CAAF010000001' too long, truncating to idlen 15
>
>     EMBOSS An error in ajindex.c at line 1325:
> ReadBucket: Bucket too full
>
> Any help will be appreciated. Thank you.

I suspect this is a case where the truncation of the identifier means
that many entries are being associated with the same identifier, which
in turn causes part of the index structure (a bucket) to contain too
many entries.
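
To illustrate the suspected effect, here is a minimal Python sketch. It
is not how ajindex.c works internally; it only shows how cutting
identifiers at 15 characters can collapse distinct entries onto one key.
The helper name and the sample identifiers (other than the one in your
warning) are made up for illustration.

```python
from collections import Counter

def truncate_ids(ids, idlen=15):
    """Cut each identifier to at most idlen characters."""
    return [identifier[:idlen] for identifier in ids]

# Hypothetical 16-character identifiers that differ only in the last digit,
# modelled on 'NZ_CAAF010000001' from the warning message.
ids = ["NZ_CAAF01000000%d" % n for n in range(1, 10)]

# After truncation every identifier maps to the same 15-character key.
counts = Counter(truncate_ids(ids))
print(counts)
```

All nine identifiers end up under the single key 'NZ_CAAF01000000'.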

The warning, and the associated truncation, is caused by the resource
definition containing maximum length limits that are lower than the
lengths of the tokens found in the data. From the use of the 'embl'
name, I am guessing that you are using the old resource definition for
EMBL-Bank from "$EMBOSS_ROOT/share/EMBOSS/emboss.default.template"
(commented out in the version shipped with EMBOSS 6.6.0):

#RES embl [ type: Index
#  idlen:  15
#  acclen: 15
#  svlen:  15
#  keylen: 15
#  deslen: 15
#  orglen: 15
#]

FYI, more recent definitions can be found in
"$EMBOSS_ROOT/share/EMBOSS/emboss.standard". Sadly, none of these appear
to have been configured for RefSeq.

The resource definition (RES or RESOURCE) needs to be configured for the
database being indexed with dbx (see
http://emboss.open-bio.org/html/adm/ch04s05.html#d0e12053). In the case
of RefSeq you will need larger maximum values for 'idlen', 'acclen' and
'svlen'. You may also want to increase the values for the other fields
if dbxflat reports truncation warnings for them. As far as I can
remember, the longest RefSeq identifiers are present in the NZ section,
so you will want to start there.
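
One way to pick the limits is to measure the data. The Python sketch
below is only a rough aid, and it assumes that the 'id', 'acc' and 'sv'
fields correspond to the LOCUS name, the ACCESSION tokens and the first
VERSION token of a GenBank-format entry. The function name and the
sample record are made up for illustration.

```python
def max_token_lengths(lines):
    """Return the longest LOCUS name, ACCESSION token and VERSION token."""
    longest = {"idlen": 0, "acclen": 0, "svlen": 0}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        if line.startswith("LOCUS"):
            # Second token on the LOCUS line is the entry name.
            longest["idlen"] = max(longest["idlen"], len(parts[1]))
        elif line.startswith("ACCESSION"):
            # An ACCESSION line can list several accessions.
            longest["acclen"] = max(longest["acclen"],
                                    max(len(p) for p in parts[1:]))
        elif line.startswith("VERSION"):
            # First token after VERSION is accession.version.
            longest["svlen"] = max(longest["svlen"], len(parts[1]))
    return longest

# Hypothetical record header, for illustration only.
sample = [
    "LOCUS       NZ_CAAF010000001     1234 bp    DNA     linear   CON",
    "ACCESSION   NZ_CAAF010000001 NZ_CAAF010000000",
    "VERSION     NZ_CAAF010000001.1",
]
lengths = max_token_lengths(sample)
print(lengths)
```

For real data, pass an open file handle for each '.gbff' file instead of
the sample list and keep the largest value seen for each field.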

Once you have added an appropriate resource definition to your EMBOSS
configuration, try again using the name of the resource for the
'-dbresource' parameter.
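
As a sketch of what such a definition might look like, the Python
snippet below prints a RES block in the same layout as the 'embl' one
quoted above. The resource name 'refseq' and all of the length values
are placeholders I have made up, not tested recommendations; choose
values that are at least as long as the longest tokens in your data.

```python
def render_resource(name, lengths):
    """Format an Index resource definition in the emboss.default layout."""
    fields = ("idlen", "acclen", "svlen", "keylen", "deslen", "orglen")
    lines = ["RES %s [ type: Index" % name]
    for field in fields:
        lines.append("  %s: %d" % (field, lengths[field]))
    lines.append("]")
    return "\n".join(lines)

# Placeholder limits (assumptions); replace with values measured from the data.
placeholder_lengths = {
    "idlen": 20, "acclen": 20, "svlen": 25,
    "keylen": 15, "deslen": 15, "orglen": 15,
}
resource_text = render_resource("refseq", placeholder_lengths)
print(resource_text)
```

With a resource named 'refseq' in the configuration, the dbxflat command
would then use '-dbresource=refseq' in place of '-dbresource=embl'.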

If you still encounter the error after creating an appropriately
configured resource for the data, try isolating the problem by indexing
the RefSeq data in sections.

All the best,

Hamish
-- 
============================================================
Mr Hamish McWilliam,
Web Production,
European Bioinformatics Institute (EMBL-EBI),
European Molecular Biology Laboratory,
Wellcome Trust Genome Campus,
Hinxton, Cambridge, CB10 1SD
United Kingdom

URL: http://www.ebi.ac.uk/
============================================================


