[EMBOSS] dbxflat and size of index files

ajb at ebi.ac.uk
Wed Oct 31 22:07:24 UTC 2007


Hello Jérôme,

Yes, it is normal. It is a combination of three things: first, the
index is a tree structure; second, the tree isn't tightly packed; and
third, 64-bit pointers are used throughout. The first allows
on-the-fly updating of the index, the second is for speed of
construction and updating, and the third is obvious. Another
consideration is that, in some cases, the indexes are trees-of-trees
to allow duplicate codes to be indexed (e.g. keywords).
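
To give a feel for how the second and third factors compound, here is a
rough back-of-the-envelope sketch. It is not the actual dbxflat on-disk
layout: the page size, key length, fill factor and number of keys are
all made-up figures, and the function name is hypothetical. It only
counts leaf pages of a generic paged tree.

```python
import math


def estimate_index_bytes(n_keys, key_bytes, pointer_bytes, page_bytes, fill_factor):
    """Rough size in bytes of the leaf level of a generic paged tree index.

    All parameters are illustrative assumptions, not dbxflat settings.
    """
    entry_bytes = key_bytes + pointer_bytes          # one key plus one file offset
    slots_per_page = page_bytes // entry_bytes       # entries that fit in a full page
    used_slots = max(1, int(slots_per_page * fill_factor))  # entries actually stored
    n_pages = math.ceil(n_keys / used_slots)
    return n_pages * page_bytes


# Hypothetical database: 5 million identifiers of up to 16 bytes each.
n_keys = 5_000_000

# Tightly packed pages with 32-bit offsets.
packed_32 = estimate_index_bytes(n_keys, key_bytes=16, pointer_bytes=4,
                                 page_bytes=2048, fill_factor=1.0)

# Half-full pages (room for later insertions) with 64-bit offsets.
loose_64 = estimate_index_bytes(n_keys, key_bytes=16, pointer_bytes=8,
                                page_bytes=2048, fill_factor=0.5)

print(f"packed, 32-bit pointers: {packed_32 / 1e6:.0f} MB")
print(f"loose,  64-bit pointers: {loose_64 / 1e6:.0f} MB")
print(f"ratio: {loose_64 / packed_32:.2f}x")
```

Internal tree pages and the secondary trees used for duplicate codes
would add to this, so the sketch understates the real difference.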

Coincidentally, I'm on the lookout for new indexing algorithms at the
moment, so if you have a favourite one, we're always open to
suggestions.

Alan


> Hello,
>
> I use dbxflat to index uniprot (sprot and trembl) flat files, whose
> sizes are 1.2 G for sprot and 11 G for trembl. The resulting index
> files are amazingly huge: 11 G. Is this normal?
>
> Another example with Genbank flat files: the division gbsts has a
> size of 3.3 G. Indexing with dbxflat gives 6.8 G of index files, but
> dbiflat gives only 199 M of index files. I know it's not necessary
> to index genbank flat files with dbxflat because each individual file
> is not bigger than 300 M. I did this just for the demonstration.
>
> Apart from this, everything is working very well.
>
> Thank you in advance.
>
>
> Jérôme Laroche
>
> Centre de bioinformatique et de biologie computationnelle
> Université Laval




