Getting database sizes for indexed databases

David Martin d.m.a.martin at dundee.ac.uk
Thu Mar 6 15:24:34 UTC 2003


On 6/3/03 3:13 pm, "Peter Rice" <pmr at ebi.ac.uk> wrote:

> David Martin wrote:
>> Is there an easy way to determine the size (ie number of sequences) in a
>> database?
>> Could such information be added to showdb?
>> 
>> I am thinking of emblcd indexed databases. It must be possible to count the
>> number of sequences in each file indexed and then use the file: wild cards
>> to build a total for that database.
> 
> For EMBLCD databases it can be read from the index files (the number is
> in the header).

It can if the database definition matches the files indexed.

For example, I have swiss, trembl and trembl_new all indexed together as sptr.

I use the same index files for sw (swiss only) and trembl
(trembl+trembl_new)


Will the index file give the total for all three, or the total for each file?

If it is the total for each file then the true count for subdivisions of the
database can be found by matching the file: definition to the list of files
in the header.
If it is the global total then there is no direct way of determining the
database size without specifically capturing that with the indexing program.
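
To illustrate the per-file case, here is a rough Python sketch. It assumes the
per-file counts have already been read from the index somehow; the file names,
the counts and the database_size helper are all made up for illustration, and
nothing here reflects the real index header layout or how EMBOSS expands file:
wildcards.

```python
from fnmatch import fnmatchcase

# Hypothetical per-file sequence counts, as if already taken from the index.
# File names and numbers are invented for this example.
per_file_counts = {
    "swiss.dat": 120000,
    "trembl.dat": 800000,
    "trembl_new.dat": 50000,
}


def database_size(counts, patterns):
    """Sum the counts of every file whose name matches any wildcard pattern."""
    return sum(
        n
        for name, n in counts.items()
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    )


# One shared index, three database definitions with different file: wildcards.
sw_total = database_size(per_file_counts, ["swiss.dat"])
trembl_total = database_size(per_file_counts, ["trembl*.dat"])
sptr_total = database_size(per_file_counts, ["*.dat"])
```

With only a global total in the index, a subdivision like this would not be
possible, which is the second case above.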

What is the header format for the index files?

..d

> 
> For SRS databases a simple query can return the count.
> 
> For complex cases there may be no answer - but we can either write a
> short message, or add an attribute to the database definition in
> emboss.default.
> 
> So ... is this a useful addition to showdb?
> 
> Peter
> 
> 

-- 
David Martin PhD
Bioinformatics Scientific Officer
Post-Genomics and Molecular Interactions Centre
University of Dundee



