[Open-bio-l] OBDA redux?

Peter Cock p.j.a.cock at googlemail.com
Fri Nov 18 10:20:54 UTC 2011


On Fri, Nov 18, 2011 at 9:35 AM, Raoul Bonnal wrote:
> Dear all,
> Would it be possible to have a test dataset and a clear list of
> requirements and functionality? Not a huge doc, just a few points
> for benchmarking.

I was thinking of using the UniProt SProt and TrEMBL datasets
as test cases (FASTA, plain text "swiss", and UniProt-XML format).
These have 532,792 and 17,651,715 records respectively (in the
version I have on disk; they've just released an update), which is
a good size, but not yet on the scale where we might start to worry
about SQLite scaling.
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/

So we'd also want something else, such as some big FASTQ files with
100M to 500M records (or more). Perhaps we'll have to combine a
couple of SRA data files for that, which is fine.
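
As a rough illustration of combining files, here is a minimal Python
sketch that concatenates several FASTQ files into one and counts the
records written. It assumes strict four-line FASTQ records (no wrapped
sequence lines) and prefixes each read name with a per-file tag so the
names stay unique as index keys. The function names and the "f1_"
style tag are hypothetical, not part of any existing tool.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file for binary reading."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def combine_fastq(paths, out_path):
    """Concatenate FASTQ files into out_path; return the record count.

    Assumes four lines per record. Read names get a hypothetical
    per-file prefix (f1_, f2_, ...) so they remain unique.
    """
    count = 0
    with open(out_path, "wb") as out:
        for number, path in enumerate(paths, start=1):
            tag = b"f%d_" % number
            with open_maybe_gzip(path) as handle:
                for i, line in enumerate(handle):
                    # Guard against a missing newline at the end of a file.
                    if not line.endswith(b"\n"):
                        line += b"\n"
                    if i % 4 == 0:
                        # First line of each record is the "@name" title line.
                        if not line.startswith(b"@"):
                            raise ValueError("%s: line %d is not a title line"
                                             % (path, i + 1))
                        line = b"@" + tag + line[1:]
                        count += 1
                    out.write(line)
    return count
```

The returned count gives a quick check of whether the combined file
reaches the 100M to 500M record range.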

A full GenBank download would also be good. For example, the EST
dataset files gbest1.seq.gz to gbest209.seq.gz would make a good
test of indexing multiple files together as a single database:
ftp://ftp.ncbi.nih.gov/genbank/
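
To make "multiple files, one database" concrete, here is a minimal
sketch using only the Python standard library. It records the byte
offset and length of each record in one SQLite file. The table layout
and the function names are assumptions for illustration, not an agreed
OBDA schema. It also assumes the files have been decompressed first
(byte offsets into a .gz file are not directly seekable) and uses the
name on the LOCUS line as the key.

```python
import sqlite3


def build_index(db_path, filenames):
    """Index every LOCUS record in the given files into one SQLite database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE files (file_id INTEGER PRIMARY KEY, name TEXT)")
    con.execute(
        "CREATE TABLE records (key TEXT PRIMARY KEY, file_id INTEGER, "
        "offset INTEGER, length INTEGER)"
    )
    for file_id, name in enumerate(filenames):
        con.execute("INSERT INTO files VALUES (?, ?)", (file_id, name))
        rows = []
        key, start, offset = None, 0, 0
        with open(name, "rb") as handle:
            for line in handle:
                if line.startswith(b"LOCUS "):
                    # A new record begins, so close off the previous one.
                    if key is not None:
                        rows.append((key, file_id, start, offset - start))
                    key, start = line.split()[1].decode("ascii"), offset
                offset += len(line)
        # The last record runs to the end of the file.
        if key is not None:
            rows.append((key, file_id, start, offset - start))
        con.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return con


def fetch_raw(con, key):
    """Return the raw bytes of one record, whichever file it lives in."""
    row = con.execute(
        "SELECT name, offset, length FROM records "
        "JOIN files USING (file_id) WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        raise KeyError(key)
    name, offset, length = row
    with open(name, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

Any header text before the first LOCUS line in a file is skipped, and
a duplicate key across files raises sqlite3.IntegrityError because of
the primary key, which is one simple way to surface clashes.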

Peter


