[Open-bio-l] OBDA redux?

Fri Nov 18 10:55:48 UTC 2011

On 18/11/11 11.20, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Fri, Nov 18, 2011 at 9:35 AM, Raoul Bonnal wrote:
>> Dear all,
>> Would it be possible to have a test dataset and clear requirements
>> and functionality? Not a huge doc, just a few points for benchmarking.
> 
> I was thinking of using the UniProt SProt and TrEMBL datasets
> as test cases (FASTA, plain text "swiss", and UniProt-XML format).
> These have 532,792 and 17,651,715 records respectively (in the
> version I have on disk - they've just released an update), which is
> a good size, but not at the scale where we might start to worry
> about SQLite scaling.
> ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/
> 
> So, we'd also want something else, like some big FASTQ files with
> 100M to 500M records (or more). Perhaps we'll have to combine a
> couple of SRA data files together for that, which is fine.
> 
> Also a full GenBank download would be good, e.g. the EST dataset
> files gbest1.seq.gz to gbest209.seq.gz would make a good test of
> indexing multiple files together as a single database:
> ftp://ftp.ncbi.nih.gov/genbank/
> 
It's a starting point.

And what information do you want to extract once you have your index?
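
To make the question concrete, here is a minimal sketch of one possible
answer: the index only maps a record ID to the file it lives in plus a byte
offset and length, so the raw record can be fetched without parsing. It is
written for FASTA only and assumes every header line carries a non-empty,
unique ID. The names build_index and get_raw and the table layout are
hypothetical, made up for illustration; they are not taken from the OBDA
spec or from any Bio* project.

```python
import sqlite3


def build_index(db_path, fasta_paths):
    """Index all FASTA records in several files as one SQLite database.

    Stores ID -> (file, byte offset, byte length). Hypothetical schema.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE files (file_id INTEGER PRIMARY KEY, path TEXT)")
    con.execute(
        "CREATE TABLE records (key TEXT PRIMARY KEY, file_id INTEGER, "
        "offset INTEGER, length INTEGER)"
    )
    for file_id, path in enumerate(fasta_paths):
        con.execute("INSERT INTO files VALUES (?, ?)", (file_id, path))
        rows = []
        key, start, offset = None, 0, 0
        # Binary mode so that offsets are true byte positions.
        with open(path, "rb") as handle:
            for line in handle:
                if line.startswith(b">"):
                    if key is not None:
                        # Close the previous record at the current offset.
                        rows.append((key, file_id, start, offset - start))
                    # ID is the first word after ">" (assumed non-empty).
                    key = line[1:].split(None, 1)[0].decode()
                    start = offset
                offset += len(line)
        if key is not None:
            # Last record runs to the end of the file.
            rows.append((key, file_id, start, offset - start))
        con.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return con


def get_raw(con, key):
    """Return the raw bytes of one record, looked up by ID."""
    row = con.execute(
        "SELECT path, offset, length FROM records "
        "JOIN files USING (file_id) WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        raise KeyError(key)
    path, offset, length = row
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

With something like this, the only things extracted from the index are the
file, offset and length; anything more (description, length of sequence,
cross-references) would need extra columns, which is why I am asking.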

--
Ra