[Open-bio-l] a common repository for test datasets/use cases for all Bio* projects

Giovanni Marco Dall'Olio dalloliogm at gmail.com
Thu Jan 15 11:21:49 UTC 2009


On Thu, Dec 4, 2008 at 6:06 PM, Jason Stajich <jason at bioperl.org> wrote:
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.
>
> We actually started a project like this many moons ago, but no one
> contributed examples...
>
> http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

For the moment I am putting some use cases in this repository:
- http://github.com/dalloliogm/bio-test-datasets-repository/tree/master

What I am doing, basically, is just collecting messages from the
biopython mailing list (I hope I am not doing anything illegal) and
problems encountered in our lab work, and putting them there.

If you give me access to the biodata CVS or wiki I can put them
there (although I would prefer a git repository). I don't have much
time to do more than this now... but over time I can improve many
things.

Well... let me just say this stupid thing now, or later it will be too late :)
I don't like the name 'biodata'... what about something like
'biotests' or 'biodatasets', or 'bio-test-datasets'?


>
> We can start a common SVN repository for this if you like or a github on OBF
> if that is more likely to garner contributions.
> In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like solution
> would be best for documentation.
> Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.
>
> -jason
>
> On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote:
>
>> Hi!
>> My name is Giovanni, I come from biopython's mailing list.
>>
>> I would like to make you a proposal.
>> Every module/program written in bioinformatics needs to be tested
>> before it can be used to produce results that can be published.
>>
>> For example, let's say I want to write another fasta file parser, like
>> SeqIO.FastaIO in biopython: I would have to test the script
>> against some real fasta files, just to make sure that it doesn't parse
>> them incorrectly or lose data.
>> Or, let's say I want to write a script to calculate Fst statistics
>> over some population genetics data: I will have to compare the results
>> of my script against other programs, check that it gives me the right
>> result for a dataset whose Fst value I already know, and maybe
>> devise some other kinds of checks to be sure my script doesn't do weird
>> things, like losing input data on the way.
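
A minimal sketch of the known-answer check described above. It uses
Wright's textbook definition Fst = (Ht - Hs) / Ht for a single biallelic
locus and assumes subpopulations of equal size; it is not the estimator
planned for the biopython module. The names wright_fst, KNOWN_CASES and
run_known_cases are made up for this illustration.

```python
def wright_fst(subpop_freqs):
    """Wright's Fst for one biallelic locus.

    subpop_freqs: frequency of one allele in each subpopulation.
    Assumption (for illustration only): all subpopulations have equal size.
    """
    if not subpop_freqs:
        raise ValueError("need at least one subpopulation")
    n = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / n
    # Expected heterozygosity of the pooled population.
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within the subpopulations.
    h_sub = sum(2 * p * (1 - p) for p in subpop_freqs) / n
    if h_total == 0:
        raise ValueError("locus is monomorphic; Fst is undefined")
    return (h_total - h_sub) / h_total


# Hypothetical known-answer cases that could be stored next to a dataset.
KNOWN_CASES = [
    ([0.5, 0.5], 0.0),     # identical subpopulations: no differentiation
    ([1.0, 0.0], 1.0),     # fixed for different alleles: full differentiation
    ([0.75, 0.25], 0.25),  # intermediate case, worked out by hand
]


def run_known_cases(fst_function, cases=KNOWN_CASES, tol=1e-9):
    """Return the cases for which fst_function gives the wrong answer."""
    failures = []
    for freqs, expected in cases:
        got = fst_function(freqs)
        if abs(got - expected) > tol:
            failures.append((freqs, expected, got))
    return failures
```

Calling run_known_cases(wright_fst) returns an empty list when every
case matches. Any other implementation taking the same input could be
passed in and checked against the same cases.
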
>>
>> So, the point is... what if we create a common repository for all this
>> kind of test data, to be shared among all the Bio* projects?
>> Wouldn't it be good if all the Bio* fasta parsers were able to parse the
>> same files and give the same results, showing that they either all work
>> fine or are all wrong in the same way?
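
As a rough sketch of the idea, a shared dataset could pair each input
file with the records every parser is expected to return. The file
content, the expected records, the toy parse_fasta function and the
check_parser helper below are all invented for this illustration; none
of them comes from the repository or from any Bio* project.

```python
import io

# Hypothetical shared test dataset: the file content plus the answers
# that every parser is expected to reproduce.
FASTA_TEXT = """>seq1 first test record
ACGTACGT
ACGT
>seq2 second test record
TTGGCC
"""
EXPECTED = [("seq1", "ACGTACGTACGT"), ("seq2", "TTGGCC")]


def parse_fasta(handle):
    """Toy parser yielding (identifier, sequence) pairs from a file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first word after '>'.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    # Emit the last record, if any.
    if name is not None:
        yield name, "".join(chunks)


def check_parser(parser, text=FASTA_TEXT, expected=EXPECTED):
    """True if the parser returns exactly the expected records, in order."""
    return list(parser(io.StringIO(text))) == expected
```

Here check_parser(parse_fasta) is true, while a parser that drops the
last record would fail the same check. A real parser from one of the
Bio* projects would need a small adapter that returns the same
(identifier, sequence) pairs.
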
>>
>> I am doing this because Tiago and I, on the biopython mailing list,
>> would like to develop a module to calculate Fst statistics over SNP
>> data, and there is no point in collecting some good test datasets and
>> not sharing them with similar projects in other programming languages.
>>
>> The same goes for much of the documentation, like use cases: if we
>> collect a good base of use cases related to bioinformatics, it would
>> be easier to coordinate the efforts of all the Bio* projects and
>> compare the different approaches used to solve the same issue by the
>> different communities.
>>
>> At the moment, I have created a simple git repository on github:
>> - http://github.com/dalloliogm/bio-test-datasets-repository
>> but it is still empty, and maybe github is not the ideal host for
>> such a project, since the free account has a 100MB space limit.
>>
>>
>> --
>> -----------------------------------------------------------
>>
>> My Blog on Bioinformatics (Italian): http://bioinfoblog.it
>> _______________________________________________
>> Open-Bio-l mailing list
>> Open-Bio-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>
> Jason Stajich
> jason at bioperl.org
>
>
>
>



-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it


