[Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)

Peter Cock p.j.a.cock at googlemail.com
Wed Nov 30 11:42:22 UTC 2011


On Wed, Nov 30, 2011 at 11:38 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 11/30/2011 11:32 AM, Pjotr Prins wrote:
>
>> Git is not very good for storing large data files, which we would want
>> to fetch partially. My suggestion would be to have a plain old file
>> repo, e.g. on S3, which can be mirrored by others.
>
> We had issues with large files in the EMBOSS release, and make those
> available via rsync to add to the developers CVS checkout. They include the
> NCBI taxonomy source and index files and the ontology source and index
> files.
>
> The next EMBOSS release will include http and ftp URLs as valid inputs for
> any data type, so EMBOSS could use remote files for format tests. I'll look
> into how other repositories could be added.
>
> I had to add some extra qualifiers to allow queries and offsets to be
> specified, and rewrote the query language parsing to merge very similar code
> segments.
>
> regards,
>
> Peter Rice
> EMBOSS Team

How about an OBF-hosted FTP site then, if we want big data?
I guess we'd mostly be adding files, and changes/deletions
should be rare, so a full version-tracking repository isn't
essential if we are disciplined about updating README files
or more formal metadata.
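
As a rough illustration of what "more formal metadata" could mean for a
plain file repository, here is a minimal sketch that records the relative
path, size and SHA-256 checksum of every file under a directory. The
MANIFEST.tsv name, the tab-separated layout and the function names are all
assumptions made up for this example, not an agreed convention:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, manifest_name="MANIFEST.tsv"):
    """List (relative path, size in bytes, SHA-256) for every file under root.

    Files named like the manifest itself are skipped, so the listing
    does not change just because a manifest has been written.
    """
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            rel = path.relative_to(root).as_posix()
            rows.append((rel, path.stat().st_size, sha256_of(path)))
    return rows


def write_manifest(root, manifest_name="MANIFEST.tsv"):
    """Write the manifest as a tab-separated file at the top of root."""
    rows = build_manifest(root, manifest_name)
    lines = ["%s\t%d\t%s" % row for row in rows]
    out = Path(root) / manifest_name
    out.write_text("\n".join(lines) + "\n")
    return out
```

Regenerating the listing whenever files are added would let a mirror check
its copy, and a changed checksum on an existing path would flag one of the
(hopefully rare) modifications.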

Peter
