[Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)

Thu Dec 15 20:36:35 UTC 2011

Chris,

> Reason I ask, the various Bio* and EMBOSS projects have a share of old (and
> possibly duplicate) data examples, but it might be nice to standardize on a
> common set of records, simply for less data duplication.
>
> As an example, have a git repo of purely data or links to data that we could
> 'git submodule' in for code distribution, release, and testing purposes, but
> that wouldn't bloat the code repository.

It is debatable whether version control is necessary for this. Each sample
entry is a snapshot obtained from the data source, so there is only
ever one version of each file, and a file for each format version is
required for testing purposes anyway. So for a test data archive, plain
old FTP would be sufficient, with fetch scripts if required.

Since the historic data is most useful for compatibility testing and
the archives all have web services attached (e.g. EMBL-SVA and UniSave
are available through dbfetch, see
http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases), fetching
the required entries when/if necessary seems a more appropriate
approach. For example, I doubt many people need to test
compatibility with the Swiss-PROT 9.0 (1988) entry format
(http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002.1),
so there is little need to duplicate this data in every developer's
set-up. In contrast, users expect everything to be tested with the
current entry format, also available from UniSave
(http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002), but
the only way to be sure it is the current data is to fetch it from one
of the primary sources.
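
As a rough sketch of the fetch-when-needed idea, the two UniSave links
above can be built from an identifier and an optional version number.
The base URL and the "ID.N" version suffix are copied from those links;
the function names are hypothetical, and it is an assumption that other
dbfetch databases accept the same suffix.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL copied from the dbfetch links in the message above.
DBFETCH_BASE = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db: str, entry_id: str, version: Optional[int] = None) -> str:
    """Build a dbfetch URL; a version is appended to the identifier as 'ID.N'.

    The '.N' suffix mirrors the UniSave example (P00002.1); treating it as
    valid for other databases is an assumption.
    """
    if version is not None:
        entry_id = f"{entry_id}.{version}"
    return f"{DBFETCH_BASE}?{urlencode({'db': db, 'id': entry_id})}"


def fetch_entry(db: str, entry_id: str, version: Optional[int] = None,
                timeout: float = 30.0) -> str:
    """Download one entry as text. Needs network access to the service."""
    url = dbfetch_url(db, entry_id, version)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Old-format snapshot versus whatever the archive currently serves.
    print(dbfetch_url("unisave", "P00002", 1))
    print(dbfetch_url("unisave", "P00002"))
```

A test suite could call fetch_entry("unisave", "P00002", 1) only when a
compatibility test for the old format actually runs, rather than
carrying the file in every checkout.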

Given that the major databases implement versioning schemes, fetching
specific versions of entries is simple. For less well known databases,
databases which are no longer active, or application results (the evils
of NCBI BLAST output parsing come to mind), having a standard set would
be useful. However, in the case of the databases, the data resources may
provide appropriate mechanisms to fetch this data, so building fixed
sets may be unnecessary for anything other than caching. The exception
is large data sets, in which case you are going to want them
to be optional anyway, and you certainly don't want to put them under
version control.

So for the databases with archives, I would tend towards just keeping
identifier lists for representative entries and fetching the required
entry flavours when/if required. This prevents duplication, ensures
current-data tests run against the current data, and provides the
option of shipping the fetch script instead of the data for cases
where copyright licensing is an issue. For the rest, a collection of
static files on an FTP/web site would have it covered.
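
To make the identifier-list idea concrete, here is a minimal sketch of a
fetch script with a local cache. The list format (one "db id" pair per
line, '#' for comments), the cache file naming, and all function names
are assumptions made for illustration; only the dbfetch URL pattern
comes from the links earlier in this message.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple
from urllib.request import urlopen

# Base URL copied from the dbfetch links earlier in the message.
DBFETCH_BASE = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"

Entry = Tuple[str, str]  # (database, identifier with optional ".version")


def parse_id_list(lines: Iterable[str]) -> List[Entry]:
    """Parse 'db id' pairs, skipping blank lines and '#' comments.

    This list format is hypothetical, not an existing convention.
    """
    entries = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if line:
            db, entry_id = line.split()
            entries.append((db, entry_id))
    return entries


def dbfetch(db: str, entry_id: str) -> str:
    """Default fetcher: download one entry via dbfetch (needs network)."""
    url = f"{DBFETCH_BASE}?db={db}&id={entry_id}"
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def ensure_cached(entries: Iterable[Entry], cache_dir: str,
                  fetch: Callable[[str, str], str] = dbfetch) -> List[Path]:
    """Fetch entries missing from cache_dir; return the paths of all entries."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    paths = []
    for db, entry_id in entries:
        path = cache / f"{db}_{entry_id}.txt"
        if not path.exists():  # only fetch entries not already cached
            path.write_text(fetch(db, entry_id), encoding="utf-8")
        paths.append(path)
    return paths
```

Only the identifier list and the script would need to be shipped.
Versioned identifiers such as P00002.1 are safe to cache indefinitely,
whereas an unversioned identifier should be refetched (for example by
clearing the cache) whenever a test is meant to run against the current
data.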

Hamish

> On 12/15/2011 12:01 PM, Hamish McWilliam wrote:
>>
>> Hi Chris,
>>
>>> That might be the best source to pull from.  Does it archive old file
>>> examples (such as older SwissProt/GenBank/EMBL)?
>>
>> EDAM itself does not store entry data, and at the moment it does not
>> describe the changes to formats over time, although I'm sure this
>> could be added along with links to sample entries in the various data
>> archives.
>>
>> If you only need a few sample entries, see the appropriate database
>> archive:
>>
>> - EMBL-Bank Sequence Version Archive (EMBL-SVA):
>> http://www.ebi.ac.uk/cgi-bin/sva/sva.pl.
>> E.g. http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/?query=V00077&search=Go
>> - UniProtKB Sequence/Annotation Version Archive (UniSave):
>> http://www.ebi.ac.uk/uniprot/unisave/
>> E.g. http://www.ebi.ac.uk/uniprot/unisave/?query=P00002&search=Go
>> - NCBI Entrez Revision History.
>> E.g. http://www.ncbi.nlm.nih.gov/nuccore/V00077?report=girevhist
>>
>> If you need more entries...
>>
>> For Swiss-PROT and UniProtKB, old versions of the data are available on
>> the FTP sites, for example from EMBL-EBI:
>> - ftp://ftp.ebi.ac.uk/pub/databases/uniprot/previous_releases/
>> - ftp://ftp.ebi.ac.uk/pub/databases/swissprot/sw_old_releases/
>>
>> For GenBank, Don Gilbert collected various old releases a while back
>> (http://www.bio.net/bionet/mm/genbankb/2006-October/000251.html),
>> these are available via the BioMirrors (http://www.bio-mirror.net/).
>> NCBI may also be able to provide old releases on request.
>>
>> For EMBL-Bank, old releases can be made available on request; contact
>> ENA (http://www.ebi.ac.uk/ena/about/contact) for more information.
>>
>> All the best,
>>
>> Hamish
>>
>>> chris
>>>
>>> On Nov 30, 2011, at 8:49 AM, Peter Cock wrote:
>>>
>>>> I just checked with Jon and he was happy to forward this back to
>>>> the list, and also added a couple of URLs that I'd asked about:
>>>>
>>>> http://bioportal.bioontology.org/ontologies/44600
>>>> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM
>>>>
>>>> Peter
>>>>
>>>> On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison <jison at ebi.ac.uk> wrote:
>>>>>
>>>>> Hi Peter (and Peter)
>>>>>
>>>>> Just a quick note to say that all (well, nearly all) common
>>>>> bioinformatics data formats are
>>>>> catalogued in the EDAM ontology:
>>>>>
>>>>> http://sourceforge.net/projects/edamontology/files
>>>>> http://edamontology.sourceforge.net/
>>>>>
>>>>> OK - there's bound to be some we've missed :)
>>>>>
>>>>> Anyhow, I thought it might help to structure any effort to document
>>>>> data formats (an effort which
>>>>> I wholeheartedly approve of by the way).  One thing I'd like to add to
>>>>> the EDAM "format"
>>>>> definitions is a link to the format specification, or failing that, an
>>>>> example.
>>>>>
>>>>> Cheers both
>>>>>
>>>>> Jon

-- 
----
"Saying the internet has changed dramatically over the last five years
is cliché – the internet is always changing dramatically" - Craig
Labovitz, Arbor Networks.