[Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)

Fri Dec 16 21:12:45 UTC 2011

On 12/15/2011 02:36 PM, Hamish McWilliam wrote:
> Chris,
>
>> Reason I ask, the various Bio* and EMBOSS projects have a share of old (and
>> possibly duplicate) data examples, but it might be nice to standardize on a
>> common set of records, simply for less data duplication.
>>
>> As an example, have a git repo of purely data or links to data that we could
>> 'git submodule' in for code distribution, release, and testing purposes, but
>> that wouldn't bloat the code repository.
> It is debatable whether version control is necessary for this: each
> sample entry is a snapshot obtained from the data source, so there is
> only ever one version of each file, and a file for each format version
> is required for testing purposes anyway. So for a test data archive,
> plain old FTP would be sufficient, with fetch scripts if required.
I agree; that or a combination of the two where appropriate.  Caveats below.
> Since the historic data is most useful for compatibility testing and
> the archives all have web services attached (e.g. EMBL-SVA and UniSave
> are available through dbfetch, see
> http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases), fetching
> the required entries when/if necessary seems a more appropriate
> approach. For example, I doubt many people need to test
> compatibility with the Swiss-PROT 9.0 (1988) entry format
> (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002.1),
> so there is little need to duplicate this data in every developer's
> set-up. In contrast, users expect everything to be tested with the
> current entry format, also available from UniSave
> (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002), but
> the only way to be sure it is the current data is to fetch it from one
> of the primary sources.
>
> Given that the major databases implement versioning schemes, fetching
> specific versions of entries is simple. For less well-known databases,
> databases which are no longer active, or application results (the evils
> of NCBI BLAST output parsing come to mind), having a standard set would
> be useful. However, in the case of the databases, the data resources may
> provide appropriate mechanisms to fetch this data, so building fixed
> sets may be unnecessary for anything other than caching. Unless you
> are talking about large data sets, in which case you are going to want
> them to be optional anyway, and you certainly don't want to put them
> under version control.
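
To make the "fetch a specific version" idea concrete, here is a minimal sketch that builds dbfetch URLs following the pattern of the two UniSave links quoted above. The helper name `dbfetch_url` is hypothetical, and treating a `.N` suffix on the identifier as the version selector is an assumption read off those example URLs, not a description of the full dbfetch interface.

```python
from urllib.parse import urlencode

# Base URL taken from the dbfetch links quoted above.
DBFETCH_BASE = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, entry_id, version=None):
    """Build a dbfetch URL for one entry (hypothetical helper).

    Assumption: appending ".N" to the identifier selects version N,
    as in the quoted example id=P00002.1; no suffix means the
    current entry.
    """
    if version is not None:
        entry_id = f"{entry_id}.{version}"
    return f"{DBFETCH_BASE}?{urlencode({'db': db, 'id': entry_id})}"


if __name__ == "__main__":
    print(dbfetch_url("unisave", "P00002", version=1))  # old entry format
    print(dbfetch_url("unisave", "P00002"))             # current entry
```
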
The key word of concern there is 'may', and I hesitate to rely on
web services remaining available indefinitely.  A recent talk I
attended (I believe at the last Galaxy Conference) mentioned the
percentage of published web services that keep persistent URLs over the
years (e.g. are found in the same location or are redirected to a new
location).  The number is depressingly low; I want to say 25-40%, but I
am not sure.

Even for major databases, we have had web service apps move or
disappear; I just fixed one related to NCBI revision history.  In most
cases there are notifications, but not always.

Just from that perspective alone, I find static files to be a nice fallback.
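
A minimal sketch of the fallback idea: try the web service first, and use a static local copy if the request fails. The names `fetch_with_fallback`, `http_fetch`, and the `fetcher` parameter are hypothetical, and the layout of the static files is left to the caller; this is an illustration under those assumptions, not an existing Bio* or EMBOSS utility.

```python
from pathlib import Path
from urllib.request import urlopen


def http_fetch(url, timeout=30):
    """Default fetcher: return the response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(url, static_path, fetcher=http_fetch):
    """Return (data, source), where source is "web" or "static".

    Hypothetical helper: tries the web service and, on a network
    error, falls back to a static snapshot kept with the test data.
    """
    try:
        return fetcher(url), "web"
    except OSError:
        # urllib's URLError/HTTPError and socket timeouts are all
        # OSError subclasses: the service moved, vanished, or is down.
        return Path(static_path).read_bytes(), "static"
```

The fetcher is passed in so that tests can simulate an unavailable service without touching the network. If the static file is missing as well, the `FileNotFoundError` propagates, which seems the right failure for a test suite.
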
> So for the databases with archives I would tend towards just keeping
> identifier lists for representative entries and fetching the required
> entry flavours when/if required. This prevents duplication, ensures
> current data tests are against the current data, and provides the
> option of shipping the fetch script instead of the data for cases
> where copyright licensing is an issue. For the rest a collection of
> static files on an FTP/web site would have it covered.
>
> Hamish
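
The identifier-list approach could be sketched as follows: a small plain-text list of representative entries, parsed into records that a fetch script can act on. The list format (whitespace-separated database, identifier, and optional version, with `#` comments) and the names `EntryRef` and `parse_id_list` are assumptions made for illustration; nothing in this thread defines such a format.

```python
from typing import NamedTuple, Optional


class EntryRef(NamedTuple):
    db: str
    entry_id: str
    version: Optional[str]  # None means "current entry"


def parse_id_list(text):
    """Parse an identifier list into EntryRef records (assumed format)."""
    refs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue  # skip blank and comment-only lines
        fields = line.split()
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 'db id [version]': {line!r}")
        version = fields[2] if len(fields) == 3 else None
        refs.append(EntryRef(fields[0], fields[1], version))
    return refs


# Example list using the UniSave entry mentioned in this thread.
EXAMPLE = """\
# db       id      version
unisave    P00002  1      # old entry format
unisave    P00002         # current entry format
"""

if __name__ == "__main__":
    for ref in parse_id_list(EXAMPLE):
        print(ref)
```

Only the list would need to be distributed; the entries themselves are fetched when a test asks for them, which is what keeps the copyright question out of the repository.
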
It's not an unreasonable expectation that some parsers would need
support for both old and new formats ('old' being something within a
sane time period).

Regardless, most (all?) OBF projects have been around long enough that
this isn't often an issue, but there are data formats that evolve
rapidly (NCBI BLAST text being an example).  Even that is a bit of an
exception, as NCBI has long recommended against relying on the text
output being stable, since they reserve the right to make changes that
may break things (so for users, caveat emptor).

chris
>> On 12/15/2011 12:01 PM, Hamish McWilliam wrote:
>>> Hi Chris,
>>>
>>>> That might be the best source to pull from.  Does it archive old file
>>>> examples (such as older SwissProt/GenBank/EMBL)?
>>> EDAM itself does not store entry data, and at the moment it does not
>>> describe the changes to formats over time, although I'm sure this
>>> could be added along with links to sample entries in the various data
>>> archives.
>>>
>>> If you only need a few sample entries, see the appropriate database
>>> archive:
>>>
>>> - EMBL-Bank Sequence Version Archive (EMBL-SVA):
>>> http://www.ebi.ac.uk/cgi-bin/sva/sva.pl.
>>> E.g. http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/?query=V00077&search=Go
>>> - UniProtKB Sequence/Annotation Version Archive (UniSave):
>>> http://www.ebi.ac.uk/uniprot/unisave/
>>> E.g. http://www.ebi.ac.uk/uniprot/unisave/?query=P00002&search=Go
>>> - NCBI Entrez Revision History.
>>> E.g. http://www.ncbi.nlm.nih.gov/nuccore/V00077?report=girevhist
>>>
>>> If you need more entries...
>>>
>>> For Swiss-PROT and UniProtKB, old versions of the data are available on
>>> the FTP sites, for example from EMBL-EBI:
>>> - ftp://ftp.ebi.ac.uk/pub/databases/uniprot/previous_releases/
>>> - ftp://ftp.ebi.ac.uk/pub/databases/swissprot/sw_old_releases/
>>>
>>> For GenBank, Don Gilbert collected various old releases a while back
>>> (http://www.bio.net/bionet/mm/genbankb/2006-October/000251.html),
>>> these are available via the BioMirrors (http://www.bio-mirror.net/).
>>> NCBI may also be able to provide old releases on request.
>>>
>>> For EMBL-Bank, old releases can be made available on request; contact
>>> ENA (http://www.ebi.ac.uk/ena/about/contact) for more information.
>>>
>>> All the best,
>>>
>>> Hamish
>>>
>>>> chris
>>>>
>>>> On Nov 30, 2011, at 8:49 AM, Peter Cock wrote:
>>>>
>>>>> I just checked with Jon and he was happy to forward this back to
>>>>> the list, and also added a couple of URLs that I'd asked about:
>>>>>
>>>>> http://bioportal.bioontology.org/ontologies/44600
>>>>> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM
>>>>>
>>>>> Peter
>>>>>
>>>>> On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison <jison at ebi.ac.uk> wrote:
>>>>>> Hi Peter (and Peter)
>>>>>>
>>>>>> Just a quick note to say that all (well, nearly all) common
>>>>>> bioinformatics data formats are catalogued in the EDAM ontology:
>>>>>>
>>>>>> http://sourceforge.net/projects/edamontology/files
>>>>>> http://edamontology.sourceforge.net/
>>>>>>
>>>>>> OK - there's bound to be some we've missed :)
>>>>>>
>>>>>> Anyhow, I thought it might help to structure any effort to document
>>>>>> data formats (an effort which I wholeheartedly approve of, by the
>>>>>> way).  One thing I'd like to add to the EDAM "format" definitions is
>>>>>> a link to the format specification, or failing that, an example.
>>>>>>
>>>>>> Cheers both
>>>>>>
>>>>>> Jon