[Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)

Hamish McWilliam hamish.mcwilliam at bioinfo-user.org.uk
Tue Dec 20 11:27:12 UTC 2011


Hi Chris,

>>> Reason I ask, the various Bio* and EMBOSS projects have a share of old
>>> (and possibly duplicate) data examples, but it might be nice to
>>> standardize on a common set of records, simply for less data duplication.
>>>
>>> As an example, have a git repo of purely data or links to data that we
>>> could 'git submodule' in for code distribution, release, and testing
>>> purposes, but that wouldn't bloat the code repository.
>>
>> It is debatable whether version control is necessary for this: each
>> sample entry is a snapshot obtained from the data source, so there is
>> only ever one version of each file, and a file for each format version
>> is required for testing purposes anyway. So for a test data archive
>> plain old FTP would be sufficient, with fetch scripts if required.
>
> I agree; that or a combination of the two where appropriate.  Caveats below.
>
>> Since the historic data is most useful for compatibility testing, and
>> the archives all have web services attached (e.g. EMBL-SVA and UniSave
>> are available through dbfetch, see
>> http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases),
>> fetching the required entries when/if necessary seems a more
>> appropriate approach. For example, I doubt that many people need to
>> test compatibility with the Swiss-PROT 9.0 (1988) entry format
>> (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002.1),
>> so there is little need to duplicate this data in every developer's
>> set-up. In contrast, users expect everything to be tested with the
>> current entry format, also available from UniSave
>> (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002),
>> but the only way to be sure it is the current data is to fetch it
>> from one of the primary sources.
>>
>> Given that the major databases implement versioning schemes, fetching
>> specific versions of entries is simple. For less well known databases,
>> databases which are no longer active, or application results (the
>> evils of NCBI BLAST output parsing come to mind), having a standard
>> set would be useful. However, in the case of the databases, the data
>> resources may provide appropriate mechanisms to fetch this data, so
>> building fixed sets may be unnecessary for anything other than
>> caching. Unless you are talking large data sets, in which case you are
>> going to want them to be optional anyway, and you certainly don't want
>> to put them under version control.
>
> The key concerning word there is 'may', and I hesitate to rely on the
> certainty that some web services will be available indefinitely.  A recent
> talk I attended (I believe at the last Galaxy Conference) mentioned the
> percentage of published web services that have persistent URLs over the
> years (e.g. found in the same location or are redirected to a new location).
>  The number is depressingly low; I want to say 25-40%, but I'm not sure.

One of the things to remember here is that the web services which can
be used to fetch the data depend on usage to justify their funding and
thus their future. Not using these services makes it more difficult
for the service provider(s) to justify the resource costs involved in
maintaining and running the services. For example: the support for
EMBL-Bank entry names (which were removed in EMBL-Bank release 87,
June 2006) in EBI dbfetch (e.g.
http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=embl&id=BUM&format=embl&style=raw)
is present because the entry name 'BUM' is used in the test and
example code for Bio::DB::EMBL; this feature is very rarely used
elsewhere.
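To make the fetch-on-demand side of this concrete, here is a minimal
Python sketch of pulling a single entry from dbfetch (the URL pattern is
the dbfetch one above; the function names are illustrative, not from any
of the Bio* projects):

```python
# Minimal sketch: fetch one entry from EBI dbfetch on demand instead of
# shipping a static copy. Function names are illustrative only.
from urllib.parse import urlencode
from urllib.request import urlopen

DBFETCH = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"

def dbfetch_url(db, entry_id, fmt="default", style="raw"):
    """Build a dbfetch request URL for a single entry."""
    query = {"db": db, "id": entry_id, "format": fmt, "style": style}
    return DBFETCH + "?" + urlencode(query)

def fetch_entry(db, entry_id, fmt="default"):
    """Fetch one entry as text; callers would normally cache this."""
    with urlopen(dbfetch_url(db, entry_id, fmt)) as response:
        return response.read().decode("utf-8")

# e.g. the EMBL-Bank entry name example above:
#   fetch_entry("embl", "BUM", "embl")
```

A script like this also keeps the entry identifier, rather than the
data, as the only thing under version control.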

It is well accepted that there is a lot of churn in the world of web
services, and the causes for this are well known. The various web
service registry projects (see
http://www.ebi.ac.uk/Tools/webservices/tutorials/05_registries for a
selection) attempt to provide some form of tracking, and a method for
finding replacement services. Projects such as ELIXIR
(http://www.elixir-europe.org/) hope to address some of the causes.

> Even for major databases, we have had web service apps move or disappear,
> just fixed one related to NCBI revision history.  In most cases there are
> notifications, but not always.

All the major providers give notification of changes and, where
possible, attempt to provide backward compatibility and transition
periods to allow users to migrate. Things could be clearer, of course,
and there will always be some services which exhibit a higher rate of
change, with or without notice. PhD research projects come to mind as
a type of service that often changes rapidly, then sees a short period
of stability, followed by either unavailability or a long tail of no
maintenance.

> Just from that perspective alone, I find static files to be a nice fallback.

As a fallback it may be a suitable option (this is covered by my
earlier comment about caching); however, I would try to avoid
duplicating the data from the start, and only provide static copies if
an alternative source is not available.
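As a sketch of that cache-then-fetch arrangement (Python; the helper
names are hypothetical, and any fetch script could be passed in as
`fetch`):

```python
# Sketch: prefer fresh data from the primary source, refreshing a local
# static copy as we go; fall back to that copy if the service is down.
import os

def get_entry(entry_id, cache_dir, fetch):
    """Return entry text; `fetch` is any callable hitting the source."""
    path = os.path.join(cache_dir, entry_id + ".txt")
    try:
        data = fetch(entry_id)           # primary source: current data
        with open(path, "w") as handle:  # refresh the static fallback
            handle.write(data)
        return data
    except OSError:                      # service unreachable
        with open(path) as handle:       # use the cached static copy
            return handle.read()
```

This way current-data tests still run against current data whenever the
source is up, and the static copy only ever serves as a fallback.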

>> So for the databases with archives I would tend towards just keeping
>> identifier lists for representative entries and fetching the required
>> entry flavours when/if required. This prevents duplication, ensures
>> current data tests are against the current data, and provides the
>> option of shipping the fetch script instead of the data for cases
>> where copyright licensing is an issue. For the rest a collection of
>> static files on an FTP/web site would have it covered.
>
> It's not an unreasonable expectation that some parsers would need support
> for both old and new ('old' being something within a sane time period).

Having recently done some work on the patent side of things, I am no
longer sure that there is such a thing as a "sane time period". For
that community there is a requirement to be able to access all data
since the beginning of time, or at least to make a reasonable attempt
to get hold of that data, in order to judge whether a patent
application can be granted or a grant overturned. However, they are a
special case... for most purposes I suspect that coverage of the last
5 years is probably enough, as long as it is clear for each release
what the data support interval is. I'm sure there will be a few cases
where a longer interval is necessary. For example: EMBL-Bank changed
the format of the ID line in release 87 (June 2006), but many of the
other databases using the EMBL-Bank format (e.g. IMGT/HLA) have not
switched to the new format.
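As an illustration of what supporting both ID line flavours involves, a
Python sketch (the regular expressions and the example entries in the
test are illustrative, not taken from a real parser):

```python
# Sketch: distinguish old (pre-release-87) and new EMBL-style ID lines
# so each can be routed to the right parsing rules. Illustrative only.
import re

# New style (EMBL-Bank release 87 on): "ID   acc; SV n; topology; ..."
NEW_ID = re.compile(r"^ID   \S+;\s+SV\s+\d+;")
# Old style: "ID   entryname  standard; molecule; division; length BP."
OLD_ID = re.compile(r"^ID   \S+\s+standard;")

def id_line_style(line):
    """Classify an EMBL-format ID line as 'new', 'old' or 'unknown'."""
    if NEW_ID.match(line):
        return "new"
    if OLD_ID.match(line):
        return "old"
    return "unknown"
```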

> Regardless, most (all?) OBF projects have been around long enough this isn't
> often an issue, but there are data formats that rapidly evolve (NCBI BLAST
> text being an example).  Even that is a bit of an exception, as NCBI has
> long recommended not relying on text parsing as being stable as they reserve
> the right to add changes that may break things (so for users caveat emptor).

NCBI BLAST has long been a favourite example of this, and NCBI say
that the ASN.1 or the XML output should be used if you want to parse
it. Given the intermediate format support in NCBI BLAST+ this is now
less of an issue, since obtaining multiple formats from the results of
a single search is possible in the standalone version as well as in
the web service(s), so you can have the text report alongside a
parseable representation if required. There are still many cases,
though, where the text report has to be used and multi-version support
is required. For example, MView (http://bio-mview.sourceforge.net/)
supports many different versions of the BLAST text format, since those
files tend to be what users have saved from their searches, and the
search may have been run some years back.
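For illustration, the first thing a multi-version text parser like that
has to do is work out which report flavour it has been handed. A Python
sketch of that sniffing step (the regular expression is illustrative,
based on the conventional first line of a report such as
"BLASTP 2.2.26+"):

```python
# Sketch: identify the program and version of a saved BLAST text report
# from its header line, so per-version parsing rules can be selected.
import re

HEADER = re.compile(r"^(T?BLAST[NPX])\s+(\d+\.\d+\.\d+)(\+?)")

def sniff_blast_report(text):
    """Return (program, version, is_plus) or None if not recognised."""
    match = HEADER.match(text.lstrip())
    if match is None:
        return None
    program, version, plus = match.groups()
    return program, version, plus == "+"
```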

It also depends on how detailed your parsing is: the formats have
tended to change slowly in structure, but the specific details often
change rapidly. This has effects when performing some types of format
conversion or data verification checks.

All the best,

Hamish

> chris
>
>>> On 12/15/2011 12:01 PM, Hamish McWilliam wrote:
>>>>
>>>> Hi Chris,
>>>>
>>>>> That might be the best source to pull from.  Does it archive old file
>>>>> examples (such as older SwissProt/GenBank/EMBL)?
>>>>
>>>> EDAM itself does not store entry data, and at the moment it does not
>>>> describe the changes to formats over time, although I'm sure this
>>>> could be added along with links to sample entries in the various data
>>>> archives.
>>>>
>>>> If you only need a few sample entries, see the appropriate database
>>>> archive:
>>>>
>>>> - EMBL-Bank Sequence Version Archive (EMBL-SVA):
>>>> http://www.ebi.ac.uk/cgi-bin/sva/sva.pl.
>>>> E.g. http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/?query=V00077&search=Go
>>>> - UniProtKB Sequence/Annotation Version Archive (UniSave):
>>>> http://www.ebi.ac.uk/uniprot/unisave/
>>>> E.g. http://www.ebi.ac.uk/uniprot/unisave/?query=P00002&search=Go
>>>> - NCBI Entrez Revision History.
>>>> E.g. http://www.ncbi.nlm.nih.gov/nuccore/V00077?report=girevhist
>>>>
>>>> If you need more entries...
>>>>
>>>> For Swiss-PROT and UniProtKB old versions of the data are available on
>>>> the FTP sites, for example from EMBL-EBI:
>>>> - ftp://ftp.ebi.ac.uk/pub/databases/uniprot/previous_releases/
>>>> - ftp://ftp.ebi.ac.uk/pub/databases/swissprot/sw_old_releases/
>>>>
>>>> For GenBank, Don Gilbert collected various old releases a while back
>>>> (http://www.bio.net/bionet/mm/genbankb/2006-October/000251.html),
>>>> these are available via the BioMirrors (http://www.bio-mirror.net/).
>>>> NCBI may also be able to provide old releases on request.
>>>>
>>>> For EMBL-Bank old releases can be made available on request, contact
>>>> ENA (http://www.ebi.ac.uk/ena/about/contact) for more information.
>>>>
>>>> All the best,
>>>>
>>>> Hamish
>>>>
>>>>> chris
>>>>>
>>>>> On Nov 30, 2011, at 8:49 AM, Peter Cock wrote:
>>>>>
>>>>>> I just checked with Jon and he was happy to forward this back to
>>>>>> the list, and also added a couple of URLs that I'd asked about:
>>>>>>
>>>>>> http://bioportal.bioontology.org/ontologies/44600
>>>>>> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison<jison at ebi.ac.uk>    wrote:
>>>>>>>
>>>>>>> Hi Peter (and Peter)
>>>>>>>
>>>>>>> Just a quick note to say that all (well, nearly all) common
>>>>>>> bioinformatics data formats are
>>>>>>> catalogued in the EDAM ontology:
>>>>>>>
>>>>>>> http://sourceforge.net/projects/edamontology/files
>>>>>>> http://edamontology.sourceforge.net/
>>>>>>>
>>>>>>> OK - there's bound to be some we've missed :)
>>>>>>>
>>>>>>> Anyhow, I thought it might help to structure any effort to document
>>>>>>> data formats (an effort which I wholeheartedly approve of by the
>>>>>>> way).  One thing I'd like to add to the EDAM "format" definitions
>>>>>>> is a link to the format specification, or failing that, an example.
>>>>>>>
>>>>>>> Cheers both
>>>>>>>
>>>>>>> Jon
>>>>
>>>> _______________________________________________
>>>>
>>>> Open-Bio-l mailing list
>>>> Open-Bio-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>>>
>>>
>>
>>
>



-- 
----
"Saying the internet has changed dramatically over the last five years
is cliché – the internet is always changing dramatically" - Craig
Labovitz, Arbor Networks.



