[EMBOSS] Files included in EMBOSS but licensed ...

Chris Fields cjfields at illinois.edu
Sat Jul 30 19:42:19 UTC 2011

On Jul 30, 2011, at 6:36 AM, Adam Sjøgren wrote:

> On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote:
>> A specific example might help. About 5 years ago a release of the
>> UniProt database (as plain text files) broke the Wisconsin (GCG)
>> sequence analysis package.
> [...]
> This is the opposite problem of what I tried to sketch.
> Your example has closed source software that can't be fixed, leading to
> either preprocessing or changing the database rather than fixing the
> real problem.
> If the software had been free, you could just have fixed the software.
> Switch around "software" and "database", and you have the example I was
> trying to paint.

Yes, if the source had been available, fixing the parser would have been the best option. But I think you are missing the fundamental point that Peter made (and that you left out): the wording of the license allowed them to reformat the file without changing the actual content. I'm not sure, but I believe many GenPept documents are UniProt-derived and follow the same concept.

Data records and databases are not software, unless you are using some very fuzzy definition of such.

>> I expect there are many problems that arise if data ... and
>> documentation ... are considered to be software.
> Sure. The whole GFDL debate took quite a while, I think.
> But that doesn't change that one of the solutions outlined by Charles
> Plessy is necessary for Debian to distribute EMBOSS (and any other piece
> of free/redistributable software).

You'll also note Charles's distaste for the options mentioned.  He was also searching for alternatives.

>>> (I personally think it would make sense to change to a Creative Commons
>>> license that allows derivative works - Uniprot and others are going to
>>> be the canonical source for the data anyway, so nothing will be lost by
>>> them by doing that, as far as I can see.)
>> Unlikely. The no-derivatives version is specifically there to prevent
>> derivatives - for example Debian distributing a modified UniProt
>> without permission.
> What I was trying to say is that I don't think that that clause gives
> any value to the owners of Uniprot and other databases.
> Why would Uniprot want to prevent derivative works? They'll always be
> the canonical source for the correct information.

The links provided in my other response indicate some of the mindset behind this. I think the main point is that the work has to be attributed, and that any changes to such data need the permission of UniProt, likely so that any content changes can be curated and (possibly) propagated to future releases. This also ensures that a set of files from a third party carrying the UniProt name will not be modified (i.e. all content can be trusted as coming from UniProt without modification).

I have seen instances where loosely controlled data (such as annotation from a newly sequenced genome) becomes balkanized to the point that no one can clearly state who the trusted source is (even when the list of sources includes large databases such as NCBI/EBI). So I understand the reasoning for the license, but I also see that Science Commons is recommending something less strict.

> You are free to distribute a modified version of the man-page for ls(1)
> - but if you introduce errors in it or make it worse, nobody will choose
> your derived version.

That's a straw man argument; man page documentation for an app is not the same as a database record based on scientific data. Would you make the same argument (allow free content modification) for a scientific publication? I would, but only for corrections or for new data that support or contradict the original data, and even then it must go through some sort of mediation (an editor, for instance), not unlike what a database curator does.

>> The ontologies are similar, but do allow for the use case of importing
>> terms from one ontology into another if the ontology name is changed
>> (and preferably if cross-references to the original are provided).
>> Again, the need is to protect the integrity of the original ontology
>> content so references to a GO term or a UniProt entry are clearly
>> defined.
> I think the problem that is being protected against is non-existing.
> People don't want to break stuff that works, they want to be able to fix
> stuff that doesn't.

Simply opening the licensing up for any content modification doesn't solve the problem in the case of scientific databases; it potentially exacerbates it. Hence the variations in the licensing in the previous links I sent. By the way, if you think the classic 'vi vs emacs' arguments can get out of control, see what happens when you have competing groups trying to make changes to a sequence record without curation.

I do agree that it would be nice for the barrier to database modification to be lowered. Many attempts have been made at doing this, such as including third-party annotation, but with the major databases these efforts all seem to fall by the wayside, and the databases fall back to simple curation.

Maybe it's time to come up with a git/hg for biological data, where one could fork records and make changes for submission; at least there one could have a trusted source and easier paths to data modification.  Just a thought.

>> This is essential for many of the public bioinformatics databases.
> Why? Only a hypothetical derivative would be changed, not the original.
> If someone distributed a derivative that was broken, I think people
> would quickly abandon it.

How could one tell the difference if both versions are implied to come from UniProt (even if one comes from a third, fourth, or fifth party)? There is no guarantee beyond going back and comparing the records to the original UniProt data.
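
To make "going back and comparing" concrete, here is a minimal sketch in Python. It assumes both copies have already been parsed into dictionaries mapping a record ID to the record's text; the function names and the whitespace normalization are hypothetical choices for illustration, not anything the databases provide. Note that it only works if you hold the original data, which is rather the point.

```python
import hashlib


def record_digest(record_text: str) -> str:
    """Return a SHA-256 hex digest of a record, ignoring trailing whitespace."""
    # Normalization is an assumption: pure reformatting of line endings
    # should not count as a content change.
    normalized = "\n".join(
        line.rstrip() for line in record_text.strip().splitlines()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def modified_records(original: dict, redistributed: dict) -> list:
    """Return sorted IDs of redistributed records that differ from,
    or do not appear in, the original set."""
    return sorted(
        rec_id
        for rec_id, text in redistributed.items()
        if rec_id not in original
        or record_digest(original[rec_id]) != record_digest(text)
    )
```

Any ID this returns has either been altered in content or never existed in the original release; it says nothing about who made the change or why.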

> Again, just my point of view - not representing or speaking for anyone :-)
>  Best regards,
>    Adam


Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr., MC-195
Urbana, IL 61801
