[EMBOSS] Files included in EMBOSS but licensed ...

Chris Fields cjfields at illinois.edu
Sat Jul 30 19:01:58 UTC 2011

On Jul 30, 2011, at 3:58 AM, Peter Rice wrote:

> Quoted in full for the benefit of the debian-med list who missed the original posting
> On 29/07/2011 21:35, Adam Sjøgren wrote:
>> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote:
>>> It might make things clearer if someone from Debian could explain:
>> (I am not from Debian, but here is my take on it anyway:)
>>> (a) why a Creative Commons licence is an issue for you
>> One of the fundamental software freedoms is the freedom to change the
>> software¹.
>> The Debian Free Software Guidelines' definition of free software
>> includes this freedom².
>> So the "No Derivatives" variants of the Creative Commons licenses aren't
>> free by the DFSG definition.
>> (The GNU Free Documentation License on documents with invariant sections
>> is considered non-free by DFSG-standards as well, even if the invariant
>> sections are things that nobody would want to change.)
>> When a project of volunteers packages 29000+ packages, I think
>> making a judgement call on whether it is okay that the license of a
>> couple of files does not live up to the guidelines is nigh impossible.
>> The answer to "Why would you want to?" is, because you might need to.
>> It is more obvious with programs and code than it is with database
>> entries, granted - but I guess the equivalent problem would be that the
>> licensor didn't want to fix a problem in such a database, and that
>> problem made the programs using it malfunction. It would be a pain if
>> you weren't allowed to fix the problem and distribute the fixed data
>> yourself, say, if "upstream" didn't want to include the fix for some
>> reason or another; maybe they happened to turn sour on the world/you -
>> stranger things have happened.
>> So, nobody is probably ever going to exercise that freedom in this
>> specific case, I think, but ignoring some of the freedoms in special
>> cases is infeasible for a project such as Debian.
>> This is just me trying to explain how I understand it, so take it with a
>> grain of salt, and swing by debian-legal³ for the experts.
> A specific example might help. About 5 years ago a release of the UniProt database (as plain text files) broke the Wisconsin (GCG) sequence analysis package. They introduced extremely long lines in a data file whose lines everyone had assumed were at most 80 characters long.
> As GCG was closed source, the fix required a change to the UniProt files to either wrap or truncate the 'offending' records.
> The fix was not to distribute a change to the data of course, but to write and distribute a simple Perl script that wrapped the long records.
> That was not a licensing issue - the content stays the same, the format is changed, no changed data is distributed. But it does illustrate that the database licensing does not prevent 'fixing' a database.
>>> (b) why you appear to consider a copy of a whole or part of a public
>>> biological database as part of an "operating system"
>> They are part of a package which is included in the Debian GNU/Linux
>> free operating system.
> I expect there are many problems that arise if data ... and documentation ... are considered to be software. For EMBOSS we didn't officially specify a license for the documentation but other packages probably do. It still worries me that some of our documentation files officially include GPL licensed (EMBOSS) source code but I did not like any of the alternative documentation licenses.

I don't understand the logic behind why data would be considered software, unless one is using a very fuzzy definition of 'software'.  Is this strictly a packaging issue, e.g. any data packaged with source makes it 'software'?  Or just the fact that such data is licensed?  Would a package of just data/docs (no code) be allowed?

>> (I personally think it would make sense to change to a Creative Commons
>> license that allows derivative works - Uniprot and others are going to
>> be the canonical source for the data anyway, so nothing will be lost by
>> them by doing that, as far as I can see.)
> Unlikely. The no-derivatives version is specifically there to prevent derivatives - for example Debian distributing a modified UniProt without permission.
> The ontologies are similar, but do allow for the use case of importing terms from one ontology into another if the ontology name is changed (and preferably if cross-references to the original are provided). Again, the need is to protect the integrity of the original ontology content so references to a GO term or a UniProt entry are clearly defined.
> This is essential for many of the public bioinformatics databases. Data and software are not the same in this context. I am curious whether documentation licensing raises any issues.
> Just my 2c worth
> Peter Rice

Maybe the best solution is to just package any data separately?  We have talked about setting up a 'biodata' repository for common datasets from all the Bio* projects.

Feel free to skip the rest of this, but:


I agree with Peter's point: Uniprot and other databases license data this way for very good (and well-intentioned) reasons. For the Bio* languages there are instances where we use such data as a fallback in case a newer version isn't immediately available (REBase and SO come to mind, and I think we have others), so we are likely in the same boat as EMBOSS.

I had a long screed here, but I found some original sources for the discussion re: Uniprot and use of Creative Commons licensing that state the reasoning for why this is in place:


Note there is now a 'Database Protocol' (last link) that recommends a different license; that page nicely summarizes the history of the whole Creative Commons licensing affair and the issues of using a Creative Commons license re: databases, mainly due to the issue Peter mentioned above, that databases != software.  Uniprot doesn't use this as of yet (so it doesn't solve the problem at hand), but it's possible this may change.



Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr., MC-195
Urbana, IL 61801
