[EMBOSS] Files included in EMBOSS but licensed ...

Chris Fields cjfields at illinois.edu
Sat Jul 30 19:01:58 UTC 2011


On Jul 30, 2011, at 3:58 AM, Peter Rice wrote:

> Quoted in full for the benefit of the debian-med list who missed the original posting
> 
> On 29/07/2011 21:35, Adam Sjøgren wrote:
>> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote:
>> 
>>> It might make things clearer if someone from Debian could explain:
>> 
>> (I am not from Debian, but here is my take on it anyway:)
>> 
>>> (a) why a Creative Commons licence is an issue for you
>> 
>> One of the fundamental software freedoms is the freedom to change the
>> software¹.
>> 
>> The Debian Free Software Guidelines' definition of free software
>> includes this freedom².
>> 
>> So the "No Derivatives" variants of the Creative Commons licenses aren't
>> free by the DFSG definition.
>> 
>> (The GNU Free Documentation License on documents with invariant sections
>> is considered non-free by DFSG-standards as well, even if the invariant
>> sections are things that nobody would want to change.)
>> 
>> When a project of volunteers packages 29000+ packages, I think
>> making a judgement call on whether it is okay that the license of a
>> couple of files does not live up to the guidelines is nigh impossible.
> 
>> The answer to "Why would you want to?" is, because you might need to.
>> 
>> It is more obvious with programs and code than it is with database
>> entries, granted - but I guess the equivalent problem would be that the
>> licensor didn't want to fix a problem in such a database, and that
>> problem made the programs using it malfunction. It would be a pain if
>> you weren't allowed to fix the problem and distribute the fixed data
>> yourself, say, if "upstream" didn't want to include the fix for some
>> reason or another; maybe they happened to turn sour on the world/you -
>> stranger things have happened.
>> 
>> So, nobody is probably ever going to exercise that freedom in this
>> specific case, I think, but ignoring some of the freedoms in special
>> cases is infeasible for a project such as Debian.
>> 
>> This is just me trying to explain how I understand it, so take it with a
>> grain of salt, and swing by debian-legal³ for the experts.
> 
> A specific example might help. About 5 years ago a release of the UniProt database (as plain text files) broke the Wisconsin (GCG) sequence analysis package. The release introduced extremely long lines in a data file whose lines everyone had assumed were at most 80 characters long.
> 
> As GCG was closed source, the fix required a change to the UniProt files to either wrap or truncate the 'offending' records.
> 
> The fix was not to distribute a change to the data of course, but to write and distribute a simple perl script that wrapped the long records.
> 
> That was not a licensing issue - the content stays the same, the format is changed, no changed data is distributed. But it does illustrate that the database licensing does not prevent 'fixing' a database.
> 
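
For illustration only, a wrapper of the sort described above can be very small. The sketch below is in Python rather than Perl and is not the original script; the function name is hypothetical, the 80-character width is taken from the description above, and it hard-wraps at a fixed column without any knowledge of the UniProt flat-file line codes, which a real fix would have to preserve.

```python
def wrap_long_lines(lines, width=80):
    """Yield the input lines, splitting any line longer than `width`.

    Hypothetical helper: the content is unchanged, only the line
    breaks move, so joining the pieces gives back the original text.
    """
    for line in lines:
        text = line.rstrip("\n")
        if not text:
            # Keep empty lines as they are.
            yield ""
            continue
        # Cut the line into consecutive chunks of at most `width` characters.
        for start in range(0, len(text), width):
            yield text[start:start + width]
```

Run over the lines of a data file and written back out, this changes only the layout, which is the distinction being drawn above between reformatting and distributing modified data.
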
>>> (b) why you appear to consider a copy of a whole or part of a public
>>> biological database as part of an "operating system"
>> 
>> They are part of a package which is included in the Debian GNU/Linux
>> free operating system.
> 
> I expect there are many problems that arise if data ... and documentation ... are considered to be software. For EMBOSS we didn't officially specify a license for the documentation but other packages probably do. It still worries me that some of our documentation files officially include GPL licensed (EMBOSS) source code but I did not like any of the alternative documentation licenses.

I don't understand the logic behind why data would be considered software, unless one is using a very fuzzy definition of 'software'.  Is this strictly a packaging issue, e.g. any data packaged with source makes it 'software'?  Or just the fact that such data is licensed?  Would a package of just data/docs (no code) be allowed?

>> (I personally think it would make sense to change to a Creative Commons
>> license that allows derivative works - Uniprot and others are going to
>> be the canonical source for the data anyway, so nothing will be lost by
>> them by doing that, as far as I can see.)
> 
> Unlikely. The no-derivatives version is specifically there to prevent derivatives - for example Debian distributing a modified UniProt without permission.
> 
> The ontologies are similar, but do allow for the use case of importing terms from one ontology into another if the ontology name is changed (and preferably if cross-references to the original are provided). Again, the need is to protect the integrity of the original ontology content so references to a GO term or a UniProt entry are clearly defined.
> 
> This is essential for many of the public bioinformatics databases. Data and software are not the same in this context. I am curious whether documentation licensing raises any issues.
> 
> Just my 2c worth
> 
> Peter Rice
> EMBOSS Team


Maybe the best solution is to just package any data separately?  We have talked about setting up a 'biodata' repository for common datasets from all the Bio* projects.

Feel free to skip the rest of this, but:

<my_2c>

I agree with Peter's point: Uniprot and other databases license data this way for very good (and well-intentioned) reasons. For the Bio* languages there are instances where we use such data as a fallback in case a newer version isn't immediately available (REBase and SO come to mind, and I think we have others), so we are likely in the same boat as EMBOSS.

I had a long screed here, but I found some original sources for the discussion re: Uniprot and use of Creative Commons licensing that state the reasoning for why this is in place:

http://wiki.creativecommons.org/Case_Studies/Uniprot
http://eric.jain.name/2006/02/07/uniprot-creative-commons/
http://sciencecommons.org/resources/faq/databases/
http://sciencecommons.org/resources/faq/database-protocol/

Note there is now a 'Database Protocol' (last link) that recommends a different license. That page nicely summarizes the history of the whole Creative Commons licensing affair and the problems with applying a Creative Commons license to databases, mainly due to the issue Peter mentioned above: databases != software.  Uniprot doesn't use this as of yet (so it doesn't solve the problem at hand), but it's possible this may change.

</my_2c>

chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr., MC-195
Urbana, IL 61801




