[Biopython-dev] OBO parser & DAG

Wed Jan 8 21:59:00 UTC 2014

So it seems like we are debating minimal external dependencies vs.
maximizing functionality.

How about the following: the OBO file will be read into an independent,
basic digraph like Bartek's team has already constructed.

But we will also have the ability to transfer the biopython DAG into a
networkx DAG, so that anyone wishing to play elaborate games with the
ontology structure (as we do), can do so without re-inventing the wheel.
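
A minimal sketch of that arrangement, assuming a hypothetical SimpleDiGraph
class (not the actual class written by Bartek's team) and a hypothetical
to_networkx() helper. networkx is imported only inside the converter, so the
parser side carries no hard dependency on it:

```python
class SimpleDiGraph:
    """Hypothetical dependency-free digraph; edges carry a relation label."""

    def __init__(self):
        self.nodes = {}   # term id -> dict of term attributes
        self.edges = {}   # child id -> list of (parent id, relation) pairs

    def add_node(self, term_id, **attrs):
        self.nodes.setdefault(term_id, {}).update(attrs)
        self.edges.setdefault(term_id, [])

    def add_edge(self, child, parent, relation="is_a"):
        self.add_node(child)
        self.add_node(parent)
        self.edges[child].append((parent, relation))


def to_networkx(graph):
    """Copy a SimpleDiGraph into a networkx.DiGraph (networkx is optional)."""
    try:
        import networkx as nx
    except ImportError:
        raise ImportError("networkx is needed only for this conversion")
    nx_graph = nx.DiGraph()
    for term_id, attrs in graph.nodes.items():
        nx_graph.add_node(term_id, **attrs)
    for child, parents in graph.edges.items():
        for parent, relation in parents:
            # A plain DiGraph keeps one edge per (child, parent) pair, so a
            # second relation between the same two terms would overwrite it.
            nx_graph.add_edge(child, parent, relation=relation)
    return nx_graph


# Placeholder term ids, not real ontology terms.
dag = SimpleDiGraph()
dag.add_node("TERM:child", name="child term")
dag.add_edge("TERM:child", "TERM:parent", relation="is_a")
```

Anyone who has networkx installed could then call to_networkx(dag) and use
the full nx toolbox; everyone else only ever touches the basic digraph.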

How does that sound?

One thing about networkx: I still really, really like it :), and we started
writing the digraph based on it because of this page:

http://biopython.org/wiki/Gene_Ontology#GO_Directed_Acyclic_Graph

But the fact that this spec using networkx has been written does not have
to commit us to this particular design.

On Wed, Jan 8, 2014 at 4:05 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:

> Hi,
>
> I'll answer below (even though I do have a bad habit of top-posting my
> answers, sorry).
>
>
> On Wed, Jan 8, 2014 at 5:55 PM, Iddo Friedberg <idoerg at gmail.com> wrote:
>
> >I wrote about not using networkx as the main data structure for ontologies
>
>>
>> I understand your rationale, but I disagree with it, mainly for design
>> reasons.
>>
> Ok. I guess I was a bit too brief with my explanation. We considered
> using networkx and decided against it, mainly because it was not very
> useful for our purposes, and implementing what was necessary for parsing
> was not an issue for Kamil. Networkx is currently not a dependency for
> biopython or for Bio.Phylo, and it is not even listed as "optional
> software" along with reportlab and such on http://biopython.org/wiki/Download
> (I guess it should be, after Eric's comment). My understanding is that this
> reflects a policy of thinking twice before adding a dependency, because we
> would need to care about compatibility with different networkx versions.
> Taking Bio.Phylo as an example, I do think it is a good policy to keep such
> libraries optional if possible. Besides, we did not particularly like
> the way digraphs are implemented in networkx, with heavy use of
> dictionaries, as this might get slow for large dictionaries.
> Now to your specific points:
>
>
>> 1. Enrichment analysis is only one of many different applications that
>> can be performed with GO. Therefore, saying that features are unnecessary
>> because a particular use case does not require them should not be a design
>> consideration for a module that is intended for general use. Rather, a
>> generic package handling ontologies should be just that: generic, and
>> disengaged from any kind of application. Therefore, if your package is
>> intended for biopython the use-case (enrichment analysis) should be
>> decoupled from the parser + data structure.
>>
> We were obviously tailoring this to our needs, but I have to disagree
> with your argument. For the reasons above, I think that we should
> use an external digraph library _only_ if it is _necessary_ for parsing
> and storing, and it clearly isn't.
>
> As for the separation of parsing and enrichment, we do want to keep the
> parser separate from the enrichment analysis. I thought that was quite
> clear from the use of separate classes, but we are absolutely open to
> discussing how to organize these modules.
>
> If you take the parser as a separate module, networkx is even less
> needed (there is really no need for a big graph manipulation lib here).
>
>
>> 2. The graph features that you wrote in Digraph exist in networkx anyway,
>> or am I missing something? So why not take advantage of nx instead of
>> redoing it even if it does have many redundant (for you) graph manipulation
>> & diagnostic features? Someone else may want to use these features,
>> including the graphics nx provides, etc.
>>
>
> Yes, the point is that parsing and storing a digraph is a simple thing and
> there is no need to add a large library for that. If there were a digraph
> library in biopython, it would be stupid not to use it, but I don't feel we
> need to add a dependency here.
>
>
>>
>>> However it would be very easy to make functions for converting our
>>> ontologies to networkx digraphs, either with or without gene annotations as
>>> additional attributes.
>>>
>>>
>> Well, the idea is actually to maintain ontologies as nx digraphs. Yes, I
>> agree there.
>>
>>
> That's also exactly what is done in Bio.Phylo. We are planning to
> write a function analogous to Bio.Phylo._utils.to_networkx() which would
> take a simple digraph obtained from parsing an OBO file and give you a
> networkx digraph with all the data for manipulation.
>
>
>
>>> As for support for different types of transitivity in relations of
>>> different type (as in your inference of ancestry for is_a and part_of
>>> relations) we are currently not supporting it, but after thinking about it,
>>> we will make a change to support this feature. Probably we will let the
>>> user (optionally) define the transitivity between relationship types
>>> (i.e. is_a + part_of becomes part_of, etc).
>>>
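
A rough sketch of how user-defined transitivity between relationship types
could look. The inferred_ancestors() function and the COMPOSE table are
hypothetical names used only for illustration. Only the rule "is_a + part_of
becomes part_of" is taken from the message above; the remaining table entries
are assumptions:

```python
# Composition rules: (relation so far, next edge relation) -> inferred relation.
# Only "is_a + part_of becomes part_of" comes from the message above; the
# other entries are assumptions added to make the example complete.
COMPOSE = {
    ("is_a", "is_a"): "is_a",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
    ("part_of", "part_of"): "part_of",
}


def inferred_ancestors(edges, term, compose=COMPOSE):
    """Return {ancestor: set of inferred relations} for `term`.

    `edges` maps a child id to a list of (parent id, relation) pairs.
    Relation pairs missing from `compose` are not propagated.
    """
    found = {}
    stack = list(edges.get(term, []))
    while stack:
        node, relation = stack.pop()
        if relation in found.setdefault(node, set()):
            continue  # this (node, relation) pair was already expanded
        found[node].add(relation)
        for parent, edge_relation in edges.get(node, []):
            combined = compose.get((relation, edge_relation))
            if combined is not None:
                stack.append((parent, combined))
    return found


# Toy graph with placeholder ids: a is_a b, b part_of c, c is_a d.
edges = {
    "a": [("b", "is_a")],
    "b": [("c", "part_of")],
    "c": [("d", "is_a")],
}
ancestors = inferred_ancestors(edges, "a")
```

Passing a different table as `compose` is one way the transitivity could be
left to the user, as suggested above.
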
>>> In general, it would be very helpful if you could give us some rough
>>> idea about your expected use cases. For example: are you expecting to
>>> modify the graphs in the networkx objects? What will you use the inferred
>>> ancestor lists for? So that the changes we make will be as useful to the
>>> community as possible.
>>>
>>
>>
>> The idea is that expected use cases should not impact the design of a
>> basic parser + data structure. In my lab, we are looking at inferred
>> ancestors lists to calculate semantic similarity, but it really doesn't
>> matter what we (or anyone) will end up using the GO module for. If you
>> provide enrichment analysis *on top* of the parser + data structure (as a
>> separate module), and we provide semantic similarity (again as a separate
>> module *on top* of the parser + data structure) those are nice bonuses. But
>> the parser + data structure should be as general as possible. That is:
>> include all the information in the OBO file, placed in a digraph structure
>> that can be comprehensively interrogated, visualized and manipulated (which
>> is what nx offers).
>>
> I was unfortunately not very clear here. What I meant was that we
> considered what is necessary for typical uses of ontologies, namely parsing
> and accessing the terms. I think that is valid in the sense that the
> majority of users treat ontologies as read-only data (not that many
> biopython users are making their own ontologies; otherwise, it would have
> been implemented ages ago...).
>
> As for the second argument: I fully agree that there should be some
> separation between ontology and annotation reading and any functionality
> "on top" of it. But I do not think it would be reasonable to include
> networkx as the main data structure. Currently there is only one
> library that biopython depends on, and it is numpy. I do not see networkx as
> equally important. I think that we should go the way paved by Bio.Phylo
> and use the simple digraph (which already holds all the information from
> the OBO files, afaik) for parsing output and convert it to networkx where
> necessary.
>
> best
> Bartek
> --
> Bartek Wilczynski
> ==================
> Institute of Informatics
> University of Warsaw
> http://www.mimuw.edu.pl/~bartek
>

-- 
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.