[Biopython-dev] KEGG API Wrapper

Fri Oct 26 02:52:42 UTC 2012

Hi Kevin,

Thanks for the documentation! That makes everything a lot clearer.
Overall I like the querying code and I think we should add it to Biopython.

I have a bunch of comments on the KEGG module, some on the existing code and some on the new querying code, see below. Most of these are trivial; some may need some further discussion. Perhaps could you let us know which of these comments you can address, and which ones you want to skip for now?

Once we converged with regards to the querying code and the documentation, I think we can import your version of the KEGG module into the main Biopython repository and add your chapter on KEGG to the main documentation, and continue from there on the parsers and the unit tests.

Many thanks!
-Michiel.

About the querying code:
----------------------------------

I would replace KEGG.query("list", KEGG.query("find", KEGG.query("conv", KEGG.query("link", KEGG.query("info", KEGG.query("get" by the functions KEGG.list, KEGG.find, KEGG.conv, KEGG.link, KEGG.info, and KEGG.get.

For list, find, conv, link, and info, instead of going through KEGG.generic_parser, I would return the result directly as a Python list.
In contrast, KEGG.get should return the handle to the results, not the data itself. So the _q function, instead of
  ...
  resp = urllib2.urlopen(req)
  data = resp.read()
  return query_url, data
have
  ...

  resp = urllib2.urlopen(req)

  return resp
Then the user can decide whether to parse the data on the fly with Bio.KEGG, or read the data line by line and pick up what they are interested in, or to get all data from the handle and save it in a file. Note that resp will have a .url attribute that contains the url, so you won't need the ret_url keyword.

About the parsers:

------------------------

I think that we should drop generic_parser. For link, find, conv, link, and info, parsing is trivial and can be done by the respective functions directly. For get, we already have an appropriate parser for some databases (compound, map, and enzyme), but it's easy to add parsers for the other databases.

For all parsers in Biopython, there is the question whether the record should store information in attributes (as is currently done in Bio.KEGG), or alternatively if the record should inherit from a dictionary and store information in keys in the dictionary. Personally I have a preference for a dictionary, since that allows us to use the exact same keys in the dictionary as is used in the file (e.g., we can use "CLASS" as a key, while we cannot use .class as an attribute since it is a reserved word, so we use .classname instead). But other Biopython developers may not agree with me, and to some extent it depends on personal preference. 

The parsers miss some key words. The ones I noticed are ALL_REAC, REFERENCE, and ORTHOLOGY. Probably we'll find more once we extend the unit tests.

Remove the ';' at the end of each term in record.classname.

Convert record.genes to a dictionary for each organism. So instead of
[('HSA', ['5236', '55276']), ('PTR', ['456908', '461162']), ('PON', ['100190836', '100438793']), ('MCC', ['100424648', '699401']...
have
{'HSA': ['5236', '55276'], 'PTR': ['456908', '461162'], 'PON': ['100190836', '100438793'], 'MCC': ['100424648', '699401'], ...

Also for record.dblinks, record.disease, record.structures, use a dictionary.

In record.pathway, all entries start with 'PATH'. Perhaps we should check with KEGG if there could be anything else than 'PATH' there, otherwise I don't see the reason why it's there. Assuming that there could be something different there, I would also use a dictionary with 'PATH' as the key.

In record.reaction, some chemical names can be very long and extend over multiple lines. In such cases, the continuation line starts with a '$'. The parser should remove the '$' and join the two lines.

About the tests:

--------------------

We should update the data files in Tests/KEGG. This will fix some "bugs" in these data files.

We should switch test_KEGG.py to the unit test framework.

We should do some more extensive testing to make sure we are not missing some key words.

About the documentation:
---------------------------------
It's great that we now have some documentation.

On page 233, I would suggest to replace the "id_" by "accession" or something else, since the underscore in "id_" may look funky to new users.

Also it may be better not to reuse variable names (e.g. "pathway" is used in three different ways in the example). It's OK of course in general, but for this example it may be more clear to distinguish the different usages of this variable from each other.

For repair_genes, you can use a set instead of a list throughout.

--- On Wed, 10/24/12, Kevin Wu <kjwu at ucsd.edu> wrote:

From: Kevin Wu <kjwu at ucsd.edu>
Subject: Re: [Biopython-dev] KEGG API Wrapper
To: "Peter Cock" <p.j.a.cock at googlemail.com>, "Zachary Charlop-Powers" <zcharlop at mail.rockefeller.edu>, "Michiel de Hoon" <mjldehoon at yahoo.com>
Cc: Biopython-dev at lists.open-bio.org
Date: Wednesday, October 24, 2012, 6:38 PM

Hi All,
Thanks for the comments, I've written a bit of documentation on the entire KEGG module and have attached those relevant pages to the email. There didn't seem like an appropriate place for examples, so I just added a new chapter. I've also committed the updated file to github.

I did leave out the parsers due to the fact that the current parsers only cover a small portion of possible responses from the api. Also, I'm not confident that the some of the parsers correctly retrieves all the fields. However, I've written a really general parser that does a rough job of retrieving fields if it's a database format returned since I find myself reusing the code for all database formats. It's possible to modify this to correctly account for the different fields, but would probably take a bit of work to manually figure each field out. Otherwise it also parses the tsv/flat file returned.

Also, @zach, thanks for checking it out and testing it!
Thanks All!Kevin
On Wed, Oct 17, 2012 at 4:09 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

On Wed, Oct 17, 2012 at 12:55 AM, Zachary Charlop-Powers

<zcharlop at mail.rockefeller.edu> wrote:

> Kevin,

> Michiel,

>

> I just tested Kevin's code for a few simple queries and it worked great. I

> have always liked KEGG's organization of data and really appreciate this

> RESTful interface to their data; in some ways I think it easier to use the

> web interfaces for KEGG than it is for NCBI. Plus the KEGG coverage of

> metabolic networks is awesome.  I found the examples in Kevin's test script

> to be fairly self-explanatory but a simple-spelled out example in the

> Tutorial would be nice.

>

> One thought, though, is that you can retrieve MANY different types of data

> from the KEGG Rest API - which means that the user will probably have to

> parse the data his/herself. Data retrieved with "list" can return lists of

> genes or compounds or organism and after a  cursory look  these are each

> formatted differently. Also true with the 'find' command. So I think you

> were right to leave out parsers because i think they will be a moving target

> highly dependent on the query.

>

> Thank You Kevin,

> zach cp

Good point about decoupling the web API wrapper and the parsers -

how the Bio.Entrez module and Bio.TogoWS handle this is to return

handles for web results, which you can then parse with an appropriate

parser (e.g. SeqIO for GenBank files, Medline parser, etc).

Note that this is a little more fiddly under Python 3 due to the text

mode distinction between unicode and binary... just something to

keep in the back of your mind.

Peter

_______________________________________________

Biopython-dev mailing list

Biopython-dev at lists.open-bio.org

http://lists.open-bio.org/mailman/listinfo/biopython-dev