[Biopython-dev] New: Uniprot XML parser

Andrea Pierleoni andrea at biocomp.unibo.it
Tue Jul 27 13:50:53 UTC 2010


>
> Hi Andrea,
>
> As you have probably noticed via github, I have been trying out your code.
>
> I noticed you hadn't implemented indexing support so I have done this on
> my branch as a quick hack:
>
> http://github.com/peterjc/biopython/commits/uniprot

Good. Are we going to continue developing on two separate branches/repos?
If you want, I can grant you access to my repo, no problem, just to make
things simpler...

>
> What I want to be able to do is seek to the start of an <entry ...> in the
> XML handle, and have the parser continue from that point. I've done this
> by the nasty trick of extracting the record from the XML file as a string
> (using the get_raw method of the index class), then adding the XML
> header and footer to it, and then invoking your parser. There should
> be a better way to do this, but I am not familiar enough with
> ElementTree to see it right away. Can you improve on this?
>

Well, it can be done using ElementTree, and maybe it will also be faster than
using the re module (actually, I don't know whether the re module is used by
etree). In any case, using cElementTree when possible will improve
performance. By using ElementTree we can also handle the namespace,
returning a valid uniprot XML file/string.
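
As a rough illustration of the idea (a minimal sketch, not the code in either
branch): a raw <entry> string, such as the one returned by get_raw, could be
wrapped in a root element that declares the UniProt namespace and then handed
to ElementTree. The namespace URI, the helper name parse_raw_entry and the
sample record below are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed UniProt XML namespace; check it against the real file header.
NS = "http://uniprot.org/uniprot"

# Minimal header/footer wrapped around a single raw entry (illustrative).
HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n<uniprot xmlns="%s">\n' % NS
FOOTER = "\n</uniprot>\n"


def parse_raw_entry(raw_entry):
    """Wrap one raw <entry>...</entry> string and return its Element.

    Hypothetical helper: the entry inherits the default namespace from
    the wrapping <uniprot> root, so its tags become '{NS}entry' etc.
    """
    document = (HEADER + raw_entry + FOOTER).encode("utf-8")
    root = ET.fromstring(document)
    return root.find("{%s}entry" % NS)


# Made-up record, only to show the namespace-qualified lookups.
raw = ('<entry dataset="Swiss-Prot">'
       '<accession>P12345</accession>'
       '<name>EXAMPLE_HUMAN</name>'
       '</entry>')

entry = parse_raw_entry(raw)
accession = entry.findtext("{%s}accession" % NS)
dataset = entry.get("dataset")
```

Every tag lookup has to be qualified with the namespace, which is why the
"{%s}accession" form is used instead of plain "accession".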

> I'd also like to have SeqFeature parsing done for the plain text "swiss"
> parser as well, which can double as a cross check for your parser. Did you
> look at my old patch? http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>

Yes, I looked at it, and Mauro built some unit tests to compare the results
between the two parsers. Take a look at Tests/test_Uniprot.py in my repo:

http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py


> We should also run a comparison test of the "swiss" plain text and
> "uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot
> and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads
>

I've successfully tested the latest version in my branch on the current
version of UniProtKB/Swiss-Prot.
The main differences between the two formats will be in the comment field,
and I don't see how they can match, since they are very different in
the two original uniprot files.

Any idea?

Just to be clear, are we going to call this parser format "uniprot" or
"uniprot-xml"?


Andrea




