[Biopython-dev] New: Uniprot XML parser

Tue Jul 27 14:04:01 UTC 2010

On Tue, Jul 27, 2010 at 2:50 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>>
>> Hi Andrea,
>>
>> As you have probably noticed via github, I have been trying out your code.
>>
>> I noticed you hadn't implemented indexing support so I have done this on
>> my branch as a quick hack:
>>
>> http://github.com/peterjc/biopython/commits/uniprot
>
> good, are we going to continue developing on two separate branches/repos?
> if you want I can grant you access to my repo, no problem, just to make
> things simpler...

Partly it was because you had some unrelated stuff on your uniprot branch
(something in the FASTA m10 parser - I'd be interested to see an example
file which triggered your change).

>>
>> What I want to be able to do is seek to the start of an <entry ...> in the
>> XML handle, and have the parser continue from that point. I've done this
>> by the nasty trick of extracting the record from the XML file as a string
>> (using the get_raw method of the index class), then adding the XML
>> header and footer to it, and then invoking your parser. There should
>> be a better way to do this, but I am not familiar enough with
>> ElementTree to see it right away. Can you improve on this?
>>
>
> well it can be done using ElementTree, maybe it will also be faster than
> using the re module (actually I don't know if the re module is used by
> etree). However, using cElementTree, when possible, will improve
> performance. By using ElementTree we can also handle namespaces,
> returning a valid uniprot XML file/string.

If you can do this via (c)ElementTree, without building a dummy XML
single record as a string in memory first, that would be worth trying.
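
One possible way to avoid the concatenated dummy string, sketched with only
the standard library: ElementTree's XMLParser accepts data incrementally via
feed(), so the header, the raw entry bytes and the footer can be fed one after
another. This is an illustration, not the code on either branch. The helper
name parse_entry_at, the (offset, length) pair standing in for whatever the
index stores, and the exact header/namespace text are all assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespace and wrapper text; check against a real UniProt XML file.
NS = "{http://uniprot.org/uniprot}"
HEADER = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<uniprot xmlns="http://uniprot.org/uniprot">\n')
FOOTER = b"</uniprot>\n"


def parse_entry_at(handle, offset, length):
    """Return the <entry> element stored at offset/length in a binary handle.

    offset and length are hypothetical values an index would supply.
    """
    parser = ET.XMLParser()
    parser.feed(HEADER)               # open a dummy root element
    handle.seek(offset)
    parser.feed(handle.read(length))  # the raw <entry>...</entry> bytes
    parser.feed(FOOTER)               # close the dummy root
    root = parser.close()
    return root.find(NS + "entry")


# Small made-up file with two entries, used only to exercise the helper.
demo = (HEADER
        + b"<entry><accession>P00001</accession></entry>\n"
        + b"<entry><accession>P00002</accession></entry>\n"
        + FOOTER)

# Locate the second entry by hand; a real index would record these numbers.
start = demo.index(b"<entry", demo.index(b"<entry") + 1)
end = demo.index(b"</entry>", start) + len(b"</entry>")

entry = parse_entry_at(io.BytesIO(demo), start, end - start)
accession = entry.find(NS + "accession").text
```

The record bytes are still read into memory once, but no wrapped copy of the
record is built, and the namespace from the dummy root applies to the entry
as it would in the full file.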

>> I'd also like to have SeqFeature parsing done for the plain text "swiss"
>> parser as well, which can double as a cross check for your parser. Did you
>> look at my old patch? http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>>
>
> yes I looked at it,

At some point I'll try the patch and test it against your UniProt XML
feature generation. If I recall correctly there were some special cases
with features at the very start of the protein which puzzled me. Hopefully
the XML descriptions are clearer.

> ... and Mauro built some unit tests to compare the results
> between the two parsers, take a look at Tests / test_Uniprot.py in my repo:
>
> http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py

I thought I tried your version of the test but the seq_tests_common function
compare_records seemed too strict...

>> We should also run a comparison test of the "swiss" plain text and
>> "uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot
>> and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads
>>
>
> I've successfully tested the last version in my branch on the current
> version of UniProtKB/Swiss-Prot.

Good.

> the main differences between the two formats will be the comment field,
> and I don't see how they can match, since they are very different from
> the two original uniprot files.
>
> any idea?

I avoided this issue in the test on my branch ;)

I think we should update the plain text parser and BioSQL wrapper to use
the same nesting as BioPerl is using, i.e. start by running BioPerl to
import a record into BioSQL, and see how the comment ended up.

> just to be clear, are we going to call this parser format just "uniprot" or
> "uniprot-xml"?

Another open question. I recall asking this on the open-bio cross project
mailing list, but can't find it in the archive. Maybe I just meant to write an
email and forgot? Do you remember this? I would have CC'd you.
Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but
would like to agree this with BioPerl and EMBOSS.

Peter