[Biopython-dev] New: Uniprot XML parser

Tue Jul 27 15:16:00 UTC 2010

On Tue, Jul 27, 2010 at 3:55 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>
>> At some point I'll try the patch and test it against your UniProt XML
>> feature generation. If I recall correctly there were some special cases
>> with features at the very start of the protein which puzzled me. Hopefully
>> the XML descriptions are clearer.
>>
>
> XML descriptions are clearer, but have some problems as well.
> Some features do not have a start and end point. In this case I skipped them.

If you have some specific examples (IDs) to hand that would be useful.
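
To make the case under discussion concrete, here is a minimal sketch of
skipping features that lack a usable start or end. It uses only the Python
standard library. The XML snippet is a made-up, namespace-free simplification
of a UniProt entry, and the function name feature_spans is hypothetical (it
is not taken from Andrea's branch).

```python
import xml.etree.ElementTree as ET

# Made-up snippet for illustration only; real UniProt XML uses a namespace.
SAMPLE = """<entry>
  <feature type="chain" description="Example chain">
    <location><begin position="1"/><end position="120"/></location>
  </feature>
  <feature type="modified residue">
    <location><position position="42"/></location>
  </feature>
  <feature type="propeptide">
    <location><begin status="unknown"/><end position="17"/></location>
  </feature>
</entry>"""


def feature_spans(xml_text):
    """Return (spans, skipped) for the features found in xml_text.

    spans holds (type, start, end) tuples using Python-style zero-based,
    end-exclusive coordinates; skipped holds the types of features whose
    location could not be turned into a start and end.
    """
    spans = []
    skipped = []
    for feat in ET.fromstring(xml_text).iter("feature"):
        loc = feat.find("location")
        single = loc.find("position") if loc is not None else None
        begin = loc.find("begin") if loc is not None else None
        end = loc.find("end") if loc is not None else None
        if single is not None and single.get("position"):
            # A single residue: one-based position N becomes [N-1, N).
            pos = int(single.get("position"))
            spans.append((feat.get("type"), pos - 1, pos))
        elif (begin is not None and end is not None
              and begin.get("position") and end.get("position")):
            spans.append((feat.get("type"),
                          int(begin.get("position")) - 1,
                          int(end.get("position"))))
        else:
            # No numeric start and/or end, so record it rather than guess.
            skipped.append(feat.get("type"))
    return spans, skipped


spans, skipped = feature_spans(SAMPLE)
```

Returning the skipped feature types alongside the spans would make it easy
to collect the IDs of affected entries.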

>>> ... and Mauro built some unit tests to compare the results
>>> between the two parsers, take a look at Tests / test_Uniprot.py in my
>>> repo:
>>>
>>> http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py
>>
>> I thought I tried your version of the test but the seq_tests_common
>> function compare_records seemed too strict...
>>
>
> It depends on how well we want to fit the plain-text vs XML parser.
> I don't think we could end up in 100% identical seqrecords, and some
> flexibility should be used.

I agree we're not going to get 100% identical records.
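
As a rough idea of what a more flexible comparison could look like, here is
a minimal sketch. It does not use Biopython: the Record class is a
hypothetical stand-in for a SeqRecord, and compare_records_loosely is an
invented name, not the compare_records function in seq_tests_common. The
assumption is that the ID and sequence must match exactly, while annotations
are only compared for keys that both parsers produce.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a SeqRecord, just enough for this sketch."""
    id: str
    seq: str
    annotations: dict = field(default_factory=dict)


def compare_records_loosely(old, new, ignore=()):
    """Return a list of differences; an empty list means 'close enough'."""
    problems = []
    if old.id != new.id:
        problems.append("id: %r != %r" % (old.id, new.id))
    if old.seq != new.seq:
        problems.append("seq differs")
    # Only compare annotation keys present in both records and not ignored.
    shared = (set(old.annotations) & set(new.annotations)) - set(ignore)
    for key in sorted(shared):
        if old.annotations[key] != new.annotations[key]:
            problems.append("annotation %r: %r != %r"
                            % (key, old.annotations[key], new.annotations[key]))
    return problems


# Made-up records standing in for plain-text and XML parser output.
plain = Record("EXAMPLE1", "MKT", {"organism": "Homo sapiens",
                                   "comment": "free text"})
from_xml = Record("EXAMPLE1", "MKT", {"organism": "Homo sapiens",
                                      "comment_function": ["free text"]})
differences = compare_records_loosely(plain, from_xml)
```

Keys that only one parser produces are silently tolerated here; a stricter
variant could report them as warnings instead.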

>> I think we should update the plain text parser and BioSQL wrapper to
>> use the same nesting as BioPerl is using, i.e. start by running
>> BioPerl to import a record into BioSQL, and see how the comment
>> ends up.
>>
>
> Well, the BioPerl guys weren't very collaborative on the BioSQL mailing list.
> However, I just read a couple of messages at that time.
>
> They are using their schema and BioJava is not using the same schema.
> I don't know about other projects.

Perhaps you are using "schema" in a different way than I would. All the
projects use the same schema (where I mean database tables), but
there are differences in the details of how each file format gets parsed
and ends up stored in those tables.

> I think we have 3 choices:
>
> 1) follow BioPerl whatever they do (could be good)
> 2) try to define our rules (bad)
> 3) set a defined open schema and propose it to BioSQL (good)

If in (3) you mean we should have some clear examples of major file
formats and how each field should end up in BioSQL, I agree. In the
short to medium term I regard the bioperl-db mapping as the reference
implementation (although their code does continue to change), i.e. (1).

I found one of the threads I was thinking about in the archive,
http://bioperl.org/pipermail/biosql-l/2010-January/001672.html
http://bioperl.org/pipermail/bioperl-l/2010-January/031993.html
http://bioperl.org/pipermail/open-bio-l/2010-January/000609.html

> In my parser I'm storing information from the comment as annotations
> in the seqrecords, building annotation keys on the basis of the XML
> tree. This is a quick and dirty hack, but can be done much better.
>
> We could store complex comment fields as XML, but I'm not inclined
> to use just a big XML string in the comment field.

Some sort of nested structure like a dictionary? Are you familiar
with the Perl TagTree, which is what BioPerl are using here? I think
Richard Holland said (in the above linked thread) that BioJava just
sticks the DE section as an XML string into their record object
(and thus puts XML in the BioSQL database?).
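
For the nested dictionary idea, here is a minimal sketch using only the
standard library. The function name element_to_dict, the "@attributes" and
"#text" keys, and the sample comment element are all invented for
illustration; this is not how BioPerl's TagTree or any existing parser
stores the data.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an XML element into a nested dictionary.

    Attributes go under "@attributes", the element's own text under
    "#text", and each child tag maps to a list of converted children
    (a list because a tag may repeat).
    """
    node = {}
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    for child in elem:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


# Made-up, namespace-free comment element for illustration only.
SAMPLE = '<comment type="function"><text>Made-up example text.</text></comment>'
nested = element_to_dict(ET.fromstring(SAMPLE))
```

The "@" and "#" prefixes are there so that an attribute or text entry cannot
collide with a child element that happens to be called "text".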

> Also keep in mind that the "comment" field is no longer called comments
> on the UniProt website but "general annotations", so maybe it makes sense
> to store this data as annotation in some other place.

Sounds sensible.

>>> Just to be clear, are we going to call this parser format just
>>> "uniprot" or "uniprot-xml"?
>>
>> Another open question, I recall asking this on the open-bio cross project
>> mailing list, but can't find it in the archive. Maybe I just meant to write
>> an email and forgot? Do you remember this - I would have CC'd you.
>> Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but
>> would like to agree this with BioPerl and EMBOSS.
>
>
> The issue here was that I started calling this format "uniprot", then I
> realized that in the EBI REST services the file format is referred to as
> "uniprot-xml". Currently in my branch it is called uniprot-xml.
>

I'll (re-)post that as a specific query on the open-bio-l mailing list...

Peter