[Biopython-dev] New: Uniprot XML parser

Tue Jul 27 14:55:20 UTC 2010

> Partly it was because you had some unrelated stuff on your uniprot branch
> (something in the FASTA m10 parser - I'd be interested to see an example
> file which triggered your change).
>

yes, I know about the FASTA parser change, but actually it did not fix
the problem, it only made things better. the m10 parser has problems when
parsing glsearch output, but we could discuss that in a separate thread.

> If you can do this via (c)ElementTree, without building a dummy XML
> single record as a string in memory first, that would be worth trying.
>

yes it can be done, I'll put this in my work list.
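
A minimal sketch of how this could look with ElementTree's iterparse,
reading one <entry> at a time instead of building a single-record XML
string in memory. The namespace URI, the iter_entries name and the sample
data are assumptions for illustration, not code from my branch.

```python
import io
import xml.etree.ElementTree as ET

# Assumed UniProt XML namespace, written in ElementTree's {uri}tag form.
NS = "{http://uniprot.org/uniprot}"


def iter_entries(handle):
    """Yield each <entry> element from a UniProt XML stream, one at a time."""
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == NS + "entry":
            yield elem
            # Drop the entry's children once the caller has moved on,
            # so memory use stays roughly constant for large files.
            elem.clear()


# Tiny made-up input with two entries.
SAMPLE = b"""<uniprot xmlns="http://uniprot.org/uniprot">
<entry><accession>P00001</accession></entry>
<entry><accession>P00002</accession></entry>
</uniprot>"""

accessions = [
    entry.findtext(NS + "accession")
    for entry in iter_entries(io.BytesIO(SAMPLE))
]
print(accessions)
```

Each entry has to be consumed before asking for the next one, because the
element is cleared as soon as the generator resumes.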

>
> At some point I'll try the patch and test it against your UniProt XML
> feature generation. If I recall correctly there were some special cases
> with features at the very start of the protein which puzzled me. Hopefully
> the XML descriptions are clearer.
>

XML descriptions are clearer, but have some problems as well.
some features do not have a start and end point. in this case I skipped them.
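
Roughly, the skipping rule can be sketched like this. It assumes the
feature location is given as <begin position="..."/> and
<end position="..."/> inside <location>; the feature_span name and the
sample entry are hypothetical, and positions are kept as written in the
file.

```python
import xml.etree.ElementTree as ET

# Assumed UniProt XML namespace, written in ElementTree's {uri}tag form.
NS = "{http://uniprot.org/uniprot}"


def feature_span(feature):
    """Return (begin, end) as ints, or None if either point is missing."""
    location = feature.find(NS + "location")
    if location is None:
        return None
    begin = location.find(NS + "begin")
    end = location.find(NS + "end")
    if begin is None or end is None:
        return None
    begin_pos = begin.get("position")
    end_pos = end.get("position")
    if begin_pos is None or end_pos is None:
        # e.g. a point marked as unknown, with no position attribute
        return None
    return int(begin_pos), int(end_pos)


# Made-up entry: the second feature has no usable start point.
SAMPLE = """<entry xmlns="http://uniprot.org/uniprot">
<feature type="chain"><location><begin position="1"/><end position="10"/></location></feature>
<feature type="chain"><location><begin status="unknown"/><end position="10"/></location></feature>
</entry>"""

entry = ET.fromstring(SAMPLE)
spans = [feature_span(f) for f in entry.iter(NS + "feature")]
kept = [s for s in spans if s is not None]  # incomplete features are skipped
print(kept)
```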

>> ... and Mauro built some unit tests to compare the results
>> between the two parsers, take a look at Tests/test_Uniprot.py in my
>> repo:
>>
>> http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py
>
> I thought I tried your version of the test but the seq_tests_common
> function compare_records seemed too strict...
>

It depends on how closely we want the plain-text and XML parsers to match.
I don't think we could end up with 100% identical seqrecords, so some
flexibility should be allowed.

>
> I avoided this issue in the test on my branch ;)
>
> I think we should update the plain text parser and BioSQL wrapper to
> use the same nesting as BioPerl is using. i.e. Start by running
> BioPerl to import a record into BioSQL, and see how the comment
> ends up.
>

well, BioPerl guys weren't very collaborative on the BioSQL mailing list.
however I just read a couple of messages at that time.

they are using their schema and BioJava is not using the same schema.
I don't know about other projects.

I think we have 3 choices:

1) follow BioPerl, whatever they do (could be good)
2) try to define our rules (bad)
3) set a defined open schema and propose it to BioSQL (good)

In my parser I'm storing information from the comment as annotations
in the seqrecords, building the annotation keys on the basis of the XML
tree. this is a quick and dirty hack, but can be done much better.
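
To give an idea of what building keys from the XML tree means, here is a
simplified sketch, not the code in my branch. The key layout
(comment_<type>_<tag path>), the comment_annotations name and the sample
entry are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed UniProt XML namespace, written in ElementTree's {uri}tag form.
NS = "{http://uniprot.org/uniprot}"


def comment_annotations(entry):
    """Flatten <comment> elements into a dict keyed by their tree path."""
    annotations = {}

    def walk(elem, key):
        # Store any text found at this node under the path-based key.
        text = (elem.text or "").strip()
        if text:
            annotations.setdefault(key, []).append(text)
        for child in elem:
            local_name = child.tag.split("}", 1)[-1]  # strip the namespace
            walk(child, key + "_" + local_name)

    for comment in entry.iter(NS + "comment"):
        comment_type = comment.get("type", "unknown").replace(" ", "")
        walk(comment, "comment_" + comment_type)
    return annotations


# Made-up entry with one flat and one nested comment.
SAMPLE = """<entry xmlns="http://uniprot.org/uniprot">
<comment type="function"><text>Binds DNA.</text></comment>
<comment type="subcellular location"><subcellularLocation><location>Nucleus</location></subcellularLocation></comment>
</entry>"""

annotations = comment_annotations(ET.fromstring(SAMPLE))
for key, values in sorted(annotations.items()):
    print(key, values)
```

The values are lists because the same comment type can occur more than
once in a record.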

we could store complex comment fields as XML, but I'm not inclined
to use just a big XML string in the comment field.

Also keep in mind that the "comment" field is no longer called "comments"
on the UniProt web site but "general annotations", so maybe it makes sense
to store this data as annotations in some other place.

>> just to be clear, are we going to call this parser format just
>> "uniprot" or "uniprot-xml"?
>
> Another open question, I recall asking this on the open-bio cross project
> mailing list, but can't find it in the archive. Maybe I just meant to
> write an email and forgot? Do you remember this - I would have CC'd you.
> Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but
> would like to agree this with BioPerl and EMBOSS.

The issue here was that I started calling this format "uniprot", then I
realized that in the EBI REST services the file format is referred to as
"uniprot-xml". currently in my branch it is called "uniprot-xml".

Andrea