[Biopython-dev] New: Uniprot XML parser

Andrea Pierleoni andrea at biocomp.unibo.it
Fri Jan 15 10:35:39 UTC 2010


>
> My reasoning is it should be (almost) transparent for
> users to switch from parsing the plain text SwissProt
> files ("swiss") to the XML form.

This would be good.

> There are also knock
> on implications for saving to BioSQL and file format
> conversions e.g. saving as a GenBank protein file
> (aka GenPept format).

The returned SeqRecords are BioSQL-safe: I can load them into a
PostgreSQL BioSQL database.
When I format a SeqRecord as 'genbank', the dbxrefs, features, seq,
keywords, source and names appear to be reported correctly, but there
is no trace of the other annotations.
I'll look into it further.

>
> However, the comment parsing in the plain text "swiss"
> format is currently a little simplistic - partly to match
> what BioPerl did at the time. We can revisit that as
> part of this work.
>

The main problem here is going to be the comment fields, which the
plain text parser handles as a single string (this is what pushed me to
write the new parser). I tried to keep comment parsing as simple as
possible by just using lists of strings (good for BioSQL), but many comment
types would be better parsed into a dictionary tree.
For now I have left an option to get back the full XML for each comment, by
calling:

UniprotIO.UniprotIterator(handle, return_raw_comments=True)

so all the information in the XML file can be returned, and the end user
can decide how to parse the additional information.
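
To make the trade-off between a list of strings and a dictionary tree
concrete, here is a minimal sketch using only the standard library. It is
not the UniprotIO code: the XML fragment is a simplified, hand-written
comment element, and the helper names (local_name, comment_to_strings,
comment_to_tree) are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written comment element (an assumption for illustration,
# not copied from a real UniProt entry).
RAW_COMMENT = """<comment xmlns="http://uniprot.org/uniprot" type="subcellular location">
  <subcellularLocation>
    <location>Cytoplasm</location>
    <location>Nucleus</location>
  </subcellularLocation>
</comment>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def comment_to_strings(element):
    """Flat view: every non-empty text node, in document order."""
    return [text.strip() for text in element.itertext() if text.strip()]


def comment_to_tree(element):
    """Nested view: leaf elements become strings, others become dicts
    mapping each child tag name to a list of parsed children."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    tree = {}
    for child in children:
        tree.setdefault(local_name(child.tag), []).append(comment_to_tree(child))
    return tree


element = ET.fromstring(RAW_COMMENT)
flat = comment_to_strings(element)
nested = {element.get("type"): comment_to_tree(element)}
```

The flat list is easy to store, but it loses which element each string
came from; the nested form keeps that structure at the cost of a less
uniform annotation value.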

Anyhow, I think it is better to discuss this once the 'swiss' vs 'uniprot'
unit test is ready.
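
As a rough idea of what such a comparison could check, the sketch below
reports which fields differ between two record-like objects. Everything in
it is an assumption made for illustration: the field list, the helper name
record_differences, and the stand-in records with made-up values, which are
used so that the snippet runs without Biopython or data files.

```python
from types import SimpleNamespace

# Fields assumed to be comparable across the two parsers (illustrative choice).
FIELDS = ("id", "name", "seq", "dbxrefs")


def record_differences(swiss_record, uniprot_record, fields=FIELDS):
    """Return {field: (swiss_value, uniprot_value)} for every mismatch."""
    differences = {}
    for field in fields:
        old = getattr(swiss_record, field, None)
        new = getattr(uniprot_record, field, None)
        if old != new:
            differences[field] = (old, new)
    return differences


# Stand-in records with made-up values; a real test would parse the same
# entry once from the plain text file and once from the XML file.
swiss = SimpleNamespace(
    id="P00001", name="EXAMPLE_HUMAN", seq="MKT", dbxrefs=["PDB:1ABC"]
)
uniprot = SimpleNamespace(
    id="P00001", name="EXAMPLE_HUMAN", seq="MKT",
    dbxrefs=["PDB:1ABC", "GO:0005737"],
)

diffs = record_differences(swiss, uniprot)
```

An empty result would mean the two parsers agree on the chosen fields; a
non-empty one lists exactly where they diverge, which seems a useful
starting point for the discussion.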

Andrea




