[Biopython-dev] Uniprot XML parser on TrEmbl

Thu Nov 11 16:45:43 UTC 2010

On Thu, Nov 11, 2010 at 4:08 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
> I finally found the time, and the 62 GB needed, to test the TrEMBL database
> in UniProt XML format.

Is that the size on disk of the XML file? 62GB is a lot.

> The analysis is currently running, but so far I've been able to parse 1
> million entries out of 12 million (it will run overnight...)
>
> I've had just one problem, with the entry Q2LEH1_9ROSI:
> in the downloaded file there are multiple organism name fields, one of
> which is empty:
>
> ...
>  <organism evidence="EI1">
>    <name type="scientific"></name>
>    <name type="common">Populus tomentosa x P. bolleana) x P. tomentosa
> var. truncat</name>
> ...
>
> This part of the entry is reported differently on the UniProt server at:
> http://www.uniprot.org/uniprot/Q2LEH1.xml
>
> ...
>  <organism evidence="EI1">
>  <name type="scientific">(Populus tomentosa x P. bolleana) x P. tomentosa
> var. truncata</name>
> ...
>
> Now, given also the missing opening parenthesis, I think there is an error
> in the downloaded XML file.

It sounds like it - have you told UniProt?

> I've attached a patch that should cope with this issue. I don't know if
> there are more "errors" in the XML file.
> The patch was made against the current Biopython master branch on
> GitHub and is valid for commit 9363c3cdc5f51805f247.
>
> Andrea

Checked in, thanks:
https://github.com/biopython/biopython/commit/38da3ff264fe180e903cda4c143a7aa9be3d431a
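
For anyone following along, here is a minimal sketch of the kind of guard
that copes with an empty organism name element. It is not the committed
patch and not Biopython's parser code: it uses only the standard library,
the helper name organism_names is made up for illustration, and the sample
omits the XML namespace that a real UniProt file would carry.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the Q2LEH1_9ROSI entry quoted above
# (namespace omitted; closing tag added so it parses on its own).
SAMPLE = """<organism evidence="EI1">
  <name type="scientific"></name>
  <name type="common">Populus tomentosa x P. bolleana) x P. tomentosa var. truncat</name>
</organism>"""


def organism_names(organism):
    """Return {name type: text} for each non-empty <name> child.

    Hypothetical helper: empty elements have .text == None, so they
    are skipped instead of raising when the text is used later.
    """
    names = {}
    for name in organism.findall("name"):
        text = (name.text or "").strip()
        if not text:
            continue  # tolerate the empty scientific name
        names[name.get("type", "unknown")] = text
    return names


names = organism_names(ET.fromstring(SAMPLE))
print(names)
```

With this guard the empty scientific name is simply dropped and only the
common name survives, so downstream code has to accept a record without a
scientific name.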

Peter