[Biopython-dev] Uniprot XML parser on TrEmbl

Andrea Pierleoni andrea at biocomp.unibo.it
Thu Nov 11 16:08:58 UTC 2010


I finally found the time, and the 62 GB needed, to test the TrEMBL database
in UniProt XML format.
The analysis is still running, but so far I've been able to parse 1
million entries out of 12 million (it will run overnight...)
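
For a file of this size the entries have to be streamed rather than loaded
at once. Below is a minimal sketch of that idea using only the Python
standard library; it is not the Biopython parser under test. The
count_entries helper and the tiny inline sample are illustrative
assumptions; the namespace is the one UniProt XML files declare.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniprot}"  # namespace declared by UniProt XML files


def count_entries(handle):
    """Count <entry> elements while streaming through the file (hypothetical helper)."""
    count = 0
    context = ET.iterparse(handle, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == NS + "entry":
            count += 1
            root.clear()  # drop finished entries so memory use stays bounded
    return count


# Tiny stand-in for the real download, just to show the call.
sample = io.BytesIO(
    b'<uniprot xmlns="http://uniprot.org/uniprot">'
    b"<entry><accession>A</accession></entry>"
    b"<entry><accession>B</accession></entry>"
    b"</uniprot>"
)
print(count_entries(sample))
```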

I've had just one problem, with the entry Q2LEH1_9ROSI.
In the downloaded file there are multiple organism name fields, one of
which is empty:

...
  <organism evidence="EI1">
    <name type="scientific"></name>
    <name type="common">Populus tomentosa x P. bolleana) x P. tomentosa
var. truncat</name>
...

This part of the entry is reported differently on the UniProt server at:
http://www.uniprot.org/uniprot/Q2LEH1.xml

...
 <organism evidence="EI1">
  <name type="scientific">(Populus tomentosa x P. bolleana) x P. tomentosa
var. truncata</name>
...

Now, given also the missing opening parenthesis, I think there is an error
in the downloaded XML file.

I've attached a patch that should cope with this issue. I don't know if
there are more "errors" in the XML file.
The patch was made against the current Biopython master branch on
GitHub and is valid for commit 9363c3cdc5f51805f247.
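
To illustrate the kind of handling involved, here is a minimal sketch of
one way to tolerate the empty name element, using only the Python standard
library. This is not the attached patch and not Biopython's code; the
organism_names helper is hypothetical, and the sample mirrors the fragment
quoted above.

```python
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniprot}"  # namespace declared by UniProt XML files


def organism_names(organism):
    """Return {type: name} for an <organism> element, skipping empty names.

    Hypothetical helper, for illustration only.
    """
    names = {}
    for name in organism.findall(NS + "name"):
        text = (name.text or "").strip()  # .text is None for <name ...></name>
        if not text:
            continue  # ignore the empty scientific name instead of failing
        names[name.get("type")] = " ".join(text.split())  # rejoin wrapped lines
    return names


fragment = """<organism xmlns="http://uniprot.org/uniprot" evidence="EI1">
  <name type="scientific"></name>
  <name type="common">Populus tomentosa x P. bolleana) x P. tomentosa
var. truncat</name>
</organism>"""
print(organism_names(ET.fromstring(fragment)))
```

With this approach the entry keeps whatever names are present, and the
caller can decide what to do when no scientific name is found.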

Andrea
-------------- next part --------------
A non-text attachment was scrubbed...
Name: UniprotIO.patch
Type: /
Size: 610 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20101111/3f9a10ae/attachment.bin>