[Biopython-dev] Merging Uniprot XML parser?

Fri Nov 5 17:53:50 UTC 2010

On Fri, Nov 5, 2010 at 4:43 PM, Andrea Pierleoni wrote:
>
> On Tue, Oct 19, 2010 at 4:54 PM, Peter wrote:
>> I've now merged this into the trunk (with a git rebase first so the
>> history is linear - no branch+merge), and Andrea has agreed to
>> retest it. Other testing and comments are most welcome.
>>
>> Peter
>>
>
>
> I've done a couple of tests, from the master biopython branch.
> The uniprot-xml parser successfully parsed the 2010_11 release
> of uniprot containing 522,019 entries.
>
> The plain text 'swiss' parser took 6 mins to parse the complete flatfile
> uniprot db on my system (python 2.6 on a macbook pro, core2duo).
> The uniprot-xml parser took 12 minutes to do the same task when using
> cElementTree and looks pretty good to me (compare this to the 8
> minutes I needed to download the gzipped db).

I think I have a slightly older version, as it only has 519,348 entries.
Here are my timings using Python 2.6 on Mac OS X, looping over each
file with Bio.SeqIO.parse() and incrementing a counter:

uniprot_sprot.fasta, 232 MB, 15s ("fasta")
uniprot_sprot.dat, 2.2 GB, 4m57s ("swiss")
uniprot_sprot.xml, 4.5 GB, 10m34s ("uniprot-xml")
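
For anyone wanting to repeat this kind of timing, a counting loop along
these lines should be enough. This is a minimal sketch, not the exact
script used here: the function name count_records and the optional
parse argument (there only so the loop can be tried without Biopython
installed) are illustrative.

```python
import time


def count_records(filename, format_name, parse=None):
    """Count the records in filename, returning (count, seconds taken)."""
    if parse is None:
        # Imported lazily so a stand-in parser can be passed in instead.
        from Bio import SeqIO
        parse = SeqIO.parse
    start = time.time()
    count = 0
    # SeqIO.parse returns an iterator, so records are handled one at a time.
    for record in parse(filename, format_name):
        count += 1
    return count, time.time() - start


# Example call (needs Biopython and the downloaded file):
# count, seconds = count_records("uniprot_sprot.xml", "uniprot-xml")
# print("%i records in %0.1fs" % (count, seconds))
```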

Note the XML file is about twice the size of the plain text swiss
format file, and as you noted, takes about twice as long to parse.
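
To put rough numbers on that, the throughput can be worked out from the
figures above. This is just arithmetic on the reported sizes and times,
and it assumes 1 GB = 1000 MB for simplicity.

```python
# Sizes in MB and times in seconds, taken from the timings above
# (assuming 1 GB = 1000 MB).
timings = {
    "fasta": (232.0, 15),
    "swiss": (2200.0, 4 * 60 + 57),
    "uniprot-xml": (4500.0, 10 * 60 + 34),
}

for name in ("fasta", "swiss", "uniprot-xml"):
    size_mb, seconds = timings[name]
    print("%s: %0.1f MB/s" % (name, size_mb / seconds))

size_ratio = timings["uniprot-xml"][0] / timings["swiss"][0]
time_ratio = float(timings["uniprot-xml"][1]) / timings["swiss"][1]
print("XML vs swiss: %0.2fx the size, %0.2fx the time" % (size_ratio, time_ratio))
```

Both the swiss and XML parsers come out at roughly 7 MB/s, so per byte
of input they are about equally fast.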

> However, it took more than 80 mins to do the same task using
> ElementTree. So be aware that the parser can become very slow
> without the C library.
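
On the C library point, the usual way to prefer cElementTree and fall
back to the pure Python ElementTree is a guarded import. The sketch
below shows that pattern with iterparse on a tiny made-up document; it
is an illustration only and not necessarily how UniprotIO does it.

```python
try:
    # C implementation; a separate module on Python 2.5 to 3.8.
    from xml.etree import cElementTree as ElementTree
except ImportError:
    # Pure Python fallback on older setups; on Python 3.3+ this module
    # already uses the C accelerator when it is available.
    from xml.etree import ElementTree

from io import BytesIO

# Tiny made-up document standing in for a UniProt XML file.
data = BytesIO(b"<uniprot><entry>A</entry><entry>B</entry></uniprot>")

count = 0
for event, elem in ElementTree.iterparse(data, events=("start", "end")):
    if event == "end" and elem.tag == "entry":
        count += 1
        elem.clear()  # drop the element's contents once it has been handled
print(count)
```
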
>
> I'm currently retesting also on TrEMBL, but I don't think there is going
> to be any problem.

OK - those files are about 10 times bigger, right?

> I have no idea of the performance with Jython and similar
> Python implementations, nor whether it works at all.

The tests all pass with Jython 2.5.1 (running under Mac OS X),
and here are some timings:

uniprot_sprot.fasta, 232 MB, 21s ("fasta")
uniprot_sprot.dat, 2.2 GB, 8m34s ("swiss")
uniprot_sprot.xml, 4.5 GB, FAILED ("uniprot-xml")

The XML file failed almost immediately with this traceback:

Traceback (most recent call last):
  File "../count.py", line 13, in <module>
    for record in SeqIO.parse(open(filename), format_name):
  File "../count.py", line 13, in <module>
    for record in SeqIO.parse(open(filename), format_name):
  File "/Users/xxx/jython2.5.1/Lib/site-packages/Bio/SeqIO/UniprotIO.py", line 80, in UniprotIterator
    for event, elem in ElementTree.iterparse(handle, events=("start", "end")):
  File "/Users/xxx/jython2.5.1/Lib/xml/etree/ElementTree.py", line 937, in next
    self._parser.feed(data)
  File "/Users/xxx/jython2.5.1/Lib/xml/etree/ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
  File "/Users/xxx/jython2.5.1/Lib/xml/parsers/expat.py", line 195, in Parse
    self._data.append(data)
	at java.util.Arrays.copyOf(Arrays.java:2882)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
	at java.lang.StringBuilder.append(StringBuilder.java:119)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)

java.lang.OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space

Note this wasn't the machine running out of memory (it had GBs
free); rather, the JVM ran out of heap space. That's a bit
frustrating - but Kyle's email suggests things could improve in the
next Jython release.

Peter