[Biopython-dev] Merging Uniprot XML parser?

Andrea Pierleoni andrea at biocomp.unibo.it
Fri Nov 5 18:09:08 UTC 2010


> I think I have a slightly older version as it only has 519348 entries.
> My timings using Python 2.6 on Mac OS X, using looping over the
> file with Bio.SeqIO.parse() and incrementing a counter:
>
> uniprot_sprot.fasta, 232 MB, 15s ("fasta")
> uniprot_sprot.dat, 2.2 GB, 4m57s ("swiss")
> uniprot_sprot.xml, 4.5 GB, 10m34s ("uniprot-xml")
>

My timings were without the counter :)
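
For anyone wanting to repeat the comparison, the counting loop described above could be sketched roughly as below. The helper names (count_records, time_parse, time_seqio) are made up for illustration, and passing a filename straight to Bio.SeqIO.parse() assumes a reasonably recent Biopython; older releases may need an open handle instead.

```python
import time


def count_records(records):
    """Consume an iterator and return how many items it yielded."""
    count = 0
    for _ in records:
        count += 1
    return count


def time_parse(make_iterator):
    """Return (record count, elapsed seconds) for one pass over an iterator.

    make_iterator is a zero-argument callable, so that creating the
    parser is included in the measured time.
    """
    start = time.time()
    count = count_records(make_iterator())
    return count, time.time() - start


def time_seqio(filename, fmt):
    """Hypothetical wrapper: time Bio.SeqIO.parse() on one file.

    Biopython is imported lazily so the helpers above work without it.
    """
    from Bio import SeqIO  # requires Biopython to be installed
    return time_parse(lambda: SeqIO.parse(filename, fmt))


# Example call, using the file names and format names from the timings above:
# count, seconds = time_seqio("uniprot_sprot.xml", "uniprot-xml")
```

The numbers will of course depend on the machine, the Python version and the release of the data files.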

> Note the XML file is about twice the size of the plain text swiss
> format file, and as you noted, takes about twice as long to parse.
>

Yes, that's true, but just iterating over the two files takes 18s for the
.dat one and 38s for the .xml one. The information retrieved is more or
less the same; the rest is overhead due to the complexity of the XML file.
However, it's pretty fast anyway, at least with cElementTree.
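
As a rough illustration of streaming through a large XML file with ElementTree, here is a minimal sketch that only counts the entry elements. It is not the code of the parser itself. It uses xml.etree.ElementTree (in current Python versions the C accelerator is used automatically, so a separate cElementTree import is no longer needed), and the tiny sample document and its accessions are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by the UniProt XML format, in ElementTree's {uri}tag notation.
NS = "{http://uniprot.org/uniprot}"


def count_entries(source):
    """Count <entry> elements in a UniProt-style XML file or file object."""
    count = 0
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "entry":
            count += 1
            # Drop the entry's children and text so that memory use does
            # not grow with the full content of every entry read so far.
            elem.clear()
    return count


# Made-up two-entry document, only to show the call.
sample = io.BytesIO(
    b'<uniprot xmlns="http://uniprot.org/uniprot">'
    b"<entry><accession>TEST1</accession></entry>"
    b"<entry><accession>TEST2</accession></entry>"
    b"</uniprot>"
)
print(count_entries(sample))  # prints 2
```

A real run would pass the path to uniprot_sprot.xml instead of the in-memory sample.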

>> I'm currently retesting also on TrEMBL, but I don't think there is going
>> to be any problem.
>
> OK - those files are about 10 times bigger, right?

It's currently 12 million entries! So it's 24 times bigger (7.5 GB gzipped).
In fact, I can't complete the test today. I'll keep you updated.


>
> Note this wasn't a simple out of memory error (the machine had GBs
> free), rather it was heap space. That's a bit frustrating - but Kyle's
> email suggests things could improve in the next Jython release.
>


Is the new Jython release coming soon? I'm really a newbie to Jython,
so I don't think I can help with it. Maybe it is safer for Jython users to
use the 'swiss' parser until the new release comes out, particularly if
they have performance issues.
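
Just as a sketch of that fallback: a script could pick the input format depending on the interpreter. The function names here are hypothetical, the file names are the ones from the timings quoted above, and the check relies on sys.platform starting with "java" under Jython.

```python
import sys


def running_on_jython(platform=None):
    """Return True if the platform string looks like Jython's."""
    # Jython reports a sys.platform value beginning with "java".
    platform = sys.platform if platform is None else platform
    return platform.startswith("java")


def pick_uniprot_input(platform=None):
    """Return a (filename, Bio.SeqIO format name) pair for this interpreter."""
    if running_on_jython(platform):
        # Plain text Swiss-Prot file: avoids the XML parser on Jython.
        return "uniprot_sprot.dat", "swiss"
    return "uniprot_sprot.xml", "uniprot-xml"


filename, fmt = pick_uniprot_input()
print(filename, fmt)
```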

Andrea




