[Biopython-dev] Uniprot XML parser on TrEmbl

Andrea Pierleoni andrea at biocomp.unibo.it
Thu Nov 25 16:09:28 UTC 2010


> Hi Andrea,
>
> I *think* I have fixed the problem with empty names in the UniProt XML
> format, without affecting the unit tests, but I don't have the 62GB free to
> unpack uniprot_trembl.xml.gz to try it out:
>
> https://github.com/biopython/biopython/commit/bb971b2a7384d42d9a6e4994e59299a90e6cc700
>
> Would you be able to retest the trunk code on that please?
>

I've just completed a run on the 8GB gzipped TrEMBL file (I don't have 62GB
free either) and it was OK, with zero errors.
By the way, it took just 2h 18m, the same time it took on the uncompressed
62GB XML file. So it's definitely better not to decompress this file...
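
For anyone who wants to try reading the gzipped file directly, here is a
minimal sketch of the general idea using only the Python standard library.
It is not the Biopython parser and not the script used for the run above.
The helper name count_entries is made up for illustration, and it assumes
each record is an <entry> element, matched by local tag name so the XML
namespace does not matter.

```python
import gzip
import xml.etree.ElementTree as ET


def count_entries(path):
    """Count <entry> elements in a gzipped XML file without unpacking it to disk.

    Hypothetical helper for illustration only.
    """
    count = 0
    # gzip.open decompresses on the fly as the XML parser reads from the handle.
    with gzip.open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            # Compare the local tag name so any namespace prefix is ignored.
            if elem.tag.rsplit("}", 1)[-1] == "entry":
                count += 1
                # Drop the finished record's children to limit memory use.
                elem.clear()
    return count
```

Calling count_entries("uniprot_trembl.xml.gz") would stream through the
compressed file, so the 62GB of scratch space is never needed.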


> I also changed the handling of the organism host (where present)
> in both the UniProt and SwissProt parsers to be more consistent.

Good.

> I've checked uniprot_sprot.dat still parses, but haven't tried the
> much bigger uniprot_trembl.dat from uniprot_trembl.dat.gz - so
> again, would you be able to retest the "swiss" text parser too?

I'll test this too and let you know.

>
> Many thanks,
>
> Peter
>
> P.S. Did you get any reply from UniProt about the apparent error in
> the Q2LEH1 record within uniprot_trembl.xml.gz?
>

Unfortunately not.

Andrea




