[Biopython-dev] New: Uniprot XML parser

Andrea Pierleoni andrea at biocomp.unibo.it
Wed Sep 15 12:25:04 UTC 2010


>
> We could put the DB cross reference into the dbxrefs list, but that only
> captures a tiny part of the data. We could also put it in the annotations,
> but that loses the benefits of the position information. Maybe using a
> SeqFeature is the best plan...
>

It is to me.


>
> On the other hand, "uniprot-xml" fits well with the idea of
> "format-variant".
> Whatever we go with will have downsides.

Well, I suppose we have to choose one. "uniprot-xml" is fine for me.

>
>> I'm still working on the SeqIO.index to make a faster implementation. RE
>> are really slow, and ElementTree should cope well with this task.
>> Anyhow it works with the current implementation, so it's not a big deal.
>
> I don't know enough about ElementTree to help right now, sorry.
>

Well, I tried to reimplement the _index UniprotDict function using ElementTree,
but it looks like this cannot be done that way. To iterate over an XML file,
ElementTree uses the iterparse function, which can capture start and end
events for every tag. Unfortunately, this event "capture" is not aligned with
the parsing process: when a start event is raised, the parser could be up to
16K ahead in reading the file, and the actual position is variable. See

http://mail.python.org/pipermail/xml-sig/2005-January/010838.html

Thus I cannot pick up the start position of the <entry> tag in the file.
The only way I found to make it work is going line by line, like you did in
your implementation. We can use that one.
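
To make the line-by-line idea concrete, here is a rough sketch of how the
offsets could be collected. It is not the code in the branch: the function
name index_entries and the SAMPLE data are made up for illustration, and it
assumes that each <entry ...> tag starts its own line and that the first
<accession> element is a suitable key.

```python
import io

# Hypothetical sample data, shaped like a tiny UniProt XML file.
SAMPLE = (
    b'<?xml version="1.0"?>\n'
    b'<uniprot>\n'
    b'<entry dataset="Swiss-Prot">\n'
    b'  <accession>P12345</accession>\n'
    b'  <name>TEST1</name>\n'
    b'</entry>\n'
    b'<entry dataset="Swiss-Prot">\n'
    b'  <accession>Q67890</accession>\n'
    b'</entry>\n'
    b'</uniprot>\n'
)


def index_entries(handle):
    """Map first accession -> (byte offset, byte length) of each entry.

    The handle must be opened in binary mode so that tell() returns exact
    byte offsets. Assumes every <entry ...> tag begins its own line.
    """
    index = {}
    start = None  # offset of the current <entry, or None when outside one
    key = None    # first accession seen in the current entry
    while True:
        offset = handle.tell()  # position of the line about to be read
        line = handle.readline()
        if not line:
            break
        if start is None:
            stripped = line.lstrip()
            if not stripped.startswith(b"<entry"):
                continue
            # Skip any leading whitespace before the tag itself.
            start = offset + (len(line) - len(stripped))
            key = None
        if key is None:
            i = line.find(b"<accession>")
            if i != -1:
                j = line.find(b"</accession>", i)
                if j != -1:
                    key = line[i + len(b"<accession>"):j].decode("ascii")
        end_pos = line.find(b"</entry>")
        if end_pos != -1:
            end = offset + end_pos + len(b"</entry>")
            if key is not None:
                index[key] = (start, end - start)
            start = None
    return index


if __name__ == "__main__":
    for acc, (offset, length) in index_entries(io.BytesIO(SAMPLE)).items():
        print(acc, offset, length)
```

Because the position is taken with tell() before each readline(), it always
refers to the start of the line just read, which is what iterparse cannot
guarantee. A record can later be fetched with seek(offset) and read(length).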

Andrea
