[Biopython-dev] New: Uniprot XML parser

Tue Jul 27 16:37:59 UTC 2010

>> XML descriptions are clearer, but have some problems as well.
>> Some features do not have a start and end point; in this case I skipped
>> them.
>
> If you have some specific examples (IDs) to hand that would be useful.
>

Try this:
http://www.uniprot.org/uniprot/Q8NE62.xml

The "error" corresponds to the old '?' symbol in feature positions.
The entry carries this feature:

<feature type="transit peptide" description="Mitochondrion"
status="potential">
    <location>
        <begin position="1"/>
        <end status="unknown"/>
    </location>
</feature>

I'm actually skipping all the features/comments carrying a
status="unknown" attribute
in the start or end position, or both.
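The skipping rule can be sketched roughly as follows. This is only an illustration using the standard library, not the code of the actual parser: the function names and the second sample feature are made up, and the sample omits the XML namespace that the real UniProt files declare, so the tag names would need qualifying there.

```python
import xml.etree.ElementTree as ET

# The first feature is the one from Q8NE62 quoted above; the second is a
# made-up feature with both positions known, added for contrast.
SAMPLE = """
<entry>
  <feature type="transit peptide" description="Mitochondrion" status="potential">
    <location>
      <begin position="1"/>
      <end status="unknown"/>
    </location>
  </feature>
  <feature type="chain" description="Example chain">
    <location>
      <begin position="30"/>
      <end position="120"/>
    </location>
  </feature>
</entry>
"""


def has_unknown_position(feature):
    """Return True if the begin or end element carries status="unknown"."""
    for tag in ("begin", "end"):
        element = feature.find("location/" + tag)
        if element is not None and element.get("status") == "unknown":
            return True
    return False


def usable_features(xml_text):
    """Yield (type, begin, end) for features whose positions are all known."""
    root = ET.fromstring(xml_text)
    for feature in root.iter("feature"):
        if has_unknown_position(feature):
            continue  # skipped, as described above
        begin = feature.find("location/begin")
        end = feature.find("location/end")
        if begin is None or end is None:
            continue  # other location layouts are not handled in this sketch
        yield feature.get("type"), int(begin.get("position")), int(end.get("position"))


print(list(usable_features(SAMPLE)))
```

Only the second feature survives; the transit peptide is dropped because its end position is unknown.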

other examples:
3HIDH_DICDI
ADAM1_RAT
ADAM1_RAT
ADM1B_MOUSE
ADM1B_MOUSE
CARDH_CYNCA
CARDH_CYNCA
CHDH_HUMAN
COQ41_PARTE
COQ4_CHAGB
COQ4_LEIMA
COX11_DICDI
COX11_DICDI
COX16_NEUCR
...

>
> I agree we're not going to get 100% identical records.

good

>
> Perhaps you are using "schema" in a different way that I would. All the
> projects use the same schema (where I mean database tables), but
> there are differences in the details of how each file format gets parsed
> and ends up stored in those tables.

Yes I'm referring to data schema in general, not strictly the BioSQL schema.
I don't mean to change the BioSQL schema.

>
>> I think we have 3 choices:
>>
>> 1) follow BioPerl whatever they do (could be good)
>> 2) try to define our rules (bad)
>> 3) set a defined open schema and propose it to BioSQL (good)
>
> If in (3) you mean we should have some clear examples of major file
> formats and how each field should end up in BioSQL, I agree. In the
> short to medium term I regard the bioperl-db mapping as the reference
> implementation (although their code does continue to change), i.e. (1).
>
> I found one of the threads I was thinking about in the archive,
> http://bioperl.org/pipermail/biosql-l/2010-January/001672.html
> http://bioperl.org/pipermail/bioperl-l/2010-January/031993.html
> http://bioperl.org/pipermail/open-bio-l/2010-January/000609.html

So does it make sense to follow their code and their changes?
This would be valid just for BioPerl and BioPython.

>
>> In my parser I'm storing information from the comment as annotations
>> in the seqrecords, building annotation keys on the basis of the XML
>> tree. This is a quick and dirty hack, but can be done much better.
>>
>> We could store complex comment fields with XML, but I'm not inclined
>> to use just a big XML string in the comment field.
>
> Some sort of nested structure like a dictionary? Are you familiar
> with the Perl TagTree, which is what BioPerl are using here? I think
> Richard Holland said (in the above linked thread) that BioJava just
> sticks the DE section as an XML string into their record object
> (and thus puts XML in the BioSQL database?).
>

I'm not familiar with TagTree, but I looked at it during that
discussion, and I do not see any advantage in using it explicitly
in the db fields instead of XML.
I would save XML text in the DB, easily readable by every language
and even by humans. XML text can also be queried easily. Then I'd represent
this XML in a nested dictionary structure similar to the Perl TagTree.
I don't know if there is any Python implementation of this...
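As a rough idea of what such a nested dictionary could look like, here is a minimal sketch using only the standard library. The function name, the "@"/"#text" key convention and the simplified comment element are all assumptions made for illustration; this is not TagTree and not an existing Biopython API.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an element into a nested dictionary.

    Attributes are stored under "@name" keys, text under "#text", and
    child elements under their tag name (always as a list, so repeated
    tags are handled uniformly).
    """
    node = {"@" + name: value for name, value in element.attrib.items()}
    text = (element.text or "").strip()
    if text:
        node["#text"] = text
    for child in element:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


# Simplified, illustrative comment element (not copied from a real entry).
comment = ET.fromstring(
    '<comment type="subcellular location">'
    "<subcellularLocation><location>Mitochondrion</location></subcellularLocation>"
    "</comment>"
)
nested = element_to_dict(comment)
print(nested)
```

Keeping every child as a list makes the structure a bit verbose, but it avoids special-casing tags that appear once in one record and several times in another.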

>> Also keep in mind that the "comment" field is no longer called comments
>> in the UniProt web-site but "general annotations", so maybe it makes
>> sense to store this data as annotation in some other place.
>
> Sounds sensible.

you can use XML here too, if needed.

Also, by using XML, we would be able to store dictionary-containing seqrecords
in a BioSQL db. A big plus to me.
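To make the idea concrete, a dictionary annotation could be round-tripped through an XML string in a text column roughly like this. It is only a sketch: the table annotation_xml is hypothetical and NOT part of the BioSQL schema, the helper names are made up, and it only handles dictionaries whose keys are valid XML tag names and whose values are strings or further dictionaries.

```python
import sqlite3
import xml.etree.ElementTree as ET


def dict_to_xml(tag, value):
    """Build an element from a dictionary whose values are strings or dictionaries."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(dict_to_xml(key, child))
    else:
        element.text = str(value)
    return element


def xml_to_dict(element):
    """Inverse of dict_to_xml for the same restricted kind of dictionary."""
    if len(element) == 0:
        return element.text or ""
    return {child.tag: xml_to_dict(child) for child in element}


# Made-up annotation content, for illustration only.
annotation = {"function": "Example text", "location": {"organelle": "Mitochondrion"}}
xml_text = ET.tostring(dict_to_xml("annotation", annotation), encoding="unicode")

# Hypothetical table in an in-memory database, NOT the BioSQL schema.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE annotation_xml (record_id TEXT, value TEXT)")
connection.execute("INSERT INTO annotation_xml VALUES (?, ?)", ("Q8NE62", xml_text))
(stored,) = connection.execute(
    "SELECT value FROM annotation_xml WHERE record_id = ?", ("Q8NE62",)
).fetchone()

restored = xml_to_dict(ET.fromstring(stored))
print(restored == annotation)
```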

>
> I'll (re-)post that as a specific query on the open-bio-l mailing list...
>

it looks like anybody is agreeing with "uniprot-xml"

Andrea