[Biopython-dev] New: Uniprot XML parser

Andrea Pierleoni andrea at biocomp.unibo.it
Thu Jan 21 12:01:30 UTC 2010


>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>> detailed
>> comparison between the two parsed seqrecords.
>
> Great.
>
> Peter
>


As mentioned earlier, Mauro did a code review and added unit tests for the
parser in Tests/test_Uniprot.py.
The updated version is available in the GitHub repository:
http://github.com/apierleoni/biopython

Since this version is mature enough, I spent some time comparing the output
of this UniProt XML (UP) parser and the SwissProt (SP) plain text parser.
This comparison was done using the Q13639 UniProt entry.
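
For illustration, a comparison of this kind could be sketched roughly as
below. The helper name diff_annotations and the toy dictionaries are
assumptions made for this example, not the script actually used; in practice
the two dictionaries would be the .annotations of the SeqRecords produced by
each parser.

```python
def diff_annotations(up, sp):
    """Compare two annotation dicts (hypothetical helper).

    Returns four sorted lists of keys: identical in both, only in UP,
    only in SP, and present in both but with different values.
    """
    up_keys, sp_keys = set(up), set(sp)
    shared = up_keys & sp_keys
    identical = sorted(k for k in shared if up[k] == sp[k])
    differing = sorted(k for k in shared if up[k] != sp[k])
    only_up = sorted(up_keys - sp_keys)
    only_sp = sorted(sp_keys - up_keys)
    return identical, only_up, only_sp, differing


# Toy stand-ins for UP.annotations and SP.annotations, not real parser output.
up = {"accessions": ["Q13639"], "gene_name_primary": "HTR4"}
sp = {"accessions": ["Q13639"], "gene_name": "Name=HTR4;"}

identical, only_up, only_sp, differing = diff_annotations(up, sp)
print("identical:", identical)
print("only in UP:", only_up)
print("only in SP:", only_sp)
print("differing:", differing)
```

Keys that show up as "only in UP" or "only in SP" are the candidates for a
renamed (mapped) annotation rather than a truly missing one.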

These are the main differences between the two generated SeqRecords:

- id:  is the same (first accession)
- name: is the same
- description: UP reports the recommended name's full name value, while
       additional names and synonyms are in the annotations. SP reports a
       long string containing everything parsed as it is from the plain
       text.
- dbxrefs: UP reports all the dbxrefs of SP, adding DOI, MEDLINE, PubMed,
       NCBI Taxonomy and Swiss-Prot/TrEMBL dbxrefs
- seq: is the same
- features: missing in SP (I have to check with Peter's patch)
- annotations:
- - identical annotations: accessions, keywords, taxonomy, organism
- - mapped annotations:
       date_last_annotation_update in UP---> modified in SP
       date_last_sequence_update in UP---> sequence_modified in SP
       gene_name_primary in UP---> gene_name in SP
               >>> SP.annotations['gene_name']
               'Name=HTR4;'
               >>> UP.annotations['gene_name_primary']
               'HTR4'
       ncbi_taxid in SP ---> UP dbxrefs since it is mapped as a
                dbReference in the xmlfile
- - references: there are some minor differences.
        The final semicolon and double quotes are missing in UP for both
            the author and title fields.
        In UP, reference comments are reported as:
            "PublicationType | PublicationDate | Scope | Tissue"
        For the submission publication type, the db is reported in the
            comments and not in the journal field.
- - comments: here come the big differences.
       SP has the comments in a single string.
       UP comments are mapped to several annotation entries, using the
          comment type and attributes to build the annotation key.
          E.g.
          comment_function --> list of "function" type comment strings
          comment_subcellularlocation_location --> list of "location"
               strings in the subcellularlocation comment field

       The comment tree in the XML could easily be mapped to a comment
       dictionary tree, but this would not be BioSQL safe.


Andrea



