[Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt

Peter Cock p.j.a.cock at googlemail.com
Tue Apr 21 12:04:44 UTC 2009


On Tue, Apr 21, 2009 at 12:55 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
>> Have you got a link for the full record in your example?
>>
> You can find it here:
>
> http://www.uniprot.org/uniprot/Q9XHP0.txt
>
>> For interaction with other Bio.SeqIO formats, I generally
>> expect the description to be a single line string (with no
>> embedded newlines).
>
>> It looks like the SwissProt format has changed, and we
>> should be parsing the new extended DE lines more
>> carefully, and splitting these entries up and recording
>> them in the SeqRecord.annotations dictionary?
>>
> That sounds reasonable. The dictionary will have to be nested though. Something like this:
>
> annotations["RecName"] = ["Full=11S globulin seed storage protein 2"]
> annotations["AltName"] = ["Full=11S globulin seed storage protein II", "Full=Alpha-globulin"]
> annotations["Contains"] = [{"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"},
>                             "AltName": {"Full": "11S globulin seed storage protein II acidic chain"}},
>                            {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"},
>                             "AltName": {"Full": "11S globulin seed storage protein II basic chain"}},
>                           ]
> annotations["Flags"] = "Precursor"
>

Possible - but for BioSQL we couldn't store those dictionaries.  A
list of strings should work, but isn't as elegant.  Maybe something
along these lines?

annotations["RecName"] = ["Full: 11S globulin seed storage protein 2;"]
annotations["AltName"] = ["Full: 11S globulin seed storage protein II",
                          "Full: Alpha-globulin"]
annotations["Contains"] = [
    "RecName: Full=11S globulin seed storage protein 2 acidic chain;\n"
    "AltName: Full=11S globulin seed storage protein II acidic chain;",
    "RecName: Full=11S globulin seed storage protein 2 basic chain;\n"
    "AltName: Full=11S globulin seed storage protein II basic chain;",
]
annotations["Flags"] = "Precursor"

Or for "Contains" just have a flat list of strings, one for each name
(here four names).
Or for "Contains" just drop the AltName entries, and simply have a
list of the RecName entries (here two names).
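
To make the three "Contains" options concrete, here is a rough sketch that
starts from a nested structure like the one Michiel suggested and flattens
it each way. The helper names (entries, as_blocks, as_flat_names,
as_rec_names) are made up for illustration and are not part of Biopython;
the "Category: Key=value;" string layout is only one possible choice.

```python
# Nested "Contains" data, along the lines of the dictionary proposal.
contains = [
    {"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"},
     "AltName": {"Full": "11S globulin seed storage protein II acidic chain"}},
    {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"},
     "AltName": {"Full": "11S globulin seed storage protein II basic chain"}},
]

# Fixed category order, so the output does not depend on dict ordering.
ORDER = ("RecName", "AltName")


def entries(block):
    """Yield (category, key, value) tuples for one Contains block."""
    for category in ORDER:
        for key, value in block.get(category, {}).items():
            yield category, key, value


def as_blocks(contains):
    """Option 1: one string per Contains block, entries joined by newlines."""
    return ["\n".join("%s: %s=%s;" % entry for entry in entries(block))
            for block in contains]


def as_flat_names(contains):
    """Option 2: a flat list with every name (RecName and AltName)."""
    return [value for block in contains for _, _, value in entries(block)]


def as_rec_names(contains):
    """Option 3: only the RecName entry of each block."""
    return [block["RecName"]["Full"] for block in contains]


if __name__ == "__main__":
    for make in (as_blocks, as_flat_names, as_rec_names):
        print(make.__name__, make(contains))
```

All three return plain lists of strings, so any of them avoids the nested
dictionaries that are the problem for BioSQL; they differ only in how much
of the AltName information survives.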

Peter



