[Biopython-dev] More SwissProt inconsistencies

Peter biopython at maubp.freeserve.co.uk
Mon Jun 1 10:15:03 UTC 2009


On Sat, May 30, 2009 at 10:37 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> 1) A multi-line author list such as the following:
> ...
> is stored without newlines by Bio.SeqIO:
> ...
> but with newlines by Bio.SwissProt:
>
> To me, the Bio.SeqIO approach seems more reasonable. I think we should
> add a space though at places where there is a newline in the file.
>
> The same happens for multiline RL such as
>
> RL   (In) Baker M.J., Crush J.R., Humphreys L.R. (eds.);
> RL   Proceedings of the XVII international grassland congress,
> RL   pp.2:1033-1034, Dunmore Press, Palmerston North (1993).
>
> and for multiline RT lines such as
>
> RT   "Genome of the host-cell transforming parasite Theileria annulata
> RT   compared with T. parva.";
>
> This is stored by Bio.SeqIO as
>
> '"Genome of the host-cell transforming parasite Theileria annulatacompared with T. parva.";'
>
> and by Bio.SwissProt as
>
> '"Genome of the host-cell transforming parasite Theileria annulata\ncompared with T. parva.";'
>
> whereas I think that both should be stored as
>
> '"Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.";'

I agree with you - the missing spaces when parsed with Bio.SeqIO are a
bug and should be fixed.
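
As a rough illustration of the behaviour being proposed (this is only a
sketch, not the actual Bio.SeqIO or Bio.SwissProt code, and the helper name
join_continuation_lines is made up for this example), wrapped RT/RA/RL lines
could be merged by stripping the five-column line prefix and joining the
pieces with a single space:

```python
def join_continuation_lines(lines):
    """Join the payloads of wrapped SwissProt lines with single spaces.

    Hypothetical helper for illustration; assumes each line starts with a
    two-letter line code followed by three spaces (five columns in total).
    """
    # Drop the line code and padding, then trim stray whitespace.
    payloads = [line[5:].strip() for line in lines]
    # Skip empty payloads so they do not produce doubled spaces.
    return " ".join(p for p in payloads if p)


rt_lines = [
    'RT   "Genome of the host-cell transforming parasite Theileria annulata',
    'RT   compared with T. parva.";',
]
title = join_continuation_lines(rt_lines)
print(title)
```

With the RT example quoted above, this gives the title with a space between
"annulata" and "compared", rather than a newline or nothing at all.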

> 2) Comments in a reference such as the following:
> RC   STRAIN=cv. VF36; TISSUE=Anther;
> are stored as a single string by Bio.SeqIO:
> >>> seq_record.annotations['references'][i].comment
> 'STRAIN=cv. VF36; TISSUE=Anther;'
> but as a list of (key, value) pairs by Bio.SwissProt:
> [('STRAIN', 'cv. VF36'), ('TISSUE', 'Anther')]
> While I think both are reasonable, Bio.SeqIO drops the space between
> two (key, value) pairs if they are on two separate lines:
> RC   STRAIN=C57BL/6J;
> RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;
> is stored as
> >>> seq_record.annotations['references'][i].comment
> 'STRAIN=C57BL/6J;TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;'
> I think we should add a space here, or just store these as (key, value) pairs as Bio.SwissProt is doing.
>
> Any objections or comments?

Using a list of (key, value) pairs may be more sensible, but it would
probably break the BioSQL loader (and be inconsistent with reference
objects from the GenBank/EMBL parser). Adding the space is reasonable:
it is a simple change which shouldn't hurt anything.
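
To make the two options concrete, here is a minimal sketch (the function
names join_rc_lines and split_rc_comment are hypothetical, and this is not
how either parser is actually written). The first function keeps the comment
as a single string but adds the missing space; the second shows how such a
string could still be turned into (key, value) pairs by anyone who wants
them:

```python
def join_rc_lines(lines):
    """Merge wrapped RC lines into one string, separated by single spaces.

    Hypothetical helper; assumes the usual five-column 'RC   ' prefix.
    """
    return " ".join(line[5:].strip() for line in lines)


def split_rc_comment(comment):
    """Split 'KEY=value; KEY=value;' into a list of (key, value) tuples."""
    pairs = []
    for token in comment.split(";"):
        token = token.strip()
        if not token:
            continue  # ignore the empty piece after the final semicolon
        key, _, value = token.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


rc_lines = [
    "RC   STRAIN=C57BL/6J;",
    "RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;",
]
comment = join_rc_lines(rc_lines)
pairs = split_rc_comment(comment)
print(comment)
print(pairs)
```

Keeping the string form means the reference objects stay compatible with
the BioSQL loader and the GenBank/EMBL parser, while the pairs remain easy to
recover on demand.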

Peter



