[Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry?

Sun Mar 14 20:30:45 UTC 2010

On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJŠ wrote:
>
> Finally, the remaining differences are here (probably the first is in bug #2578):
>
> --- /tmp/orig.gb        2010-03-12 21:09:24.000000000 +0100
> +++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
> @@ -1,4 +1,4 @@
> -LOCUS       CR603932                1625 bp    mRNA    linear   HTC 16-OCT-2008
> +LOCUS       CR603932                1625 bp    DNA              HTC 16-OCT-2008
>  DEFINITION  full-length cDNA clone CS0DK007YH24 of HeLa cells Cot 25-normalized
>             of Homo sapiens (human).
>  ACCESSION   CR603932
> @@ -29,39 +29,39 @@
>             division of Invitrogen.
>  FEATURES             Location/Qualifiers
>      source          1..1625
> -                     /organism="Homo sapiens"
>                      /mol_type="mRNA"
> -                     /db_xref="taxon:9606"
>                      /clone="CS0DK007YH24"
> +                     /db_xref="taxon:9606"
>                      /tissue_type="HeLa cells Cot 25-normalized"
>                      /plasmid="pCMVSPORT_6"
> +                     /organism="Homo sapiens"
>  ORIGIN
>

Yes, the LOCUS line issue would be part of Bug 2578.

As to the order of the feature qualifiers, these are stored
in a Python dictionary, which does not preserve the order.
I personally don't think the order of the qualifiers is
important, and thus don't care that it can change like
this. Assuming the NCBI have a defined sort order for
the qualifiers (I'm not aware of one), then we could sort
the feature qualifiers on output. Another option would
be to store the qualifiers in an ordered dictionary. Or
just leave it as it is ;)
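
To make the two options concrete, here is a rough sketch in plain
Python. It is not Biopython code: the format_qualifiers helper is
hypothetical, the qualifier values are taken from the diff above,
and alphabetical sorting is only a stand-in, since I don't know of
an official NCBI sort order.

```python
from collections import OrderedDict

# Qualifiers of the "source" feature, in the order they appear in the
# original GenBank record quoted above.
original_order = [
    ("organism", "Homo sapiens"),
    ("mol_type", "mRNA"),
    ("db_xref", "taxon:9606"),
    ("clone", "CS0DK007YH24"),
    ("tissue_type", "HeLa cells Cot 25-normalized"),
    ("plasmid", "pCMVSPORT_6"),
]


def format_qualifiers(qualifiers, keys):
    """Hypothetical helper: render qualifiers as GenBank-style strings."""
    return ['/%s="%s"' % (key, qualifiers[key]) for key in keys]


# Option 1: keep a plain dictionary and sort the keys on output.
# Alphabetical order is an assumption, used only to get stable output.
plain = dict(original_order)
sorted_lines = format_qualifiers(plain, sorted(plain))

# Option 2: store the qualifiers in an ordered dictionary, so the order
# seen while parsing is the order written back out.
ordered = OrderedDict(original_order)
ordered_lines = format_qualifiers(ordered, ordered)

for line in ordered_lines:
    print(line)
```

Option 1 gives reproducible output but not the original order;
option 2 round-trips the order from the input file.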

Peter