[Biopython] Additions to the SeqRecord

Tue Nov 17 14:53:44 UTC 2009

Peter wrote:
>>
>> Regarding the special case of the source feature in GenBank files, for
>> tasks like removing part of the record, or doing an origin shift, you may
>> want to recreate a new source feature reusing the old source feature
>> annotation (e.g. NCBI taxon ID). However, the location would have to
>> reflect the new modified sequence length.
>>
>> I have another idea to "solve" this problem:
>>
>> I am actually tempted to remove the source SeqFeature, and instead
>> handle it via the annotations dict. To me this seems more natural than
>> having it as an entry in the feature table - a GenBank file format choice I
>> never really understood. My guess is they didn't want to introduce a record
>> level extensible annotation header block, which is what the source feature
>> could be regarded as handling.
>>
>> i.e. When parsing a GenBank (or EMBL) file, the source feature information
>> could get stored in the SeqRecord annotations dictionary. When writing to
>> GenBank (or in future EMBL) format, if the annotations dictionary contained
>> relevant fields, we would generate a source feature for the full sequence.
>>
>> Does that make sense? It requires looking at the source feature not as
>> a feature which happens to span the whole sequence, but as annotation
>> for the whole sequence (which happens to be in the GenBank features
>> table due to a historical choice or accident).

Brad Chapman wrote:
>
> I like that. You're right that those full length features are really
> annotations in disguise.

Good :)

> Instead of removing the source SeqFeature,
> I would suggest making it available in both places. This way you
> mimic what GenBank is doing, but also make it available in a more
> accessible and natural place. So for something like:
>
>     source          1..4411532
>                     /organism="Mycobacterium tuberculosis H37Rv"
>                     /mol_type="genomic DNA"
>                     /strain="H37Rv"
>                     /db_xref="taxon:83332"
>
> you would have the source SeqFeature, but also the organism,
> mol_type and strain in the annotations dictionary, and the cross
> reference in dbxrefs. Nice idea.

Good point about the dbxrefs - that makes sense :)
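
As a rough sketch of what Brad describes, here is how the qualifiers of
that source feature could be split between an annotations dictionary and
a dbxrefs list. This uses plain Python dicts and lists rather than real
SeqRecord/SeqFeature objects so that it stands alone; the function name
mirror_source_qualifiers is made up for illustration, and unwrapping
single-valued qualifiers into plain strings is just one possible choice.
It assumes qualifiers are held as a dict mapping each key to a list of
strings.

```python
def mirror_source_qualifiers(qualifiers):
    """Split source-feature qualifiers into (annotations, dbxrefs).

    Hypothetical helper, not part of Biopython. `qualifiers` is assumed
    to map each qualifier key to a list of string values.
    """
    annotations = {}
    dbxrefs = []
    for key, values in qualifiers.items():
        if key == "db_xref":
            # Cross references go to the record-level dbxrefs list.
            dbxrefs.extend(values)
        else:
            # Assumption: a single value is stored as a plain string,
            # multiple values are kept as a list.
            annotations[key] = values[0] if len(values) == 1 else list(values)
    return annotations, dbxrefs


# Qualifiers from the Mycobacterium tuberculosis example above.
source_qualifiers = {
    "organism": ["Mycobacterium tuberculosis H37Rv"],
    "mol_type": ["genomic DNA"],
    "strain": ["H37Rv"],
    "db_xref": ["taxon:83332"],
}
annotations, dbxrefs = mirror_source_qualifiers(source_qualifiers)
```

Here organism, mol_type and strain end up in annotations, and the taxon
cross reference ends up in dbxrefs.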

Interesting idea about having the parser record the source feature
in both the SeqFeature (as it does now) and the SeqRecord
annotations dict (as I suggested). That would certainly make sense
in the short term for a transition period, but in the long term we
should deprecate using a source SeqFeature. After all, for accessing
this information "There should be one-- and preferably only one --
obvious way to do it" (Zen of Python). This also applies to the code
for writing out GenBank files - if the information is in two places,
which takes priority?
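
To make the question concrete, here is a minimal sketch of the writing
side under one possible answer: the annotations win, and any old source
qualifiers are only used to fill in what the annotations lack. Again
this is plain Python, not Biopython code; build_source_feature,
SOURCE_KEYS and the dict layout of the returned feature are all
hypothetical, and the priority rule is an assumption rather than a
decision. The location is always regenerated to span the full sequence,
which also covers the case where the record was cut down or shifted.

```python
# Hypothetical list of annotation keys that belong in a source feature.
SOURCE_KEYS = ("organism", "mol_type", "strain")


def build_source_feature(annotations, dbxrefs, seq_length, old_qualifiers=None):
    """Build a source feature (as a plain dict) spanning the whole sequence.

    Assumed priority rule: values in `annotations` and `dbxrefs` override
    anything carried over from `old_qualifiers`.
    """
    # Start from a copy of the old qualifiers, if any were kept.
    qualifiers = {key: list(vals) for key, vals in (old_qualifiers or {}).items()}
    for key in SOURCE_KEYS:
        if key in annotations:
            qualifiers[key] = [annotations[key]]
    taxon_refs = [ref for ref in dbxrefs if ref.startswith("taxon:")]
    if taxon_refs:
        qualifiers["db_xref"] = taxon_refs
    # GenBank-style 1-based inclusive location, i.e. 1..seq_length.
    return {"type": "source", "location": (1, seq_length), "qualifiers": qualifiers}


# A record trimmed to 1000 bp whose old source feature had a stale organism.
feature = build_source_feature(
    {"organism": "Mycobacterium tuberculosis H37Rv", "mol_type": "genomic DNA"},
    ["taxon:83332"],
    1000,
    old_qualifiers={"organism": ["old name"], "strain": ["H37Rv"]},
)
```

With the opposite rule (the old source feature wins) the stale organism
name would be written out instead, which is exactly the ambiguity I
would like to avoid.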

Peter