[BioSQL-l] Treating GenBank source features as top level annotation
Hilmar Lapp
hlapp at gmx.net
Wed Nov 18 13:13:05 UTC 2009
I agree completely with your interpretation of the "source" feature
tag, and in fact what you outline below is what I implemented as a
"SeqProcessor" module for use within the SymAtlas data integration
project (BioPerl supports 'pipes' of I/O and processing modules, where
the latter can modify the sequence objects coming out of the I/O
module).
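
A minimal sketch of such a processing step, written in Python rather than BioPerl. It is not the SymAtlas module itself. The `Feature` and `Record` classes and the `promote_source` function are hypothetical stand-ins invented for illustration, not the BioPerl or Biopython API. The sketch assumes qualifier values are stored as lists, and it only promotes a "source" feature when exactly one exists and it spans the full sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for one parsed feature-table entry."""
    type: str
    start: int  # 1-based, inclusive, as written in the flat file
    end: int
    qualifiers: dict = field(default_factory=dict)  # key -> list of values


@dataclass
class Record:
    """Hypothetical stand-in for a parsed sequence record."""
    length: int
    annotations: dict = field(default_factory=dict)  # key -> list of values
    features: list = field(default_factory=list)


def promote_source(record):
    """Move a lone, full-length 'source' feature into record.annotations."""
    sources = [f for f in record.features if f.type == "source"]
    if len(sources) != 1:
        return record  # none, or several: leave the record untouched
    src = sources[0]
    if (src.start, src.end) != (1, record.length):
        return record  # partial source feature: keep it as a feature
    for key, values in src.qualifiers.items():
        record.annotations.setdefault(key, []).extend(values)
    # Drop the promoted feature, keep everything else in order.
    record.features = [f for f in record.features if f is not src]
    return record


# The source feature quoted below for NC_005816, plus one placeholder
# feature whose coordinates are made up for this example.
record = Record(
    length=9609,
    features=[
        Feature("source", 1, 9609, {
            "organism": ["Yersinia pestis biovar Microtus str. 91001"],
            "mol_type": ["genomic DNA"],
            "strain": ["91001"],
            "db_xref": ["taxon:229193"],
            "plasmid": ["pPCP1"],
            "biovar": ["Microtus"],
        }),
        Feature("gene", 100, 200),  # placeholder, not from the real record
    ],
)
promote_source(record)
```

Keeping this as a separate step after parsing, rather than inside the parser, means the default parse result is unchanged and the promotion can be switched on per pipeline. Writing GenBank output would need the reverse step, which is not shown here.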
I'm not sure I would want to hard-code this behavior into the BioPerl
GenBank parser. However, it would be easy enough to code it into a
processing module that ships with the standard distribution, so that
it can be enabled as simply as selecting a format variant in SeqIO.
It sounds useful enough that I guess I should post it to the BioPerl
list ...
-hilmar
On Nov 18, 2009, at 6:06 AM, Peter wrote:
> Hello all,
>
> Something we've just been discussing on the Biopython mailing list
> is a possible change to how we parse the source features in GenBank
> (or EMBL) files. This could have knock-on implications for how we use
> BioSQL. For anyone interested, the thread is here:
> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
>
> The basic observation is that GenBank files do not have any extensible
> annotation block for the whole sequence. There are a few fields like
> the comment, organism and taxonomy - but nothing general and
> structured. Instead, it seems the NCBI etc decided to use the feature
> table for this task by inventing the "source" feature. In every single
> GenBank file I have ever seen with a source feature, there is only
> one feature of this type and it spans the full sequence.
>
> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
> plasmid pPCP1, complete sequence:
>
> source 1..9609
> /organism="Yersinia pestis biovar Microtus str. 91001"
> /mol_type="genomic DNA"
> /strain="91001"
> /db_xref="taxon:229193"
> /plasmid="pPCP1"
> /biovar="Microtus"
>
> (I reduced the white space for emailing). All of that information
> makes sense as annotation for the whole sequence. In fact, the
> "organism" entry is duplicated on the ORGANISM line in the
> GenBank header (and the SOURCE line too).
>
> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
> associated with a "source" feature in the seqfeature table.
>
> I am suggesting it could make more sense to store the "source"
> feature annotation at the sequence level, using instead the
> bioentry_qualifier_value and bioentry_dbxref tables.
>
> This is a slight shift from the origins of BioSQL as a schema to
> hold GenBank files - but to me at least it is more logical.
>
> What does everyone else think? Things work as they are...
> and "if it ain't broken don't fix it"?
>
> Peter
>
> [Even if Biopython changes its internal object structure to treat
> the "source" feature annotation as sequence level annotation,
> we *could* continue to use a "source" feature when loading
> GenBank files to/from BioSQL if required for compatibility with
> the other Bio* projects. It would be more work though. In any
> case, we'd also need to recreate a "source" feature when
> writing GenBank output files.]
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================