[BioSQL-l] Treating GenBank source features as top level annotation

Wed Nov 18 13:13:05 UTC 2009

I agree completely with your interpretation of the "source" feature  
tag, and in fact what you outline below is what I implemented as a  
"SeqProcessor" module for use within the SymAtlas data integration  
project (BioPerl supports 'pipes' of I/O and processing modules, where  
the latter can modify the sequence objects coming out of the I/O  
module).

I'm not sure I would want to hard-code this behavior into the BioPerl  
genbank parser. However, it would be easy enough to code it into a  
processing module that comes standard with the distribution to the  
extent that it can be enabled as simply as a format variant to SeqIO.
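
A rough sketch of such a processing step, written in Python rather than Perl
so that it runs without any Bio* library installed. The Record and Feature
classes, promote_source() and chain() are hypothetical stand-ins invented for
illustration; this is not the SymAtlas SeqProcessor code, nor BioPerl or
Biopython API. It assumes that a "source" feature is only promoted when it is
the only one and spans the full sequence, and that db_xref qualifiers become
record-level cross-references:

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    type: str
    start: int  # 1-based, inclusive, as written in the GenBank feature table
    end: int
    qualifiers: dict = field(default_factory=dict)  # name -> list of values


@dataclass
class Record:
    length: int
    annotations: dict = field(default_factory=dict)
    dbxrefs: list = field(default_factory=list)
    features: list = field(default_factory=list)


def promote_source(record):
    """Move a single full-length "source" feature to record-level annotation."""
    sources = [f for f in record.features if f.type == "source"]
    # Only act in the unambiguous case; otherwise leave the record untouched.
    if len(sources) != 1:
        return record
    src = sources[0]
    if (src.start, src.end) != (1, record.length):
        return record
    for name, values in src.qualifiers.items():
        if name == "db_xref":
            record.dbxrefs.extend(values)
        else:
            record.annotations.setdefault(name, []).extend(values)
    # Drop the promoted feature (by identity, not by equality).
    record.features = [f for f in record.features if f is not src]
    return record


def chain(records, *processors):
    """Pass each record coming out of a parser through the given processors."""
    for record in records:
        for process in processors:
            record = process(record)
        yield record


# Usage with the NC_005816 source qualifiers quoted below; the "gene"
# feature and its coordinates are made up for the example.
rec = Record(
    length=9609,
    features=[
        Feature("source", 1, 9609, {
            "organism": ["Yersinia pestis biovar Microtus str. 91001"],
            "strain": ["91001"],
            "db_xref": ["taxon:229193"],
            "plasmid": ["pPCP1"],
        }),
        Feature("gene", 100, 200),
    ],
)
out = next(chain([rec], promote_source))
```

Keeping the step separate from the parser means the default parsing behavior
is unchanged unless the processor is added to the chain. Writing GenBank
output would need the inverse step, recreating the "source" feature.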

It sounds useful enough that I guess I should post it to the BioPerl  
list ...

	-hilmar

On Nov 18, 2009, at 6:06 AM, Peter wrote:

> Hello all,
>
> Something we've just been discussing on the Biopython mailing list
> is a possible change to how we parse the source features in GenBank
> (or EMBL) files. This could have knock-on implications for how we use
> BioSQL. For anyone interested, the thread is here:
> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
>
> The basic observation is that GenBank files do not have any extensible
> annotation block for the whole sequence. There are a few fields like
> the comment, organism and taxonomy - but nothing general and
> structured. Instead, it seems the NCBI etc decided to use the feature
> table for this task by inventing the "source" feature. In every single
> GenBank file I have ever seen with a source feature, there is only
> one feature of this type and it spans the full sequence.
>
> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
> plasmid pPCP1, complete sequence:
>
> source      1..9609
>             /organism="Yersinia pestis biovar Microtus str. 91001"
>             /mol_type="genomic DNA"
>             /strain="91001"
>             /db_xref="taxon:229193"
>             /plasmid="pPCP1"
>             /biovar="Microtus"
>
> (I reduced the white space for emailing). All of that information
> makes sense as annotation for the whole sequence. In fact, the
> "organism" entry is duplicated on the ORGANISM line in the
> GenBank header (and the SOURCE line too).
>
> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
> associated with a "source" feature in the seqfeature table.
>
> I am suggesting it could make more sense to store the "source"
> feature annotation at the sequence level, using instead the
> bioentry_qualifier_value and bioentry_dbxref tables.
>
> This is a slight shift from the origins of BioSQL as a schema to
> hold GenBank files - but to me at least it is more logical.
>
> What does everyone else think? Things work as they are...
> and "if it ain't broken don't fix it"?
>
> Peter
>
> [Even if Biopython changes its internal object structure to treat
> the "source" feature annotation as sequence level annotation,
> we *could* continue to use a "source" feature when loading
> GenBank files to/from BioSQL if required for compatibility with
> the other Bio* projects. It would be more work though. In any
> case, we'd also need to recreate a "source" feature when
> writing GenBank output files.]

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================