[BioSQL-l] Treating GenBank source features as top level annotation

Wed Nov 18 12:08:48 UTC 2009

BioJava's latest parsers do the following:

On read:

  SOURCE and ORGANISM top-level tags are completely ignored
  For each tag in each feature, including source:
    If it's a db_xref
       If it's a taxon reference, set the taxon ID in the BioEntry table
       Otherwise store the db_xref as a CrossRef table entry for the feature
    If it's organism
       Add the organism name to that taxon ID in the Taxon table, using the scientific taxon name type
    Otherwise
       Map the tag as a feature qualifier value (this applies to the source feature too)

  Two caveats: if the source feature has no /db_xref="taxon:..." tag, the taxonomy does not get stored at all. If it has no /organism tag, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL.
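
As a rough illustration of the read rules above (this is not BioJava code, just a Python sketch), the routing of tags could look like the following. The MappedEntry class, the map_on_read function and the (feature_key, [(tag, value), ...]) input shape are all made up for the example, and it assumes a taxon db_xref in any feature sets the entry's taxon ID, which is how the rules read literally.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MappedEntry:
    """Hypothetical container showing where each tag would end up."""
    taxon_id: Optional[int] = None          # BioEntry taxon ID
    organism: Optional[str] = None          # scientific name for that taxon
    dbxrefs: List[Tuple[int, str, str]] = field(default_factory=list)
    qualifiers: List[Tuple[int, str, str]] = field(default_factory=list)


def map_on_read(features):
    """Route feature tags; features is [(feature_key, [(tag, value), ...])]."""
    entry = MappedEntry()
    organism = None
    for index, (_key, tags) in enumerate(features):
        for tag, value in tags:
            if tag == "db_xref":
                db, _, accession = value.partition(":")
                if db == "taxon":
                    entry.taxon_id = int(accession)
                else:
                    # Any other db_xref becomes a cross-reference of the feature.
                    entry.dbxrefs.append((index, db, accession))
            elif tag == "organism":
                organism = value
            else:
                # Everything else is a plain feature qualifier value.
                entry.qualifiers.append((index, tag, value))
    # Without a taxon ID no taxonomy is stored, so the name is dropped too.
    if entry.taxon_id is not None:
        entry.organism = organism
    return entry


# The source feature of NC_005816 as quoted in this thread.
example = map_on_read([
    ("source", [
        ("organism", "Yersinia pestis biovar Microtus str. 91001"),
        ("mol_type", "genomic DNA"),
        ("strain", "91001"),
        ("db_xref", "taxon:229193"),
        ("plasmid", "pPCP1"),
        ("biovar", "Microtus"),
    ]),
])
```

The sketch leaves out the NCBI default-name fallback, since that depends on what taxonomy data is already loaded in the database.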

On write:

   SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence
   All features get their qualifier values output, plus /db_xref tags for all entries from the CrossRef table for the feature
   The source feature is output like any other feature, plus /organism and /db_xref="taxon:..." tags generated from the same taxon entry as the SOURCE and ORGANISM tags
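
The write side can be sketched in the same spirit. Again this is illustrative Python rather than the BioJava implementation: write_entry and its argument shapes are invented for the example, the organism name is assumed to have been looked up from the taxon ID already, and the output layout is only an approximation of GenBank formatting (no lineage lines, no line wrapping).

```python
def write_entry(taxon_id, organism, features):
    """Return simplified GenBank-style lines.

    features is a list of (key, location, qualifiers, dbxrefs), where
    qualifiers and dbxrefs are lists of (name, value) pairs.
    """
    indent = " " * 21
    lines = []
    if taxon_id is not None:
        # Header tags come from the entry's taxon, not from any feature.
        lines.append("SOURCE      " + organism)
        lines.append("  ORGANISM  " + organism)
    for key, location, qualifiers, dbxrefs in features:
        lines.append("     %-16s%s" % (key, location))
        for tag, value in qualifiers:
            lines.append('%s/%s="%s"' % (indent, tag, value))
        for db, accession in dbxrefs:
            lines.append('%s/db_xref="%s:%s"' % (indent, db, accession))
        if key == "source" and taxon_id is not None:
            # Regenerated from the same taxon entry as the header tags.
            lines.append('%s/organism="%s"' % (indent, organism))
            lines.append('%s/db_xref="taxon:%d"' % (indent, taxon_id))
    return lines


example_lines = write_entry(
    229193,
    "Yersinia pestis biovar Microtus str. 91001",
    [("source", "1..9609", [("mol_type", "genomic DNA")], [])],
)
```

Note that /organism and the taxon /db_xref are never stored as qualifiers of the source feature here; they only reappear at output time.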

The main reason we still use the source feature, rather than moving this to sequence level, is that when converting between formats it's hard to tell which sequence-level qualifier values came from the source feature and which came from elsewhere.

The main reason we rely entirely on the source feature for organism and taxon ID info is that it's much easier to parse than the SOURCE and ORGANISM tags.

cheers,
Richard

On 18 Nov 2009, at 11:06, Peter wrote:

> Hello all,
> 
> Something we've just been discussing on the Biopython mailing list
> is a possible change to how we parse the source features in GenBank
> (or EMBL) files. This could have knock-on implications for how we use
> BioSQL. For anyone interested, the thread is here:
> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
> 
> The basic observation is that GenBank files do not have any extensible
> annotation block for the whole sequence. There are a few fields like
> the comment, organism and taxonomy - but nothing general and
> structured. Instead, it seems the NCBI etc decided to use the feature
> table for this task by inventing the "source" feature. In every single
> GenBank file I have ever seen with a source feature, there is only
> one feature of this type and it spans the full sequence.
> 
> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
> plasmid pPCP1, complete sequence:
> 
> source      1..9609
>             /organism="Yersinia pestis biovar Microtus str. 91001"
>             /mol_type="genomic DNA"
>             /strain="91001"
>             /db_xref="taxon:229193"
>             /plasmid="pPCP1"
>             /biovar="Microtus"
> 
> (I reduced the white space for emailing). All of that information
> makes sense as annotation for the whole sequence. In fact, the
> "organism" entry is duplicated on the ORGANISM line in the
> GenBank header (and the SOURCE line too).
> 
> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
> associated with a "source" feature in the seqfeature table.
> 
> I am suggesting it could make more sense to store the "source"
> feature annotation at the sequence level, using instead the
> bioentry_qualifier_value and bioentry_dbxref tables.
> 
> This is a slight shift from the origins of BioSQL as a schema to
> hold GenBank files - but to me at least it is more logical.
> 
> What does everyone else think? Things work as they are...
> and "if it ain't broken don't fix it"?
> 
> Peter
> 
> [Even if Biopython changes its internal object structure to treat
> the "source" feature annotation as sequence level annotation,
> we *could* continue to use a "source" feature when loading
> GenBank files to/from BioSQL if required for compatibility with
> the other Bio* projects. It would be more work though. In any
> case, we'd also need to recreate a "source" feature when
> writing GenBank output files.]

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/