[BioSQL-l] Treating GenBank source features as top level annotation

Hilmar Lapp hlapp at gmx.net
Wed Nov 18 14:28:01 UTC 2009


True - for chimeric sequences you can have multiple sources. That  
should be recognizable though from the length (and span) of the source  
feature location?
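
A minimal sketch of that check, assuming each source feature location has already been reduced to a simple 1-based, inclusive (start, end) pair. The function names and the chimeric coordinates in the usage note are hypothetical, made up for illustration, and not taken from any of the Bio* toolkits:

```python
def classify_sources(source_spans, seq_length):
    """Split source feature spans into whole-sequence and partial ones.

    source_spans: list of (start, end) tuples, 1-based and inclusive,
    as in a GenBank location such as 1..9609.
    """
    whole, partial = [], []
    for start, end in source_spans:
        if start == 1 and end == seq_length:
            whole.append((start, end))
        else:
            partial.append((start, end))
    return whole, partial


def is_sequence_level(source_spans, seq_length):
    """True only if there is exactly one source and it spans the full sequence."""
    whole, partial = classify_sources(source_spans, seq_length)
    return len(whole) == 1 and not partial
```

With the NC_005816 example from Peter's message, is_sequence_level([(1, 9609)], 9609) is true. A hypothetical chimeric record with two sources at 1..100 and 101..250 on a 250 bp sequence would come back false, so its sources would stay as ordinary features.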

	-hilmar

On Nov 18, 2009, at 8:10 AM, Chris Fields wrote:

> Just to note, there are a few cases where there are two or more  
> source features.  This pops up mainly with chimeric sequences, for  
> example:
>
> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>
> We have run into this a couple of times on the bioperl list.  In  
> this case, each feature is limited to specific locations on the  
> sequence and doesn't pertain to the entire sequence.  NCBI only  
> notes the first source on the ORGANISM line; last time I checked,  
> EMBL used both.
>
> chris
>
> On Nov 18, 2009, at 6:08 AM, Richard Holland wrote:
>
>> BioJava's latest parsers do the following:
>>
>> On read:
>>
>> SOURCE and ORGANISM top-level tags are completely ignored
>> For each tag in each feature, including source:
>>   If it's a dbxref
>>      If it's taxon, set the taxon ID in the BioEntry table (if no
>> /taxon is specified in the source feature the taxonomy does not get
>> stored at all)
>>      Otherwise set dbxref as a feature CrossRef table entry
>>   If it's organism
>>      Add the organism name to the taxon ID in the Taxon table using  
>> the scientific taxon name type (if no /organism tag is specified in  
>> the source feature, the taxon gets the default name from NCBI, but  
>> only if the NCBI taxonomy data is already present in BioSQL) (if  
>> no /taxon is specified in the source feature, then the taxonomy  
>> does not get stored at all)
>>   Otherwise
>>      All other tags get mapped as feature qualifier values,  
>> including the source feature
>>
>> On write:
>>
>>  SOURCE and ORGANISM tags are generated from the BioEntry taxon ID  
>> entry for the sequence,
>>  All features get qualifier values output plus /db_xref tags for  
>> all entries from the CrossRef table for the feature,
>>  The source feature is output as per a normal feature, plus
>> /organism and /db_xref="taxon:..." tags generated as per the SOURCE
>> and ORGANISM tags
>>
>> The main reason why we still use the source feature and don't go to  
>> sequence level is because when converting between formats it's hard  
>> to tell which sequence-level qualifier_values are from the source  
>> feature and which are from other places.
>>
>> The main reason why we rely entirely on the source feature for  
>> organism and taxon ID info is because it's much easier to parse  
>> than the SOURCE and ORGANISM tags.
>>
>> cheers,
>> Richard
>>
>> On 18 Nov 2009, at 11:06, Peter wrote:
>>
>>> Hello all,
>>>
>>> Something we've just been discussing on the Biopython mailing list
>>> is a possible change to how we parse the source features in GenBank
>>> (or EMBL) files. This could have knock-on implications for how we use
>>> BioSQL. For anyone interested, the thread is here:
>>> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
>>>
>>> The basic observation is that GenBank files do not have any extensible
>>> annotation block for the whole sequence. There are a few fields like
>>> the comment, organism and taxonomy - but nothing general and
>>> structured. Instead, it seems the NCBI etc decided to use the feature
>>> table for this task by inventing the "source" feature. In every single
>>> GenBank file I have ever seen with a source feature, there is only
>>> one feature of this type and it spans the full sequence.
>>>
>>> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
>>> plasmid pPCP1, complete sequence:
>>>
>>> source      1..9609
>>>           /organism="Yersinia pestis biovar Microtus str. 91001"
>>>           /mol_type="genomic DNA"
>>>           /strain="91001"
>>>           /db_xref="taxon:229193"
>>>           /plasmid="pPCP1"
>>>           /biovar="Microtus"
>>>
>>> (I reduced the white space for emailing). All of that information
>>> makes sense as annotation for the whole sequence. In fact, the
>>> "organism" entry is duplicated on the ORGANISM line in the
>>> GenBank header (and the SOURCE line too).
>>>
>>> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
>>> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
>>> associated with a "source" feature in the seqfeature table.
>>>
>>> I am suggesting it could make more sense to store the "source"
>>> feature annotation at the sequence level, using instead the
>>> bioentry_qualifier_value and bioentry_dbxref tables.
>>>
>>> This is a slight shift from the origins of BioSQL as a schema to
>>> hold GenBank files - but to me at least it is more logical.
>>>
>>> What does everyone else think? Things work as they are...
>>> and "if it ain't broken don't fix it"?
>>>
>>> Peter
>>>
>>> [Even if Biopython changes its internal object structure to treat
>>> the "source" feature annotation as sequence level annotation,
>>> we *could* continue to use a "source" feature when loading
>>> GenBank files to/from BioSQL if required for compatibility with
>>> the other Bio* projects. It would be more work though. In any
>>> case, we'd also need to recreate a "source" feature when
>>> writing GenBank output files.]
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>>
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>
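
For reference, the read-side rules in Richard's quoted message could be sketched roughly as below. This is only an illustration of the routing he describes, not BioJava's actual code: the function name, the dictionary keys and the (tag, value) input shape are all assumptions.

```python
def route_source_qualifiers(qualifiers):
    """Route (tag, value) pairs from one feature, following the quoted rules.

    Hypothetical helper: returns a dict with the taxon ID, the organism
    name, non-taxon dbxrefs and the remaining qualifier values.
    """
    routed = {"taxon_id": None, "organism": None,
              "dbxrefs": [], "qualifier_values": []}
    for tag, value in qualifiers:
        if tag == "db_xref":
            db, _, accession = value.partition(":")
            if db == "taxon":
                routed["taxon_id"] = int(accession)
            else:
                routed["dbxrefs"].append((db, accession))
        elif tag == "organism":
            routed["organism"] = value
        else:
            routed["qualifier_values"].append((tag, value))
    # Per the quoted description: without a taxon dbxref the taxonomy
    # is not stored at all, so the organism name is dropped too.
    if routed["taxon_id"] is None:
        routed["organism"] = None
    return routed


# Qualifiers of the NC_005816 source feature quoted in Peter's message.
example = [
    ("organism", "Yersinia pestis biovar Microtus str. 91001"),
    ("mol_type", "genomic DNA"),
    ("strain", "91001"),
    ("db_xref", "taxon:229193"),
    ("plasmid", "pPCP1"),
    ("biovar", "Microtus"),
]
routed = route_source_qualifiers(example)
```

Here routed["taxon_id"] is 229193, and mol_type, strain, plasmid and biovar end up as plain qualifier values. Whether those then go to the seqfeature or the bioentry tables is exactly the open question in this thread.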

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================