[BioSQL-l] What should source_term_id in table seqfeature refer to?

Hilmar Lapp hlapp at gmx.net
Sat Aug 15 19:31:13 UTC 2009


On Aug 15, 2009, at 12:32 PM, Richard Holland wrote:

> [...]
> Case study:

Great, now we're getting somewhere :-)

> I download some seqs from Genbank. (Which then need to be annotated  
> as having come from Genbank, at the sequence level).

Note, as you say, *at the sequence level*. I.e., you would record this  
either using the bioentry's namespace (biodatabase), or a  
bioentry_qualifier_value annotation. I would choose the former, though  
since a bioentry can only be in one namespace, it may not satisfy  
your needs.
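
To make the two options concrete, here is a rough sketch in Python
using an in-memory SQLite database. The tables are cut down to the few
columns that matter for this point and are not the full BioSQL DDL.
The accession 'X00001', the 'origin' term, and the 'local mirror'
value are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down stand-ins for the BioSQL tables involved (assumption: only
# the columns needed for this example are included).
conn.executescript("""
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL REFERENCES biodatabase(biodatabase_id),
    accession TEXT NOT NULL
);
CREATE TABLE bioentry_qualifier_value (
    bioentry_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    term_id INTEGER NOT NULL REFERENCES term(term_id),
    value TEXT,
    rank INTEGER NOT NULL DEFAULT 0,
    UNIQUE (bioentry_id, term_id, rank)
);
""")

# Option 1: the namespace. A bioentry has exactly one biodatabase_id,
# so it can sit in only one namespace.
conn.execute("INSERT INTO biodatabase (biodatabase_id, name) VALUES (1, 'GenBank')")
conn.execute(
    "INSERT INTO bioentry (bioentry_id, biodatabase_id, accession) "
    "VALUES (1, 1, 'X00001')"  # hypothetical accession
)

# Option 2: a tag/value annotation. Several values per entry are
# possible, told apart by rank. 'origin' is a made-up term name.
conn.execute("INSERT INTO term (term_id, name) VALUES (1, 'origin')")
conn.executemany(
    "INSERT INTO bioentry_qualifier_value (bioentry_id, term_id, value, rank) "
    "VALUES (1, 1, ?, ?)",
    [("GenBank", 0), ("local mirror", 1)],
)

namespace = conn.execute(
    "SELECT d.name FROM bioentry e "
    "JOIN biodatabase d ON d.biodatabase_id = e.biodatabase_id "
    "WHERE e.bioentry_id = 1"
).fetchone()[0]

origins = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM bioentry_qualifier_value "
        "WHERE bioentry_id = 1 AND term_id = 1 ORDER BY rank"
    )
]

print(namespace, origins)
```

The namespace gives you one answer per entry; the qualifier route is
the one to take if an entry needs more than one.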

> They already have some features on them (which need to be annotated  
> as having come from Genbank, at the feature level, but of an unknown  
> algorithm as Genbank doesn't specify how they were generated usually).

Right. The source term would indicate that GenBank provided them to  
you, and that that's all you know.

> I then run BLAST of those sequences against some local data, and  
> record my own features as a result. I also run BLAT, and again  
> record my own features.

BLAST and BLAT would now be the source terms.

> My colleague also runs BLAST of the same seqs against some data of  
> his own, and wants our combined feature results to be stored in the  
> same database. I want to be able to annotate all these new features  
> both with the algorithm used to generate them (BLAST or BLAT)

You use the source term for that.

> and who did it (myself or my colleague at the institute down the road)

Ah - that's provenance information, not the source as the term is  
normally understood. BioSQL at present doesn't have an explicit  
provenance model, but you can still record provenance information  
through ontology-typed tag/value annotation in  
seqfeature_qualifier_value, with the terms coming from a provenance  
ontology (that you make up yourself or grab from somewhere else).
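
As a rough sketch of how that could look, again in Python with an
in-memory SQLite database: the tables below are reduced to the columns
relevant here (the real seqfeature table has more, e.g. rank and
display_name), and the ontology names, the 'performed_by' term, the
feature types, and the helper functions are all invented for
illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced stand-ins for the BioSQL tables (assumption: not the full DDL).
conn.executescript("""
CREATE TABLE ontology (
    ontology_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    ontology_id INTEGER NOT NULL REFERENCES ontology(ontology_id),
    UNIQUE (name, ontology_id)
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id INTEGER NOT NULL,
    type_term_id INTEGER NOT NULL REFERENCES term(term_id),
    source_term_id INTEGER NOT NULL REFERENCES term(term_id)
);
CREATE TABLE seqfeature_qualifier_value (
    seqfeature_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    term_id INTEGER NOT NULL REFERENCES term(term_id),
    rank INTEGER NOT NULL DEFAULT 0,
    value TEXT NOT NULL,
    UNIQUE (seqfeature_id, term_id, rank)
);
""")


def get_term_id(conn, ontology, name):
    """Return the term_id for (ontology, name), creating both if needed."""
    conn.execute("INSERT OR IGNORE INTO ontology (name) VALUES (?)", (ontology,))
    (ont_id,) = conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?", (ontology,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO term (name, ontology_id) VALUES (?, ?)",
        (name, ont_id),
    )
    (tid,) = conn.execute(
        "SELECT term_id FROM term WHERE name = ? AND ontology_id = ?",
        (name, ont_id),
    ).fetchone()
    return tid


def add_feature(conn, bioentry_id, ftype, source, performed_by=None):
    """Store a feature; the source is a term, the performer a qualifier."""
    cur = conn.execute(
        "INSERT INTO seqfeature (bioentry_id, type_term_id, source_term_id) "
        "VALUES (?, ?, ?)",
        (
            bioentry_id,
            get_term_id(conn, "Feature Keys", ftype),      # illustrative name
            get_term_id(conn, "Feature Sources", source),  # illustrative name
        ),
    )
    feature_id = cur.lastrowid
    if performed_by is not None:
        # Provenance goes into tag/value annotation, typed by a term
        # from a made-up 'Provenance' ontology.
        conn.execute(
            "INSERT INTO seqfeature_qualifier_value "
            "(seqfeature_id, term_id, rank, value) VALUES (?, ?, 0, ?)",
            (feature_id, get_term_id(conn, "Provenance", "performed_by"), performed_by),
        )
    return feature_id


# The case study: features shipped with GenBank, plus local analyses.
add_feature(conn, 1, "CDS", "GenBank")
add_feature(conn, 1, "similarity", "BLAST", performed_by="me")
add_feature(conn, 1, "similarity", "BLAT", performed_by="me")
add_feature(conn, 1, "similarity", "BLAST", performed_by="colleague")

rows = conn.execute(
    "SELECT src.name, q.value FROM seqfeature f "
    "JOIN term src ON src.term_id = f.source_term_id "
    "LEFT JOIN seqfeature_qualifier_value q "
    "  ON q.seqfeature_id = f.seqfeature_id AND q.term_id = ? "
    "ORDER BY f.seqfeature_id",
    (get_term_id(conn, "Provenance", "performed_by"),),
).fetchall()
print(rows)
```

The point of the split is that source_term_id holds a single term per
feature, whereas the qualifier table can take as many provenance
tag/value pairs as you care to record.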

> , in addition to retaining the original features that came from  
> Genbank (and making sure they're annotated as such).

That shouldn't be a problem - certainly it's not for BioSQL.

> Hence I'd need a source attribute for the sequence (Genbank in this  
> case), a source attribute for each feature (Genbank, Me, or  
> Colleague X, in this case), and an algorithm/technique/protocol  
> attribute for each feature (BLAST or BLAT or 'don't know it just  
> came from Genbank' in this example).

Not quite - source really is what provided the feature to you, not who  
or when, or using which BLAST database, genome assembly, or how you  
parsed the results, etc etc. That's all provenance information.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================