[BioSQL-l] What should source_term_id in table seqfeature refer to?

Richard Holland holland at eaglegenomics.com
Tue Aug 11 09:22:41 UTC 2009


Ideally there would be two fields in place of source_term_id: one for
the algorithm used to generate the data (e.g. BLAST, miRanda), the
other for the source the data came from (e.g. GenBank, miRBase). These
are two very distinct concepts and it is not easy to represent both
with a single source_term_id ontology field. So if you need to
represent both algorithm and source, the only way round it is to
create your own ontology which is a cross-product of the two possible
sets of values (triples would be good for this).

If you want to use only a single term, it's basically up to you
whether you annotate by algorithm (miRanda) or by source (miRBase). I
expect the decision will rest on which matters more to you: knowing
which features in your database were added locally and which came from
a remote source, or knowing the algorithm used to generate them. If
both are important, the cross-product triple approach is probably the
only way to go.
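
For what it's worth, here is a rough sketch of the cross-product idea
in Python with an in-memory SQLite database. The tables are a heavily
simplified stand-in for the BioSQL ontology, term, term_relationship
and seqfeature tables, not the real schema. The ontology names, the
"miRanda/miRBase" naming convention, the predicate names
"generated_by" and "obtained_from", and all the function names are
made up for illustration only.

```python
import sqlite3

# Simplified stand-in for the BioSQL tables (assumption: most columns omitted).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ontology (
    ontology_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    ontology_id INTEGER NOT NULL REFERENCES ontology(ontology_id),
    UNIQUE (name, ontology_id)
);
CREATE TABLE term_relationship (
    subject_term_id INTEGER NOT NULL REFERENCES term(term_id),
    predicate_term_id INTEGER NOT NULL REFERENCES term(term_id),
    object_term_id INTEGER NOT NULL REFERENCES term(term_id),
    ontology_id INTEGER NOT NULL REFERENCES ontology(ontology_id),
    UNIQUE (subject_term_id, predicate_term_id, object_term_id, ontology_id)
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    source_term_id INTEGER NOT NULL REFERENCES term(term_id)
);
""")


def ontology_id(conn, name):
    """Return the id of the named ontology, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO ontology (name) VALUES (?)", (name,))
    return conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?", (name,)
    ).fetchone()[0]


def term_id(conn, name, onto):
    """Return the id of the named term in an ontology, creating it if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO term (name, ontology_id) VALUES (?, ?)",
        (name, onto),
    )
    return conn.execute(
        "SELECT term_id FROM term WHERE name = ? AND ontology_id = ?",
        (name, onto),
    ).fetchone()[0]


def combined_source_term(conn, algorithm, source):
    """Get or create one cross-product term and link it to its two parts."""
    algo = term_id(conn, algorithm, ontology_id(conn, "Algorithm"))
    src = term_id(conn, source, ontology_id(conn, "Data source"))
    combo_onto = ontology_id(conn, "Algorithm x Source")
    rel_onto = ontology_id(conn, "Relationship types")
    combo = term_id(conn, f"{algorithm}/{source}", combo_onto)
    # Two triples: (combo, generated_by, algorithm), (combo, obtained_from, source).
    for predicate, obj in (("generated_by", algo), ("obtained_from", src)):
        pred = term_id(conn, predicate, rel_onto)
        conn.execute(
            "INSERT OR IGNORE INTO term_relationship VALUES (?, ?, ?, ?)",
            (combo, pred, obj, combo_onto),
        )
    return combo


def describe_source(conn, seqfeature_id):
    """Map predicate name to term name for a feature's combined source term."""
    rows = conn.execute(
        """
        SELECT p.name, o.name
        FROM seqfeature f
        JOIN term_relationship r ON r.subject_term_id = f.source_term_id
        JOIN term p ON p.term_id = r.predicate_term_id
        JOIN term o ON o.term_id = r.object_term_id
        WHERE f.seqfeature_id = ?
        """,
        (seqfeature_id,),
    ).fetchall()
    return dict(rows)


combo = combined_source_term(conn, "miRanda", "miRBase")
conn.execute(
    "INSERT INTO seqfeature (seqfeature_id, source_term_id) VALUES (1, ?)",
    (combo,),
)
print(describe_source(conn, 1))
```

The feature still carries a single source_term_id, so nothing about
seqfeature itself changes. The algorithm and the source are recovered
by following the two triples from the combined term.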

cheers,
Richard

On 11 Aug 2009, at 10:09, Florian Mittag wrote:

> Hm, I should've mentioned my real concern. We're integrating all kinds of
> data into the database and right now I want to import miRNA information
> (sequences and target sites) from miRBase
> (http://microrna.sanger.ac.uk/sequences/).
> The files I download from there specify "miRanda" as METHOD, so should I
> use this as source term or miRBase?
>
> Thanks,
> - Florian
>
> On Tuesday, 11. August 2009 10:59, Richard Holland wrote:
>> The reason BJX does that is because the Genbank format has no
>> indication of where a feature came from. So, all there is to go on is
>> that it came from Genbank! This allows us to differentiate between
>> features on a sequence that were loaded from an original file, and new
>> features that have been added to the sequence in the db after it was
>> loaded (e.g. by running blast, blat etc. against some local data).
>>
>> On 11 Aug 2009, at 09:10, Florian Mittag wrote:
>>> Hi!
>>>
>>> I stumbled upon an old post from Hilmar:
>>>
>>> On Tue, 18 Mar 2003, Hilmar Lapp wrote:
>>>> type_term_id is supposed to reference an SO term. source is supposed to
>>>> denote the 'method' (BLAST, BLAT, sim4, genewise, whatnot), as far as
>>>> my understanding goes. In the case of reading the features from a
>>>> GenBank feature table, assigning 'Genbank/EMBL/Swissprot' as the source
>>>> (which is what the genbank, embl, and swissprot parsers do in bioperl)
>>>> is maybe stretching the definition, but I don't have something
>>>> substantially better to offer.
>>>
>>> I inspected the database after I imported some Genbank files with
>>> BioJava, and I found that the source_term_id for the seqfeatures is
>>> always set to the ID of an automatically inserted term "Genbank" with
>>> definition "auto-generated by biojavax".
>>>
>>> I was wondering if there is anything new to the source_term_id.
>>>
>>> - Florian
>>
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>
> -- 
> Dipl. Inf. Florian Mittag
> Universität Tuebingen
> WSI-RA, Sand 1
> 72076 Tuebingen, Germany
> Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/




