[Open-bio-l] RE: [GMOD-devel] What is the clear distinction between a feature and a bioentry

Hilmar Lapp hlapp@gnf.org
Thu, 23 May 2002 09:24:50 -0700


> -----Original Message-----
> From: Elia Stupka [mailto:elia@fugu-sg.org]
> Sent: Thursday, May 23, 2002 5:41 AM
> To: Hilmar Lapp; GMOD Devel (E-mail); OBDA BioSQL (E-mail)
> Cc: td2@sanger.ac.uk; Ewan Birney (E-mail); cjm@fruitfly.org
> Subject: Re: [GMOD-devel] What is the clear distinction between a
> feature and a bioentry
> 
> 
> > Bioentry vs. Feature: we decided that everything that
> > - lives in a namespace (biodatabase), and
> > - has a stable accession and/or ID, and
> > - has a sequence (physically in the database or not)
> > shall be a Bioentry. Features shall be essentially lightweight objects.
> 
> Just wondering, wouldn't you want to treat a unigene cluster as a bioentry
> even though it doesn't have a real sequence, but is just a collection of
> sequences? We allow bioentries not to have sequences, and I like that...

A Unigene cluster has a consensus sequence, so it would meet the definition. We're producing other sequence clusterings too, and they also have a consensus sequence, though that may not hold in general. The definition given above should be taken as a guideline for our local build. The bottom line is: if 2 or more of the 3 conditions are met, the 'right' thing to do is to make it a bioentry. If only 1 or 0 are met, it's most likely a feature.

I didn't mean in any way that Bioentries must physically have a sequence in the database. They may not even be associated with a sequence virtually, in which case they should still meet the other 2 requirements.
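
To make the guideline concrete, here is a minimal sketch in Python of the "2 or more of 3" rule described above. The names Candidate and classify are hypothetical and not part of BioSQL, and the condition values given for the example exon are an assumption for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical description of something about to be loaded."""
    in_namespace: bool    # lives in a namespace (biodatabase)
    has_stable_id: bool   # has a stable accession and/or ID
    has_sequence: bool    # has a sequence, physically in the database or not


def classify(c: Candidate) -> str:
    """Apply the guideline: 2 or 3 conditions met -> bioentry, else feature."""
    met = sum([c.in_namespace, c.has_stable_id, c.has_sequence])
    return "bioentry" if met >= 2 else "feature"


# A gene lives in a namespace and has a stable ID, but no sequence of its own.
gene = Candidate(in_namespace=True, has_stable_id=True, has_sequence=False)

# Assumption for illustration: an exon meeting only the sequence condition.
exon = Candidate(in_namespace=False, has_stable_id=False, has_sequence=True)

print(classify(gene))
print(classify(exon))
```

Since the rule is only a guideline for one local build, a real loader would probably allow overriding the result per data source.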

I'm more than happy to hear (and adopt) other definitions of how you decide what goes where (and it does have consequences). I just think there needs to be a common definition as it's too arbitrary otherwise, and we (locally) needed something now in order to move forward.

> 
> > 1) As much as possible, Bioentries will be mapped down to chromosomes, even if the
> > datasource only gives the coordinates to contigs. (I think this also aligns it better with
> > Lincoln's DB:GFF view.) Contigs will be retained in the database though, in case they are
> > needed at some time as an entry point.
> 
> just wondering again, we are not saying all bioentries need to be located on
> a genome, are we? I am just trying to make sure we keep it generic and able
> to deal with much more than genome annotations...

Right. Many bioentries may not have a location at all, let alone on the genome. Rather, it means: if your data source maps bioentries to other bioentries, which in turn are mapped to chromosomes, do the math right away rather than on the fly when a query comes in. It doesn't really affect the schema.
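
As an illustration of "doing the math right away", the following Python sketch converts a location given on a contig into chromosome coordinates, given the contig's own placement on the chromosome. It assumes 1-based inclusive coordinates and strands encoded as +1/-1; the names Location and to_chromosome are hypothetical and not part of the BioSQL schema.

```python
from typing import NamedTuple


class Location(NamedTuple):
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 or -1


def to_chromosome(on_contig: Location, contig_on_chrom: Location) -> Location:
    """Project a location on a contig onto the chromosome the contig sits on."""
    if contig_on_chrom.strand == 1:
        # Contig is in chromosome orientation: shift by the contig's offset.
        start = contig_on_chrom.start + on_contig.start - 1
        end = contig_on_chrom.start + on_contig.end - 1
    else:
        # Contig is reverse-complemented: count back from the contig's end.
        start = contig_on_chrom.end - on_contig.end + 1
        end = contig_on_chrom.end - on_contig.start + 1
    return Location(start, end, on_contig.strand * contig_on_chrom.strand)


# An entry at 101..200 (+) on a contig placed at 1001..2000 (+) on the chromosome.
forward = to_chromosome(Location(101, 200, 1), Location(1001, 2000, 1))

# The same entry when the contig is placed on the reverse strand.
reverse = to_chromosome(Location(101, 200, 1), Location(1001, 2000, -1))

print(forward)
print(reverse)
```

The result would be computed once at load time and stored, so that queries never have to walk the contig mapping.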

It also states (well, not explicitly) that features will /not/ be mapped to the genome (see the exon example in 2)). They will be mapped to the bioentries on which they are located. Otherwise the genome would have to be represented as bioentries (given the present schema), which is ugly IMHO. Chromosomes don't live in a namespace: if Celera and Ensembl talk about Homo sapiens chromosome 1, they should mean the same thing. It's the taxon that distinguishes one chromosome 1 from another, not a namespace.

Again, I'm more than happy to hear other opinions, and I'm convinced that lots of other people out there have 10x more experience with all this than I have. The mapping to a genome is, however, at this point missing from BioSQL, and we here really need to move forward.

> 
> > 2) According to the definition given above, Genes, transcripts, and proteins will all go into
> > Bioentry. Exons will be features (and therefore not directly mapped to chromosomes).
> 
> Genes as such don't have sequences, only their transcripts do, so it
> wouldn't fit the definition, unless you allow bioentries without sequences

Right. Genes nevertheless live in a namespace (for now, as there is no universally accepted set of genes per organism yet), and they do have stable IDs. The concept of genes is quite hairy anyway; Serge next door tells me that at Fantom2 there was an extended discussion among 20 scientists who had 19 differing opinions about what a gene is ...

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------