[GMOD-devel] Re: [Open-bio-l] Schema for genes & features &mappings to assemblies

Lincoln Stein lstein@cshl.org
Wed, 1 May 2002 21:58:01 -0400


The way this is done in Bio::DB::GFF is via Homology blocks which
relate a range in one sequence to a range in another.  We (mis)use the
GFF Source field to indicate the evidence type.

Lincoln

Hilmar Lapp writes:
 > I like this proposal, but ...
 > 
 > > 9. project-centric column names like "chromosome" are avoided; eg
 > >  drosophila has chromosome arms as top level sequences
 > 
 > So, I'm still confused about how I am supposed to store gene predictions, EST, RefSeq, or whatever mappings to chromosomes in an assembly, such that I can answer queries like 'show me all exons of genes and their lines of evidence that map between markers X and Y on chromosome 5 of mouse. Next, show how human genes map to this region, and which human chromosomes.'
 > 
 > Maybe someone can help me lifting my confusion.
 > 
 > How is this done in GMOD and Ensembl, and how does that map to BioSQL with the assembly proposal below?
 > 
 > 	-hilmar
 > -- 
 > -------------------------------------------------------------
 > Hilmar Lapp                            email: lapp@gnf.org
 > GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
 > -------------------------------------------------------------
 > 
 > 
 > 
 > > -----Original Message-----
 > > From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
 > > Sent: Wednesday, April 24, 2002 6:31 PM
 > > To: Lincoln Stein
 > > Cc: Elia Stupka; Thomas Down; Ewan Birney; Hilmar Lapp; GMOD Devel
 > > (E-mail); OBDA BioSQL (E-mail)
 > > Subject: Re: [GMOD-devel] Re: [Open-bio-l] Schema for genes & features
 > > &mappings to assemblies
 > > 
 > > 
 > > 
 > > Here is an example of one way of doing things such that we can all
 > > agree to disagree yet remain one happy family.
 > > 
 > > It's not perfect, but I think it's better than the alternative which
 > > seems to be to solidify a compromise schema which no ones really happy
 > > with, or force everyone to use overcomplex adapters.
 > > 
 > > It's a component-based solution rather than a monolithic one, the SQL
 > > DDL follows the description below
 > > 
 > >  ------
 > > 
 > > 1. Definitions (up for debate)
 > >    1 level assembly - features all stored on top level seqs
 > >              assembly table may still be useful; eg for getting
 > >              entry units - or seqfeatures could be used instead,
 > >              e.g. like GGB
 > >    2 level assembly - e.g. contigs on a chromosome. unspecified as
 > >              to whether features live on contigs, or both levels
 > >    n level assembly - e.g. chroms, contigs, reads. unspecified as
 > >              to whether features live on mixed levels, and to whether
 > >              the depth is fixed or variable
 > > 
 > > 2. All client code can expect 2 relations to be present: assembly, and
 > >  dnafrag, defined below.
 > > 
 > > 3. Client code can assume 2 level assemblies by default. Adaptors
 > >  should take care of transformations and/or the lite-client-bridge
 > >  (see 5 below) can be used
 > > 
 > > 4. Client code written expecting flat assemblies (ie ignoring the
 > >  assembly relation altogether) won't break, but they
 > >  will display incomplete data (ie missing chrom features from contigs
 > >  or vice versa) IF the data is stored in a 2 level manner.
 > > 
 > > 5. An optional bridge layer is provided for lite-clients that 
 > > expect all
 > >  the data to be present in a flat assembly. This layer is sufficient
 > >  for read-only, but currently not for updates (although it could be
 > >  extended to do so). This layer is also useful for direct data
 > >  exploration via SQL.
 > > 
 > > 6. n-level assemblies are not assumed as default. Code 
 > > assuming n-level
 > >  assemblies will obviously work as n-level subsumes 1/2 level. An
 > >  n-level assembly component can be used, a view is used so that the
 > >  core 2-level assembly model is supported, although specialized
 > >  n-level assembly update code would be required.
 > > 
 > > 7. Views are utilized but this doesn't marginalise mysql pre 4.1 - the
 > >  views could be materialized in a read-only db, or they could act as
 > >  specifications for a programmatic adapter layer.
 > > 
 > > 8. The GGB sequence shredding idea is used, via the dnafrag 
 > > relation. This
 > >  is necessary for large seqs with mysql. If you're DBMS is happy with
 > >  large seqs, then you still have to support the dnafrag 
 > > relation, but you
 > >  can use a view with virtually no loss in speed.
 > > 
 > > 9. project-centric column names like "chromosome" are avoided; eg
 > >  drosophila has chromosome arms as top level sequences
 > > 
 > > ===========================
 > > 
 > > I have munged all the components into a single file with ifdefs here,
 > > in reality they would be in seperate component files.
 > > 
 > > These are the different builds possible:
 > > 
 > > core - a good choice for all metazoan s. this part should follow
 > > ensembl rather well. assumes that you are doing data management such
 > > that a 2 level assembly is beneficial.
 > > 
 > > smallseq - if either the genome consists of smallish unordered
 > > contigs, or the fully sequence genome has smallish chromosomes.
 > > 
 > > 1-level  - all the features are flattened onto the biggest seq units
 > > 
 > > n-level  - will require extra code to fully utilise this
 > > 
 > > None of the table/colnames are set in stone, this is just to give a
 > > flavour of a possible solution.
 > > 
 > > <ifdef core, 1-level-frag, smallseq>
 > > 
 > > # child seqs (eg clones/contigs) are expected to be
 > > # all on the fwd strand in this example
 > > 
 > > CREATE TABLE assembly (
 > >     assembly_id unsigned NOT NULL PRIMARY KEY auto_increment,
 > >     integer parentseq_id not null,
 > >     FOREIGN KEY parentseq_id REFERENCES seq(seq_id),
 > >     parent_start integer not null,
 > >     parent_end integer not null,
 > >     integer childseq_id not null,
 > >     FOREIGN KEY parentseq_id REFERENCES seq(seq_id),
 > >     child_start integer not null,
 > >     child_end integer not null
 > > );
 > > 
 > > <ifdef>
 > > 
 > > <ifdef n-level>
 > > 
 > > CREATE TABLE assembly_nlevel (
 > >     assembly_id unsigned NOT NULL PRIMARY KEY auto_increment,
 > >     integer parentseq_id not null,
 > >     foreign key parentseq_id references seq(seq_id),
 > >     parent_start integer not null,
 > >     parent_end integer not null,
 > >     integer childseq_id not null,
 > >     foreign key parentseq_id references seq(seq_id),
 > >     child_start integer not null,
 > >     child_end integer not null
 > > );
 > > 
 > > CREATE VIEW assembly AS
 > >   ....  <this is tricky - it depends on whether the level is fixed or
 > >   whether you can have mix and match 1, 2, 3 etc level in one db>
 > > 
 > > <ifdef>
 > > 
 > > <ifdef 1-level>
 > > 
 > > # for most genomes, it makes sense to 'shred' the sequence
 > > 
 > > # if you have a 1-level assembly (ie you have no need of
 > > # an assembly table) but your sequences are too big to
 > > # store directly, eg in mysql, then you will want to
 > > # use this table to store them in smaller chunks
 > > 
 > > # getting subsequences as fast as possible is something
 > > # that is core to all genome annotation databases, so this
 > > # relation is expected; it could be implemented differently,
 > > # see below.
 > > 
 > > # open question: how does the client decide when to use
 > > # dnafrag and when to use the biosequence table? Should
 > > # dnafrag be optional?
 > > 
 > > CREATE TABLE dnafrag (
 > >     integer seq_id not null,
 > >     foreign key seq_id references seq(seq_id),
 > >     integer fstart not null,
 > >     integer fend not null,
 > >     biosequence_str mediumtext not null
 > > );
 > > 
 > > <ifdef>
 > > 
 > > <ifdef core>
 > > 
 > > # use this component if you have a 2 or n level assemblies
 > > # and the top level sequences are too big for your DBMS to
 > > # handle well
 > > 
 > > # note; this is a slow implementation becuase of the
 > > # substring; we could easily do it without
 > > # and just extend the frag to include the full
 > > # child (eg clone) boundaries
 > > 
 > > # open question: can client code assume dna fragments are abutting /
 > > # have no overlap extent
 > > 
 > > # materialize the view for warehouse dbs for faster performance
 > > 
 > > CREATE VIEW dnafrag
 > >  AS SELECT parentseq_id AS seq_id
 > >            substring(sequence.biosequence_str,
 > >                      child_start,
 > >                      child_end) AS biosequence_str,
 > >            parent_start AS fstart,
 > >            child_start  AS fend
 > >  FROM assembly, sequence
 > >  WHERE sequence.sequence_id = assembly.childseq_id;
 > > 
 > > <ifdef>
 > > 
 > > <ifdef smallseq>
 > > 
 > > # if we have either a small genome, or
 > > # a big genome for which there is no assembly,
 > > # only unordered contigs of a small size
 > > # (small defined as whatever is a scalable seq
 > > #  size for your DBMS)
 > > # then it doesn't make sense to 'shred' into
 > > # manageable size pieces, but we should
 > > # support the interface/relation
 > > CREATE VIEW dnafrag
 > >  AS SELECT sequence_id AS seq_id
 > >            biosequence_str AS biosequence_str,
 > >            1 AS fstart,
 > >            seq_length  AS fend
 > >  FROM sequence;
 > > 
 > > <ifdef>
 > > 
 > > <ifdef gff-l2>
 > > 
 > > # lite-clients may want a simple GFF view of
 > > # the world, with everything in a flat coordinate
 > > # system. this view would be used if your features
 > > # were stored in the leaf nodes in your 2-level assembly;
 > > # other views could be made e.g. for features stored
 > > # on mixed levels
 > > 
 > > # this relation is intended to be conformant to
 > > # the GGB fdata relation
 > > 
 > > # this is slightly convoluted because of
 > > # the way sequences/locations work in biosql
 > > 
 > > # note the off-by-ones cancel eachother below
 > > CREATE VIEW fdata
 > >  AS SELECT seqfeature_id     AS fid,
 > >            parententry.accession       AS fref,
 > >            fl.seq_start + (a.parent_start - a.child_start)
 > >                       AS fstart,
 > >            fl.seq_end + (a.parent_start - a.child_start)
 > >                       AS fstop,
 > >            f.seqfeature_key_id   AS ftypeid,
 > >            NULL              AS fscore,
 > >            fl.seq_strand     AS fstrand,
 > >            NULL              AS fphase,
 > >            f.seqfeature_id   AS gid,
 > >            NULL              AS ftarget_start,
 > >            NULL              AS ftarget_stop
 > >     FROM seqfeature f,
 > >          seqfeature_location fl,
 > >          assembly a,
 > >          bioentry childentry,
 > >          bioentry parententry,
 > >          biosequence childseq,
 > >          biosequence parentseq,
 > >     WHERE
 > >           a.childseq_id    = childseq.sequence_id                AND
 > >           childseq.bioentry_id    = childentry.bioentry_id       AND
 > >           a.parentseq_id    = parentseq.sequence_id                AND
 > >           parentseq.bioentry_id    = parententry.bioentry_id       AND
 > >           fl.seqfeature_id = f.seqfeature_id                     AND
 > >           f.bioentry_id    = childentry.bioentry_id;
 > > 
 > > <ifdef>
 > > 
 > > 
 > > 
 > > 
 > 

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY
Positions available at my lab: see http://stein.cshl.org/#hire
========================================================================