[GMOD-devel] Re: [Open-bio-l] Schema for genes & features &mappings to assemblies

Tue, 30 Apr 2002 17:02:31 -0700

I like this proposal, but ...

> 9. project-centric column names like "chromosome" are avoided; eg
>  drosophila has chromosome arms as top level sequences

So, I'm still confused about how I am supposed to store gene predictions, EST, RefSeq, or whatever mappings to chromosomes in an assembly, such that I can answer queries like 'show me all exons of genes and their lines of evidence that map between markers X and Y on chromosome 5 of mouse. Next, show how human genes map to this region, and which human chromosomes.'

Maybe someone can help me lifting my confusion.

How is this done in GMOD and Ensembl, and how does that map to BioSQL with the assembly proposal below?

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

> -----Original Message-----
> From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
> Sent: Wednesday, April 24, 2002 6:31 PM
> To: Lincoln Stein
> Cc: Elia Stupka; Thomas Down; Ewan Birney; Hilmar Lapp; GMOD Devel
> (E-mail); OBDA BioSQL (E-mail)
> Subject: Re: [GMOD-devel] Re: [Open-bio-l] Schema for genes & features
> &mappings to assemblies
> 
> 
> 
> Here is an example of one way of doing things such that we can all
> agree to disagree yet remain one happy family.
> 
> It's not perfect, but I think it's better than the alternative which
> seems to be to solidify a compromise schema which no ones really happy
> with, or force everyone to use overcomplex adapters.
> 
> It's a component-based solution rather than a monolithic one, the SQL
> DDL follows the description below
> 
>  ------
> 
> 1. Definitions (up for debate)
>    1 level assembly - features all stored on top level seqs
>              assembly table may still be useful; eg for getting
>              entry units - or seqfeatures could be used instead,
>              e.g. like GGB
>    2 level assembly - e.g. contigs on a chromosome. unspecified as
>              to whether features live on contigs, or both levels
>    n level assembly - e.g. chroms, contigs, reads. unspecified as
>              to whether features live on mixed levels, and to whether
>              the depth is fixed or variable
> 
> 2. All client code can expect 2 relations to be present: assembly, and
>  dnafrag, defined below.
> 
> 3. Client code can assume 2 level assemblies by default. Adaptors
>  should take care of transformations and/or the lite-client-bridge
>  (see 5 below) can be used
> 
> 4. Client code written expecting flat assemblies (ie ignoring the
>  assembly relation altogether) won't break, but they
>  will display incomplete data (ie missing chrom features from contigs
>  or vice versa) IF the data is stored in a 2 level manner.
> 
> 5. An optional bridge layer is provided for lite-clients that 
> expect all
>  the data to be present in a flat assembly. This layer is sufficient
>  for read-only, but currently not for updates (although it could be
>  extended to do so). This layer is also useful for direct data
>  exploration via SQL.
> 
> 6. n-level assemblies are not assumed as default. Code 
> assuming n-level
>  assemblies will obviously work as n-level subsumes 1/2 level. An
>  n-level assembly component can be used, a view is used so that the
>  core 2-level assembly model is supported, although specialized
>  n-level assembly update code would be required.
> 
> 7. Views are utilized but this doesn't marginalise mysql pre 4.1 - the
>  views could be materialized in a read-only db, or they could act as
>  specifications for a programmatic adapter layer.
> 
> 8. The GGB sequence shredding idea is used, via the dnafrag 
> relation. This
>  is necessary for large seqs with mysql. If you're DBMS is happy with
>  large seqs, then you still have to support the dnafrag 
> relation, but you
>  can use a view with virtually no loss in speed.
> 
> 9. project-centric column names like "chromosome" are avoided; eg
>  drosophila has chromosome arms as top level sequences
> 
> ===========================
> 
> I have munged all the components into a single file with ifdefs here,
> in reality they would be in seperate component files.
> 
> These are the different builds possible:
> 
> core - a good choice for all metazoan s. this part should follow
> ensembl rather well. assumes that you are doing data management such
> that a 2 level assembly is beneficial.
> 
> smallseq - if either the genome consists of smallish unordered
> contigs, or the fully sequence genome has smallish chromosomes.
> 
> 1-level  - all the features are flattened onto the biggest seq units
> 
> n-level  - will require extra code to fully utilise this
> 
> None of the table/colnames are set in stone, this is just to give a
> flavour of a possible solution.
> 
> <ifdef core, 1-level-frag, smallseq>
> 
> # child seqs (eg clones/contigs) are expected to be
> # all on the fwd strand in this example
> 
> CREATE TABLE assembly (
>     assembly_id unsigned NOT NULL PRIMARY KEY auto_increment,
>     integer parentseq_id not null,
>     FOREIGN KEY parentseq_id REFERENCES seq(seq_id),
>     parent_start integer not null,
>     parent_end integer not null,
>     integer childseq_id not null,
>     FOREIGN KEY parentseq_id REFERENCES seq(seq_id),
>     child_start integer not null,
>     child_end integer not null
> );
> 
> <ifdef>
> 
> <ifdef n-level>
> 
> CREATE TABLE assembly_nlevel (
>     assembly_id unsigned NOT NULL PRIMARY KEY auto_increment,
>     integer parentseq_id not null,
>     foreign key parentseq_id references seq(seq_id),
>     parent_start integer not null,
>     parent_end integer not null,
>     integer childseq_id not null,
>     foreign key parentseq_id references seq(seq_id),
>     child_start integer not null,
>     child_end integer not null
> );
> 
> CREATE VIEW assembly AS
>   ....  <this is tricky - it depends on whether the level is fixed or
>   whether you can have mix and match 1, 2, 3 etc level in one db>
> 
> <ifdef>
> 
> <ifdef 1-level>
> 
> # for most genomes, it makes sense to 'shred' the sequence
> 
> # if you have a 1-level assembly (ie you have no need of
> # an assembly table) but your sequences are too big to
> # store directly, eg in mysql, then you will want to
> # use this table to store them in smaller chunks
> 
> # getting subsequences as fast as possible is something
> # that is core to all genome annotation databases, so this
> # relation is expected; it could be implemented differently,
> # see below.
> 
> # open question: how does the client decide when to use
> # dnafrag and when to use the biosequence table? Should
> # dnafrag be optional?
> 
> CREATE TABLE dnafrag (
>     integer seq_id not null,
>     foreign key seq_id references seq(seq_id),
>     integer fstart not null,
>     integer fend not null,
>     biosequence_str mediumtext not null
> );
> 
> <ifdef>
> 
> <ifdef core>
> 
> # use this component if you have a 2 or n level assemblies
> # and the top level sequences are too big for your DBMS to
> # handle well
> 
> # note; this is a slow implementation becuase of the
> # substring; we could easily do it without
> # and just extend the frag to include the full
> # child (eg clone) boundaries
> 
> # open question: can client code assume dna fragments are abutting /
> # have no overlap extent
> 
> # materialize the view for warehouse dbs for faster performance
> 
> CREATE VIEW dnafrag
>  AS SELECT parentseq_id AS seq_id
>            substring(sequence.biosequence_str,
>                      child_start,
>                      child_end) AS biosequence_str,
>            parent_start AS fstart,
>            child_start  AS fend
>  FROM assembly, sequence
>  WHERE sequence.sequence_id = assembly.childseq_id;
> 
> <ifdef>
> 
> <ifdef smallseq>
> 
> # if we have either a small genome, or
> # a big genome for which there is no assembly,
> # only unordered contigs of a small size
> # (small defined as whatever is a scalable seq
> #  size for your DBMS)
> # then it doesn't make sense to 'shred' into
> # manageable size pieces, but we should
> # support the interface/relation
> CREATE VIEW dnafrag
>  AS SELECT sequence_id AS seq_id
>            biosequence_str AS biosequence_str,
>            1 AS fstart,
>            seq_length  AS fend
>  FROM sequence;
> 
> <ifdef>
> 
> <ifdef gff-l2>
> 
> # lite-clients may want a simple GFF view of
> # the world, with everything in a flat coordinate
> # system. this view would be used if your features
> # were stored in the leaf nodes in your 2-level assembly;
> # other views could be made e.g. for features stored
> # on mixed levels
> 
> # this relation is intended to be conformant to
> # the GGB fdata relation
> 
> # this is slightly convoluted because of
> # the way sequences/locations work in biosql
> 
> # note the off-by-ones cancel eachother below
> CREATE VIEW fdata
>  AS SELECT seqfeature_id     AS fid,
>            parententry.accession       AS fref,
>            fl.seq_start + (a.parent_start - a.child_start)
>                       AS fstart,
>            fl.seq_end + (a.parent_start - a.child_start)
>                       AS fstop,
>            f.seqfeature_key_id   AS ftypeid,
>            NULL              AS fscore,
>            fl.seq_strand     AS fstrand,
>            NULL              AS fphase,
>            f.seqfeature_id   AS gid,
>            NULL              AS ftarget_start,
>            NULL              AS ftarget_stop
>     FROM seqfeature f,
>          seqfeature_location fl,
>          assembly a,
>          bioentry childentry,
>          bioentry parententry,
>          biosequence childseq,
>          biosequence parentseq,
>     WHERE
>           a.childseq_id    = childseq.sequence_id                AND
>           childseq.bioentry_id    = childentry.bioentry_id       AND
>           a.parentseq_id    = parentseq.sequence_id                AND
>           parentseq.bioentry_id    = parententry.bioentry_id       AND
>           fl.seqfeature_id = f.seqfeature_id                     AND
>           f.bioentry_id    = childentry.bioentry_id;
> 
> <ifdef>
> 
> 
> 
>