[GMOD-devel] Re: [Open-bio-l] Schema for genes & features & mappings to assemblies

Chris Mungall cjm@bdgp.lbl.gov
Wed, 24 Apr 2002 18:30:30 -0700 (PDT)


Here is an example of one way of doing things such that we can all
agree to disagree yet remain one happy family.

It's not perfect, but I think it's better than the alternative which
seems to be to solidify a compromise schema which no ones really happy
with, or force everyone to use overcomplex adapters.

It's a component-based solution rather than a monolithic one, the SQL
DDL follows the description below

 ------

1. Definitions (up for debate)
   1 level assembly - features all stored on top level seqs
             assembly table may still be useful; eg for getting
             entry units - or seqfeatures could be used instead,
             e.g. like GGB
   2 level assembly - e.g. contigs on a chromosome. unspecified as
             to whether features live on contigs, or both levels
   n level assembly - e.g. chroms, contigs, reads. unspecified as
             to whether features live on mixed levels, and to whether
             the depth is fixed or variable

2. All client code can expect 2 relations to be present: assembly, and
 dnafrag, defined below.

3. Client code can assume 2 level assemblies by default. Adaptors
 should take care of transformations and/or the lite-client-bridge
 (see 5 below) can be used

4. Client code written expecting flat assemblies (ie ignoring the
 assembly relation altogether) won't break, but they
 will display incomplete data (ie missing chrom features from contigs
 or vice versa) IF the data is stored in a 2 level manner.

5. An optional bridge layer is provided for lite-clients that expect all
 the data to be present in a flat assembly. This layer is sufficient
 for read-only, but currently not for updates (although it could be
 extended to do so). This layer is also useful for direct data
 exploration via SQL.

6. n-level assemblies are not assumed as default. Code assuming n-level
 assemblies will obviously work as n-level subsumes 1/2 level. An
 n-level assembly component can be used, a view is used so that the
 core 2-level assembly model is supported, although specialized
 n-level assembly update code would be required.

7. Views are utilized but this doesn't marginalise mysql pre 4.1 - the
 views could be materialized in a read-only db, or they could act as
 specifications for a programmatic adapter layer.

8. The GGB sequence shredding idea is used, via the dnafrag relation. This
 is necessary for large seqs with mysql. If you're DBMS is happy with
 large seqs, then you still have to support the dnafrag relation, but you
 can use a view with virtually no loss in speed.

9. project-centric column names like "chromosome" are avoided; eg
 drosophila has chromosome arms as top level sequences

===========================

I have munged all the components into a single file with ifdefs here,
in reality they would be in seperate component files.

These are the different builds possible:

core - a good choice for all metazoan s. this part should follow
ensembl rather well. assumes that you are doing data management such
that a 2 level assembly is beneficial.

smallseq - if either the genome consists of smallish unordered
contigs, or the fully sequence genome has smallish chromosomes.

1-level  - all the features are flattened onto the biggest seq units

n-level  - will require extra code to fully utilise this

None of the table/colnames are set in stone, this is just to give a
flavour of a possible solution.

<ifdef core, 1-level-frag, smallseq>

# child seqs (eg clones/contigs) are expected to be
# all on the fwd strand in this example

CREATE TABLE assembly (
    assembly_id unsigned NOT NULL PRIMARY KEY auto_increment,
    integer parentseq_id not null,
    FOREIGN KEY parentseq_id REFERENCES seq(seq_id),
    parent_start integer not null,
    parent_end integer not null,
    integer childseq_id not null,
    FOREIGN KEY parentseq_id REFERENCES seq(seq_id),
    child_start integer not null,
    child_end integer not null
);

<ifdef>

<ifdef n-level>

CREATE TABLE assembly_nlevel (
    assembly_id unsigned NOT NULL PRIMARY KEY auto_increment,
    integer parentseq_id not null,
    foreign key parentseq_id references seq(seq_id),
    parent_start integer not null,
    parent_end integer not null,
    integer childseq_id not null,
    foreign key parentseq_id references seq(seq_id),
    child_start integer not null,
    child_end integer not null
);

CREATE VIEW assembly AS
  ....  <this is tricky - it depends on whether the level is fixed or
  whether you can have mix and match 1, 2, 3 etc level in one db>

<ifdef>

<ifdef 1-level>

# for most genomes, it makes sense to 'shred' the sequence

# if you have a 1-level assembly (ie you have no need of
# an assembly table) but your sequences are too big to
# store directly, eg in mysql, then you will want to
# use this table to store them in smaller chunks

# getting subsequences as fast as possible is something
# that is core to all genome annotation databases, so this
# relation is expected; it could be implemented differently,
# see below.

# open question: how does the client decide when to use
# dnafrag and when to use the biosequence table? Should
# dnafrag be optional?

CREATE TABLE dnafrag (
    integer seq_id not null,
    foreign key seq_id references seq(seq_id),
    integer fstart not null,
    integer fend not null,
    biosequence_str mediumtext not null
);

<ifdef>

<ifdef core>

# use this component if you have a 2 or n level assemblies
# and the top level sequences are too big for your DBMS to
# handle well

# note; this is a slow implementation becuase of the
# substring; we could easily do it without
# and just extend the frag to include the full
# child (eg clone) boundaries

# open question: can client code assume dna fragments are abutting /
# have no overlap extent

# materialize the view for warehouse dbs for faster performance

CREATE VIEW dnafrag
 AS SELECT parentseq_id AS seq_id
           substring(sequence.biosequence_str,
                     child_start,
                     child_end) AS biosequence_str,
           parent_start AS fstart,
           child_start  AS fend
 FROM assembly, sequence
 WHERE sequence.sequence_id = assembly.childseq_id;

<ifdef>

<ifdef smallseq>

# if we have either a small genome, or
# a big genome for which there is no assembly,
# only unordered contigs of a small size
# (small defined as whatever is a scalable seq
#  size for your DBMS)
# then it doesn't make sense to 'shred' into
# manageable size pieces, but we should
# support the interface/relation
CREATE VIEW dnafrag
 AS SELECT sequence_id AS seq_id
           biosequence_str AS biosequence_str,
           1 AS fstart,
           seq_length  AS fend
 FROM sequence;

<ifdef>

<ifdef gff-l2>

# lite-clients may want a simple GFF view of
# the world, with everything in a flat coordinate
# system. this view would be used if your features
# were stored in the leaf nodes in your 2-level assembly;
# other views could be made e.g. for features stored
# on mixed levels

# this relation is intended to be conformant to
# the GGB fdata relation

# this is slightly convoluted because of
# the way sequences/locations work in biosql

# note the off-by-ones cancel eachother below
CREATE VIEW fdata
 AS SELECT seqfeature_id     AS fid,
           parententry.accession       AS fref,
           fl.seq_start + (a.parent_start - a.child_start)
                      AS fstart,
           fl.seq_end + (a.parent_start - a.child_start)
                      AS fstop,
           f.seqfeature_key_id   AS ftypeid,
           NULL              AS fscore,
           fl.seq_strand     AS fstrand,
           NULL              AS fphase,
           f.seqfeature_id   AS gid,
           NULL              AS ftarget_start,
           NULL              AS ftarget_stop
    FROM seqfeature f,
         seqfeature_location fl,
         assembly a,
         bioentry childentry,
         bioentry parententry,
         biosequence childseq,
         biosequence parentseq,
    WHERE
          a.childseq_id    = childseq.sequence_id                AND
          childseq.bioentry_id    = childentry.bioentry_id       AND
          a.parentseq_id    = parentseq.sequence_id                AND
          parentseq.bioentry_id    = parententry.bioentry_id       AND
          fl.seqfeature_id = f.seqfeature_id                     AND
          f.bioentry_id    = childentry.bioentry_id;

<ifdef>