[GMOD-devel] Re: [Open-bio-l] Schema for genes & features & mappings to assemblies

Lincoln Stein lstein@cshl.org
Thu, 25 Apr 2002 14:49:11 -0400


Beautiful.  Should be be targetting PostgreSQL?  MySQL doesn't do views 
(yet).

Lincoln

On Wednesday 24 April 2002 21:30, Chris Mungall wrote:
> Here is an example of one way of doing things such that we can all
> agree to disagree yet remain one happy family.
>
> It's not perfect, but I think it's better than the alternative which
> seems to be to solidify a compromise schema which no ones really happy
> with, or force everyone to use overcomplex adapters.
>
> It's a component-based solution rather than a monolithic one, the SQL
> DDL follows the description below
>
>  ------
>
> 1. Definitions (up for debate)
>    1 level assembly - features all stored on top level seqs
>              assembly table may still be useful; eg for getting
>              entry units - or seqfeatures could be used instead,
>              e.g. like GGB
>    2 level assembly - e.g. contigs on a chromosome. unspecified as
>              to whether features live on contigs, or both levels
>    n level assembly - e.g. chroms, contigs, reads. unspecified as
>              to whether features live on mixed levels, and to whether
>              the depth is fixed or variable
>
> 2. All client code can expect 2 relations to be present: assembly, and
>  dnafrag, defined below.
>
> 3. Client code can assume 2 level assemblies by default. Adaptors
>  should take care of transformations and/or the lite-client-bridge
>  (see 5 below) can be used
>
> 4. Client code written expecting flat assemblies (ie ignoring the
>  assembly relation altogether) won't break, but they
>  will display incomplete data (ie missing chrom features from contigs
>  or vice versa) IF the data is stored in a 2 level manner.
>
> 5. An optional bridge layer is provided for lite-clients that expect all
>  the data to be present in a flat assembly. This layer is sufficient
>  for read-only, but currently not for updates (although it could be
>  extended to do so). This layer is also useful for direct data
>  exploration via SQL.
>
> 6. n-level assemblies are not assumed as default. Code assuming n-level
>  assemblies will obviously work as n-level subsumes 1/2 level. An
>  n-level assembly component can be used, a view is used so that the
>  core 2-level assembly model is supported, although specialized
>  n-level assembly update code would be required.
>
> 7. Views are utilized but this doesn't marginalise mysql pre 4.1 - the
>  views could be materialized in a read-only db, or they could act as
>  specifications for a programmatic adapter layer.
>
> 8. The GGB sequence shredding idea is used, via the dnafrag relation. This
>  is necessary for large seqs with mysql. If you're DBMS is happy with
>  large seqs, then you still have to support the dnafrag relation, but you
>  can use a view with virtually no loss in speed.
>
> 9. project-centric column names like "chromosome" are avoided; eg
>  drosophila has chromosome arms as top level sequences
>
> ===========================
>
> I have munged all the components into a single file with ifdefs here,
> in reality they would be in seperate component files.
>
> These are the different builds possible:
>
> core - a good choice for all metazoan s. this part should follow
> ensembl rather well. assumes that you are doing data management such
> that a 2 level assembly is beneficial.
>
> smallseq - if either the genome consists of smallish unordered
> contigs, or the fully sequence genome has smallish chromosomes.
>
> 1-level  - all the features are flattened onto the biggest seq units
>
> n-level  - will require extra code to fully utilise this
>
> None of the table/colnames are set in stone, this is just to give a
> flavour of a possible solution.
>
> <ifdef core, 1-level-frag, smallseq>
>
> # child seqs (eg clones/contigs) are expected to be
> # all on the fwd strand in this example
>
> CREATE TABLE assembly (
>     assembly_id unsigned NOT NULL PRIMARY KEY auto_increment,
>     integer parentseq_id not null,
>     FOREIGN KEY parentseq_id REFERENCES seq(seq_id),
>     parent_start integer not null,
>     parent_end integer not null,
>     integer childseq_id not null,
>     FOREIGN KEY parentseq_id REFERENCES seq(seq_id),
>     child_start integer not null,
>     child_end integer not null
> );
>
> <ifdef>
>
> <ifdef n-level>
>
> CREATE TABLE assembly_nlevel (
>     assembly_id unsigned NOT NULL PRIMARY KEY auto_increment,
>     integer parentseq_id not null,
>     foreign key parentseq_id references seq(seq_id),
>     parent_start integer not null,
>     parent_end integer not null,
>     integer childseq_id not null,
>     foreign key parentseq_id references seq(seq_id),
>     child_start integer not null,
>     child_end integer not null
> );
>
> CREATE VIEW assembly AS
>   ....  <this is tricky - it depends on whether the level is fixed or
>   whether you can have mix and match 1, 2, 3 etc level in one db>
>
> <ifdef>
>
> <ifdef 1-level>
>
> # for most genomes, it makes sense to 'shred' the sequence
>
> # if you have a 1-level assembly (ie you have no need of
> # an assembly table) but your sequences are too big to
> # store directly, eg in mysql, then you will want to
> # use this table to store them in smaller chunks
>
> # getting subsequences as fast as possible is something
> # that is core to all genome annotation databases, so this
> # relation is expected; it could be implemented differently,
> # see below.
>
> # open question: how does the client decide when to use
> # dnafrag and when to use the biosequence table? Should
> # dnafrag be optional?
>
> CREATE TABLE dnafrag (
>     integer seq_id not null,
>     foreign key seq_id references seq(seq_id),
>     integer fstart not null,
>     integer fend not null,
>     biosequence_str mediumtext not null
> );
>
> <ifdef>
>
> <ifdef core>
>
> # use this component if you have a 2 or n level assemblies
> # and the top level sequences are too big for your DBMS to
> # handle well
>
> # note; this is a slow implementation becuase of the
> # substring; we could easily do it without
> # and just extend the frag to include the full
> # child (eg clone) boundaries
>
> # open question: can client code assume dna fragments are abutting /
> # have no overlap extent
>
> # materialize the view for warehouse dbs for faster performance
>
> CREATE VIEW dnafrag
>  AS SELECT parentseq_id AS seq_id
>            substring(sequence.biosequence_str,
>                      child_start,
>                      child_end) AS biosequence_str,
>            parent_start AS fstart,
>            child_start  AS fend
>  FROM assembly, sequence
>  WHERE sequence.sequence_id = assembly.childseq_id;
>
> <ifdef>
>
> <ifdef smallseq>
>
> # if we have either a small genome, or
> # a big genome for which there is no assembly,
> # only unordered contigs of a small size
> # (small defined as whatever is a scalable seq
> #  size for your DBMS)
> # then it doesn't make sense to 'shred' into
> # manageable size pieces, but we should
> # support the interface/relation
> CREATE VIEW dnafrag
>  AS SELECT sequence_id AS seq_id
>            biosequence_str AS biosequence_str,
>            1 AS fstart,
>            seq_length  AS fend
>  FROM sequence;
>
> <ifdef>
>
> <ifdef gff-l2>
>
> # lite-clients may want a simple GFF view of
> # the world, with everything in a flat coordinate
> # system. this view would be used if your features
> # were stored in the leaf nodes in your 2-level assembly;
> # other views could be made e.g. for features stored
> # on mixed levels
>
> # this relation is intended to be conformant to
> # the GGB fdata relation
>
> # this is slightly convoluted because of
> # the way sequences/locations work in biosql
>
> # note the off-by-ones cancel eachother below
> CREATE VIEW fdata
>  AS SELECT seqfeature_id     AS fid,
>            parententry.accession       AS fref,
>            fl.seq_start + (a.parent_start - a.child_start)
>                       AS fstart,
>            fl.seq_end + (a.parent_start - a.child_start)
>                       AS fstop,
>            f.seqfeature_key_id   AS ftypeid,
>            NULL              AS fscore,
>            fl.seq_strand     AS fstrand,
>            NULL              AS fphase,
>            f.seqfeature_id   AS gid,
>            NULL              AS ftarget_start,
>            NULL              AS ftarget_stop
>     FROM seqfeature f,
>          seqfeature_location fl,
>          assembly a,
>          bioentry childentry,
>          bioentry parententry,
>          biosequence childseq,
>          biosequence parentseq,
>     WHERE
>           a.childseq_id    = childseq.sequence_id                AND
>           childseq.bioentry_id    = childentry.bioentry_id       AND
>           a.parentseq_id    = parentseq.sequence_id                AND
>           parentseq.bioentry_id    = parententry.bioentry_id       AND
>           fl.seqfeature_id = f.seqfeature_id                     AND
>           f.bioentry_id    = childentry.bioentry_id;
>
> <ifdef>
>
>
>
>
> _______________________________________________
> Gmod-devel mailing list
> Gmod-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/gmod-devel