[GMOD-devel] Re: [Open-bio-l] Schema for genes & features &mappings to assemblies
Lincoln Stein
lstein@cshl.org
Wed, 1 May 2002 21:59:09 -0400
Nothing is wrong with that!
Lincoln
Hilmar Lapp writes:
> So Ensembl does have a Chromosome table and does store explicitly how contigs map to chromosomes position-wise. What's bad about that?
>
> -hilmar
> --
> -------------------------------------------------------------
> Hilmar Lapp email: lapp@gnf.org
> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> -------------------------------------------------------------
>
>
>
> > -----Original Message-----
> > From: Hilmar Lapp
> > Sent: Tuesday, April 30, 2002 5:03 PM
> > To: 'Chris Mungall'; Lincoln Stein
> > Cc: Elia Stupka; Thomas Down; Ewan Birney; GMOD Devel (E-mail); OBDA
> > BioSQL (E-mail)
> > Subject: RE: [GMOD-devel] Re: [Open-bio-l] Schema for genes & features
> > &mappings to assemblies
> >
> >
> > I like this proposal, but ...
> >
> > > 9. project-centric column names like "chromosome" are avoided; eg
> > > drosophila has chromosome arms as top level sequences
> >
> > So, I'm still confused about how I am supposed to store gene
> > predictions, EST, RefSeq, or whatever mappings to chromosomes
> > in an assembly, such that I can answer queries like 'show me
> > all exons of genes and their lines of evidence that map
> > between markers X and Y on chromosome 5 of mouse. Next, show
> > how human genes map to this region, and which human chromosomes.'
> >
> > Maybe someone can help me clear up my confusion.
> >
> > How is this done in GMOD and Ensembl, and how does that map
> > to BioSQL with the assembly proposal below?
> >
> > -hilmar
> > --
> > -------------------------------------------------------------
> > Hilmar Lapp email: lapp@gnf.org
> > GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> > -------------------------------------------------------------
> >
> >
> >
> > > -----Original Message-----
> > > From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
> > > Sent: Wednesday, April 24, 2002 6:31 PM
> > > To: Lincoln Stein
> > > Cc: Elia Stupka; Thomas Down; Ewan Birney; Hilmar Lapp; GMOD Devel
> > > (E-mail); OBDA BioSQL (E-mail)
> > > Subject: Re: [GMOD-devel] Re: [Open-bio-l] Schema for genes & features
> > > &mappings to assemblies
> > >
> > >
> > >
> > > Here is an example of one way of doing things such that we can all
> > > agree to disagree yet remain one happy family.
> > >
> > > It's not perfect, but I think it's better than the alternative, which
> > > seems to be to solidify a compromise schema that no one is really happy
> > > with, or to force everyone to use overcomplex adapters.
> > >
> > > It's a component-based solution rather than a monolithic one. The SQL
> > > DDL follows the description below.
> > >
> > > ------
> > >
> > > 1. Definitions (up for debate)
> > >    1 level assembly - features all stored on top level seqs;
> > >       assembly table may still be useful, eg for getting
> > >       entry units - or seqfeatures could be used instead,
> > >       e.g. like GGB
> > >    2 level assembly - e.g. contigs on a chromosome. unspecified as
> > >       to whether features live on contigs, or both levels
> > >    n level assembly - e.g. chroms, contigs, reads. unspecified as
> > >       to whether features live on mixed levels, and to whether
> > >       the depth is fixed or variable
> > >
> > > 2. All client code can expect 2 relations to be present: assembly and
> > > dnafrag, defined below.
> > >
> > > 3. Client code can assume 2 level assemblies by default. Adaptors
> > > should take care of transformations and/or the lite-client-bridge
> > > (see 5 below) can be used
> > >
> > > 4. Client code written expecting flat assemblies (ie ignoring the
> > > assembly relation altogether) won't break, but it will display
> > > incomplete data (ie missing chrom features from contigs or vice
> > > versa) IF the data is stored in a 2 level manner.
> > >
> > > 5. An optional bridge layer is provided for lite-clients that expect all
> > > the data to be present in a flat assembly. This layer is sufficient
> > > for read-only, but currently not for updates (although it could be
> > > extended to do so). This layer is also useful for direct data
> > > exploration via SQL.
> > >
> > > 6. n-level assemblies are not assumed as default. Code assuming n-level
> > > assemblies will obviously work, as n-level subsumes 1/2 level. An
> > > n-level assembly component can be used; a view ensures that the
> > > core 2-level assembly model is still supported, although specialized
> > > n-level assembly update code would be required.
> > >
> > > 7. Views are utilized but this doesn't marginalise mysql pre 4.1 - the
> > > views could be materialized in a read-only db, or they could act as
> > > specifications for a programmatic adapter layer.
> > >
> > > 8. The GGB sequence shredding idea is used, via the dnafrag relation.
> > > This is necessary for large seqs with mysql. If your DBMS is happy with
> > > large seqs, then you still have to support the dnafrag relation, but you
> > > can use a view with virtually no loss in speed.
> > >
> > > 9. project-centric column names like "chromosome" are avoided; eg
> > > drosophila has chromosome arms as top level sequences
> > >
> > > ===========================
> > >
> > > I have munged all the components into a single file with ifdefs here;
> > > in reality they would be in separate component files.
> > >
> > > These are the different builds possible:
> > >
> > > core - a good choice for all metazoans. this part should follow
> > > ensembl rather well. assumes that you are doing data management such
> > > that a 2 level assembly is beneficial.
> > >
> > > smallseq - if either the genome consists of smallish unordered
> > > contigs, or the fully sequenced genome has smallish chromosomes.
> > >
> > > 1-level - all the features are flattened onto the biggest seq units
> > >
> > > n-level - will require extra code to fully utilise this
> > >
> > > None of the table/colnames are set in stone, this is just to give a
> > > flavour of a possible solution.
> > >
> > > <ifdef core, 1-level-frag, smallseq>
> > >
> > > # child seqs (eg clones/contigs) are expected to be
> > > # all on the fwd strand in this example
> > >
> > > CREATE TABLE assembly (
> > >     assembly_id  integer unsigned NOT NULL PRIMARY KEY auto_increment,
> > >     parentseq_id integer not null,
> > >     FOREIGN KEY (parentseq_id) REFERENCES seq(seq_id),
> > >     parent_start integer not null,
> > >     parent_end   integer not null,
> > >     childseq_id  integer not null,
> > >     FOREIGN KEY (childseq_id) REFERENCES seq(seq_id),
> > >     child_start  integer not null,
> > >     child_end    integer not null
> > > );
> > >
> > > <ifdef>
> > >
> > > <ifdef n-level>
> > >
> > > CREATE TABLE assembly_nlevel (
> > >     assembly_id  integer unsigned NOT NULL PRIMARY KEY auto_increment,
> > >     parentseq_id integer not null,
> > >     FOREIGN KEY (parentseq_id) REFERENCES seq(seq_id),
> > >     parent_start integer not null,
> > >     parent_end   integer not null,
> > >     childseq_id  integer not null,
> > >     FOREIGN KEY (childseq_id) REFERENCES seq(seq_id),
> > >     child_start  integer not null,
> > >     child_end    integer not null
> > > );
> > >
> > > CREATE VIEW assembly AS
> > > .... <this is tricky - it depends on whether the level is fixed or
> > > whether you can have mix and match 1, 2, 3 etc level in one db>
> > >
> > > <ifdef>
> > >
> > > <ifdef 1-level>
> > >
> > > # for most genomes, it makes sense to 'shred' the sequence
> > >
> > > # if you have a 1-level assembly (ie you have no need of
> > > # an assembly table) but your sequences are too big to
> > > # store directly, eg in mysql, then you will want to
> > > # use this table to store them in smaller chunks
> > >
> > > # getting subsequences as fast as possible is something
> > > # that is core to all genome annotation databases, so this
> > > # relation is expected; it could be implemented differently,
> > > # see below.
> > >
> > > # open question: how does the client decide when to use
> > > # dnafrag and when to use the biosequence table? Should
> > > # dnafrag be optional?
> > >
> > > CREATE TABLE dnafrag (
> > >     seq_id          integer not null,
> > >     FOREIGN KEY (seq_id) REFERENCES seq(seq_id),
> > >     fstart          integer not null,
> > >     fend            integer not null,
> > >     biosequence_str mediumtext not null
> > > );
> > >
> > > <ifdef>
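
A minimal sketch of how a client might read a subsequence back out of a shredded dnafrag relation. It uses Python with an in-memory SQLite database purely for illustration. The chunk size, the helper names shred and subseq, and the assumption that fragments are abutting, non-overlapping and 1-based inclusive are all hypothetical and not part of the proposal (the thread leaves the abutting/overlap question open).

```python
import sqlite3

FRAG_SIZE = 10  # hypothetical chunk size; a real database would use far larger chunks


def shred(conn, seq_id, sequence, frag_size=FRAG_SIZE):
    """Store `sequence` as abutting fragments with 1-based inclusive coordinates."""
    for offset in range(0, len(sequence), frag_size):
        chunk = sequence[offset:offset + frag_size]
        conn.execute(
            "INSERT INTO dnafrag (seq_id, fstart, fend, biosequence_str) "
            "VALUES (?, ?, ?, ?)",
            (seq_id, offset + 1, offset + len(chunk), chunk),
        )


def subseq(conn, seq_id, start, end):
    """Return bases start..end (1-based, inclusive) by stitching overlapping fragments."""
    rows = conn.execute(
        "SELECT fstart, biosequence_str FROM dnafrag "
        "WHERE seq_id = ? AND fend >= ? AND fstart <= ? "
        "ORDER BY fstart",
        (seq_id, start, end),
    ).fetchall()
    if not rows:
        return ""
    first_start = rows[0][0]
    joined = "".join(chunk for _, chunk in rows)
    # Trim the stitched fragments down to the requested window.
    return joined[start - first_start:end - first_start + 1]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dnafrag ("
    " seq_id INTEGER NOT NULL,"
    " fstart INTEGER NOT NULL,"
    " fend INTEGER NOT NULL,"
    " biosequence_str TEXT NOT NULL)"
)
genome = "ACGTACGTAAGGCCTTACGTTTGACA"
shred(conn, 1, genome)
window = subseq(conn, 1, 8, 23)  # spans all three stored fragments
```

Only fragments that overlap the requested window are fetched, which is the point of shredding: the client never has to pull a whole chromosome-sized string to read a few hundred bases.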
> > >
> > > <ifdef core>
> > >
> > > # use this component if you have a 2 or n level assemblies
> > > # and the top level sequences are too big for your DBMS to
> > > # handle well
> > >
> > > # note: this is a slow implementation because of the
> > > # substring; we could easily do it without
> > > # and just extend the frag to include the full
> > > # child (eg clone) boundaries
> > >
> > > # open question: can client code assume dna fragments are abutting /
> > > # have no overlap extent
> > >
> > > # materialize the view for warehouse dbs for faster performance
> > >
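> > > # substring() takes a length as its third argument, not an end
> > > # position; the fragment spans parent_start..parent_end on the parent
> > > CREATE VIEW dnafrag
> > >   AS SELECT parentseq_id AS seq_id,
> > >             substring(sequence.biosequence_str,
> > >                       child_start,
> > >                       child_end - child_start + 1) AS biosequence_str,
> > >             parent_start AS fstart,
> > >             parent_end AS fend
> > >   FROM assembly, sequence
> > >   WHERE sequence.sequence_id = assembly.childseq_id;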
> > >
> > > <ifdef>
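
A small runnable sketch of the view-based dnafrag idea, using Python's sqlite3 module for illustration. It assumes each fragment should cover child_start..child_end of the child sequence and be reported at parent_start..parent_end on the parent. SQLite's substr() stands in for substring() and, like it, takes a length as its third argument. The table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    sequence_id     INTEGER PRIMARY KEY,
    biosequence_str TEXT NOT NULL
);
CREATE TABLE assembly (
    assembly_id  INTEGER PRIMARY KEY,
    parentseq_id INTEGER NOT NULL,
    parent_start INTEGER NOT NULL,
    parent_end   INTEGER NOT NULL,
    childseq_id  INTEGER NOT NULL,
    child_start  INTEGER NOT NULL,
    child_end    INTEGER NOT NULL
);
-- dnafrag as a view: one fragment per assembled child piece
CREATE VIEW dnafrag AS
SELECT a.parentseq_id AS seq_id,
       substr(s.biosequence_str,
              a.child_start,
              a.child_end - a.child_start + 1) AS biosequence_str,
       a.parent_start AS fstart,
       a.parent_end   AS fend
FROM assembly a
JOIN sequence s ON s.sequence_id = a.childseq_id;
""")

# Two made-up contigs (ids 2 and 3) tiled onto parent sequence 1.
conn.execute("INSERT INTO sequence VALUES (2, 'NNACGTACGTNN')")
conn.execute("INSERT INTO sequence VALUES (3, 'TTGGCCAA')")
conn.execute("INSERT INTO assembly VALUES (1, 1, 101, 108, 2, 3, 10)")
conn.execute("INSERT INTO assembly VALUES (2, 1, 109, 116, 3, 1, 8)")

frags = conn.execute(
    "SELECT seq_id, fstart, fend, biosequence_str FROM dnafrag ORDER BY fstart"
).fetchall()
```

With these rows, each fragment's string length equals fend - fstart + 1, which is a cheap consistency check a loader could run before trusting the view.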
> > >
> > > <ifdef smallseq>
> > >
> > > # if we have either a small genome, or
> > > # a big genome for which there is no assembly,
> > > # only unordered contigs of a small size
> > > # (small defined as whatever is a scalable seq
> > > # size for your DBMS)
> > > # then it doesn't make sense to 'shred' into
> > > # manageable size pieces, but we should
> > > # support the interface/relation
> > > CREATE VIEW dnafrag
> > >   AS SELECT sequence_id AS seq_id,
> > >             biosequence_str AS biosequence_str,
> > >             1 AS fstart,
> > >             seq_length AS fend
> > >   FROM sequence;
> > >
> > > <ifdef>
> > >
> > > <ifdef gff-l2>
> > >
> > > # lite-clients may want a simple GFF view of
> > > # the world, with everything in a flat coordinate
> > > # system. this view would be used if your features
> > > # were stored in the leaf nodes in your 2-level assembly;
> > > # other views could be made e.g. for features stored
> > > # on mixed levels
> > >
> > > # this relation is intended to be conformant to
> > > # the GGB fdata relation
> > >
> > > # this is slightly convoluted because of
> > > # the way sequences/locations work in biosql
> > >
> > > # note the off-by-ones cancel each other below
> > > CREATE VIEW fdata
> > >   AS SELECT seqfeature_id AS fid,
> > >             parententry.accession AS fref,
> > >             fl.seq_start + (a.parent_start - a.child_start)
> > >               AS fstart,
> > >             fl.seq_end + (a.parent_start - a.child_start)
> > >               AS fstop,
> > >             f.seqfeature_key_id AS ftypeid,
> > >             NULL AS fscore,
> > >             fl.seq_strand AS fstrand,
> > >             NULL AS fphase,
> > >             f.seqfeature_id AS gid,
> > >             NULL AS ftarget_start,
> > >             NULL AS ftarget_stop
> > >   FROM seqfeature f,
> > >        seqfeature_location fl,
> > >        assembly a,
> > >        bioentry childentry,
> > >        bioentry parententry,
> > >        biosequence childseq,
> > >        biosequence parentseq
> > >   WHERE
> > >     a.childseq_id = childseq.sequence_id AND
> > >     childseq.bioentry_id = childentry.bioentry_id AND
> > >     a.parentseq_id = parentseq.sequence_id AND
> > >     parentseq.bioentry_id = parententry.bioentry_id AND
> > >     fl.seqfeature_id = f.seqfeature_id AND
> > >     f.bioentry_id = childentry.bioentry_id;
> > >
> > > <ifdef>
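
A sketch of the coordinate arithmetic behind the flat view, again in Python with sqlite3 for illustration only. The feature table is a hypothetical stand-in for the seqfeature/seqfeature_location join, flat_feature is a cut-down stand-in for fdata, and the rows are invented. The offset (parent_start - child_start) moves a child-relative position onto the parent, which is why the off-by-ones cancel.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assembly (
    assembly_id  INTEGER PRIMARY KEY,
    parentseq_id INTEGER NOT NULL,
    parent_start INTEGER NOT NULL,
    parent_end   INTEGER NOT NULL,
    childseq_id  INTEGER NOT NULL,
    child_start  INTEGER NOT NULL,
    child_end    INTEGER NOT NULL
);
-- hypothetical simplification of seqfeature + seqfeature_location
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    seq_id     INTEGER NOT NULL,
    seq_start  INTEGER NOT NULL,
    seq_end    INTEGER NOT NULL
);
-- features stored on child seqs, reported in parent coordinates
CREATE VIEW flat_feature AS
SELECT f.feature_id   AS fid,
       a.parentseq_id AS fref,
       f.seq_start + (a.parent_start - a.child_start) AS fstart,
       f.seq_end   + (a.parent_start - a.child_start) AS fstop
FROM feature f
JOIN assembly a ON a.childseq_id = f.seq_id;
""")

# Contig 2 (bases 1..500) sits at 1001..1500 on parent sequence 1.
conn.execute("INSERT INTO assembly VALUES (1, 1, 1001, 1500, 2, 1, 500)")
conn.execute("INSERT INTO feature VALUES (10, 2, 50, 120)")
conn.execute("INSERT INTO feature VALUES (11, 2, 400, 450)")

flat = conn.execute(
    "SELECT fid, fref, fstart, fstop FROM flat_feature ORDER BY fstart"
).fetchall()

# Range query in parent coordinates, e.g. everything between two marker positions.
in_range = conn.execute(
    "SELECT fid FROM flat_feature WHERE fref = ? AND fstop >= ? AND fstart <= ?",
    (1, 1000, 1200),
).fetchall()
```

A query of the "between markers X and Y" kind then reduces to a plain range filter on the flat view once the marker positions are known in parent coordinates.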
> > >
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
Positions available at my lab: see http://stein.cshl.org/#hire
========================================================================