[DAS2] feature locations

Sat Aug 19 02:49:27 UTC 2006

Ed:
> I think all of us this morning, except you,  want
>
> 2) Yes, parent region must encompass all child regions
> 3) Yes, a single segment that encompasses all child regions
> 4) In your example:
>   overlaps(30,40) returns the whole parent and child
>   inside(30,40) returns neither the parent nor the child

That's what I figured was the case.

> The user (client) is responsible for asking for things that make sense.
> For mRNA transcripts and exons, an overlaps query is sensible.

Isn't the client also responsible for making sure the features
make sense?  (Possibly validated in the server.)

In the case which comes up most often - transcripts and exons -
it makes sense that the client give locations to both the
transcript and the exons.  For that feature type doing #3 is right.

I'm not convinced that it's correct for the general case.

> Here is my two cents about the "catalytic site" you talk about....

I can come up with more examples in the protein world.  "Surface
residues".  "S-S bonded residues".  These don't require 3D
structure for visualization.  Eg, I should be able to see
"surface residues" highlighted differently than others even
on a 1D display.  Useful when homology modeling.

> I agree that a "catalytic site" such as you describe requires some
> thought.  But it requires thought from the curator on how to describe
> it, not smartness of the DAS server itself.  If the catalytic site is
> composed of parts of exons on a single mRNA, they should be maybe be put
> into a parent-child relationship.  If different components of the
> catalytic site are on different mRNAs that fold-up and combine into a
> complex compound (like hemoglobin) then the parts that are on different
> mRNAs probably should be treated as different features.  Or even more
> simply, there could be a feature type "catalytic site component" that
> can be a "part of" an exon.

(Naming ambiguity: "treated as different features" or "treated as
different feature groups"?  Per today's discussion I would have them
be different features in the same feature group.)

Well, I was thinking of proteins, and an annotation which is more
properly part of a structural assembly.  To make my objections
less needlessly complex, the site residues can all be on the same
chain.  For that case it still does not make sense to have a parent
feature have a location across all intermediate residues.  If
the two cysteines of an S-S bond are at 22 and 98 then an overlaps
search of (30,50) should not return the S-S bond information.
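
As a rough sketch of that case in Python (the function names are made
up for illustration and are not part of DAS2, and treating coordinates
as inclusive (start, end) pairs is an assumption):

```python
# Hypothetical helper names; coordinates are treated as inclusive (start, end).

def overlaps(feature, query):
    """True if the two inclusive ranges share at least one position."""
    return feature[0] <= query[1] and query[0] <= feature[1]

def inside(feature, query):
    """True if the feature lies entirely within the query range."""
    return query[0] <= feature[0] and feature[1] <= query[1]

# S-S bond between cysteines at residues 22 and 98.
leaves = [(22, 22), (98, 98)]
# Option 3: the parent is one segment covering every child.
parent_span = (min(s for s, _ in leaves), max(e for _, e in leaves))

query = (30, 50)
hit_via_parent = overlaps(parent_span, query)
hit_via_leaves = any(overlaps(leaf, query) for leaf in leaves)

print(parent_span, hit_via_parent, hit_via_leaves)  # (22, 98) True False
```

With the spanning parent the overlaps search reports the bond even
though neither cysteine is in the query range.  Matching on the leaf
locations alone does not.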

Arguing from proteins may be the wrong approach because they are so
small.  Nearly everyone will download everything and not do range
searches on the server.  Perhaps that's why my intuition is leading
me astray....

I've been trying to come up with some more DNA-centric examples.
I really don't know the domain well enough.  What about:

Some genes have multiple promoters.  EPD puts those into a
"promoters group".  See http://www.epd.isb-sib.ch/current/AP.html
for the known cases.  Here are three members from one group

FP   Rn IGF II      E1P1 :+R  EM:X17012.1    1+   18227; 28008.    036*1
FP   Rn IGF II      E2P2 :+S  EM:X17012.1    1+   19978; 25032.137 036*2
FP   Rn IGF II      E3P3+:+S  EM:X17012.1    1+   21966; 25033.155 036*3

The docs at http://www.epd.isb-sib.ch/current/usrman.html say these have
position numbers of 18227, 19978, 21966.

Would it be reasonable to want to annotate this as a "promoters
group" using a single DAS2 feature group?  If so, should the
parent include the portions between the three promoters?
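
To put a number on "the portions between", here is a small Python
sketch.  It assumes each promoter is annotated as a single position
(the position numbers quoted above).  The envelope() helper is
hypothetical, not anything defined by DAS2 or EPD.

```python
# Positions from the three EPD entries above; treating each promoter as a
# single-position feature is an assumption made for this sketch.
promoters = {"E1P1": 18227, "E2P2": 19978, "E3P3": 21966}

def envelope(positions):
    """Smallest single segment (inclusive) covering every position."""
    return min(positions), max(positions)

start, end = envelope(promoters.values())
span_length = end - start + 1            # positions covered by the parent
covered = len(set(promoters.values()))   # positions actually annotated

print((start, end), span_length, covered)  # (18227, 21966) 3740 3
```

Under those assumptions a spanning parent would cover 3740 positions,
of which only 3 are the annotated promoter positions.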

Genbank is notorious for its complex annotations.  I looked for
interesting things (non-gene/CDS/exon/intron records).  Here are a few

The D-loop from a cow's mitochondria
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=27543905
      D-loop          join(15791..16337,1..362)

D-loops appear to be a feature where it does not make sense to have
the parent join the intermediate sequence.
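
A quick Python sketch of why: a parent computed naively as min-to-max
over the two segments of that join() covers nearly everything between
position 1 and the end of the second-to-last kilobase of the record.
The parse_join() helper is made up for this example and only handles
plain a..b ranges, not the full GenBank location syntax.

```python
import re

def parse_join(location):
    """Parse a simple GenBank 'join(a..b,c..d)' string into (start, end) pairs.

    Only plain a..b ranges are handled; complement(), '<', '>' and
    single-base locations are outside this sketch.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

segments = parse_join("join(15791..16337,1..362)")
# Naive spanning parent: lowest start to highest end over all segments.
naive_parent = (min(s for s, _ in segments), max(e for _, e in segments))
# Number of positions the annotation itself actually covers.
annotated = sum(e - s + 1 for s, e in segments)

print(segments, naive_parent, annotated)
# [(15791, 16337), (1, 362)] (1, 16337) 909
```

The annotation covers 909 positions, while the naive parent runs from
1 to 16337.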

The cat mitochondria record (I'm scanning gbmam hence cow and cat) at
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=1098523
has a feature

      misc_feature    join(16315..17009,1..865)
                      /note="control region; CR"

but I can't figure out what that means.

Jumping to another file, here's one from Tobacco leaf curl Japan virus
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=8096283

      stem_loop       join(1..19,2754..2761)

That's a nice structural example.  Strange that it's in two sections.
Perhaps that only works because the first section is terminal?

This example points out a class of RNA and ssDNA annotations on
shape, like pseudoknots, which are essentially structural.

Oh, and then there are functional RNA structures like ribozymes
where you might annotate the functional regions,
but that's back to the realm of the small.

I have managed to convince myself that the difference in viewpoints
is because of a difference in molecular expectations.  DNA really
doesn't do all that much.  It sits there and gets transcribed.
There are some structurally interesting regions but nothing like
what protein has or does. RNA and ssDNA are more interesting, but
they are small.

I did come across a paper titled "DNA supercoiling allows
enhancer action over a large distance" where it was best
to think of the 3D structure of DNA, but that sort of thing
is rare.

How portable should the FEATURE structure from DAS2 be for
2D protein annotations?  In the way I've been thinking of it
it's quite portable.  With this "parent locations must
overlap all children's locations" restriction everything
but the leaf locations will likely be useless blobs in protein
annotations.

> Anyway, that is *my* opinion.  #2 Yes, #3 Yes, and #4 the annotator is
> responsible for being smart.
>
> I can at least see now why you think there might be a problem, but I
> don't agree that it is a problem.

As #3 is trivially computed from the data, the only difference
I can see must be in the results from range searches done on
the server.  I'll write about that some other time.  This
email is long enough.  I'm off to bed.

					Andrew
					dalke at dalkescientific.com