[DAS] Finding the ADAM2 Gene via Ensembl DAS
Thomas Down
thomas@derkholm.net
Sun, 19 Jan 2003 17:59:35 +0000
Once upon a time, on a computer far far away, Ethan Cerami wrote:
>
> First some overview: if you click on this link:
> http://www.ensembl.org/Homo_sapiens/contigview?highlight=&chr=8&vc_start=38800000&vc_end=39190000&x=0&y=0,
> in the detailed panel on the bottom, you will see two
> known genes, ADAM18 and ADAM2.
>
> I am trying to get this same gene data out of Ensembl
> via DAS. I tried several Ensembl data sources,
> including: ensembl930, ens_ncbi30refseq
> (Ensembl-mapped Human RefSeqs), ens930cds (Ensembl
> CDS). I finally tried ens_ncbi30trans (NCBI
> Transcripts). Here's the query I sent:
>
> http://servlet.sanger.ac.uk:8080/das/ens_ncbi30trans/features?segment=8:38800000,39190000
>
> In the response, I got back 14 features, all named
> ADAM2, but each one is located at a different
> location.
>
> So, my questions:
>
> 1. Am I using the right Ensembl data source?
No, I don't believe that you are. The source you're looking
at is an NCBI genebuild, which I wouldn't expect to match
the Ensembl genebuild.

The core Ensembl data (including gene predictions) is on
/das/ensembl930/. But the query you show above isn't going
to work as-is on this datasource... (see below).
> 2. Why do I get back 14 ADAM2 Genes, instead of just
> one?
One for each exon. The DAS protocol doesn't have any way
to return a single FEATURE element with a non-contiguous
location, so gene structures really have to be returned as
many individual FEATUREs grouped together. I note that
Ensembl actually predicts 13 exons for ADAM2. 14 is close
enough for me -- maybe NCBI managed to map a bit more UTR
in this case.
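
To put the exons back together on the client side, something along these lines should do. It is only a sketch: it assumes the server gives every exon of a transcript the same GROUP id, it skips error handling, and the sample XML (coordinates, segment name, example.org href) is made up for illustration and trimmed to the elements the code reads. The class name ExonGrouper is my own invention too.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ExonGrouper {
    // Made-up sample: two exon FEATUREs sharing one GROUP id.
    static final String SAMPLE =
        "<DASGFF><GFF version=\"1.0\" href=\"http://example.org/das/demo/features\">"
        + "<SEGMENT id=\"demo\" start=\"1\" stop=\"1000\">"
        + "<FEATURE id=\"exon1\" label=\"exon1\"><TYPE id=\"exon\">exon</TYPE>"
        + "<START>100</START><END>200</END><GROUP id=\"ENST00000265708\"/></FEATURE>"
        + "<FEATURE id=\"exon2\" label=\"exon2\"><TYPE id=\"exon\">exon</TYPE>"
        + "<START>300</START><END>450</END><GROUP id=\"ENST00000265708\"/></FEATURE>"
        + "</SEGMENT></GFF></DASGFF>";

    /** Maps each group id to its {start, end} pairs, in document order. */
    static Map<String, List<int[]>> groupExons(String dasgffXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(dasgffXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<int[]>> groups = new LinkedHashMap<>();
        NodeList features = doc.getElementsByTagName("FEATURE");
        for (int i = 0; i < features.getLength(); i++) {
            Element feature = (Element) features.item(i);
            NodeList groupNodes = feature.getElementsByTagName("GROUP");
            // A feature without a GROUP is treated as a group of its own.
            String groupId = groupNodes.getLength() > 0
                ? ((Element) groupNodes.item(0)).getAttribute("id")
                : feature.getAttribute("id");
            int start = Integer.parseInt(text(feature, "START"));
            int end = Integer.parseInt(text(feature, "END"));
            groups.computeIfAbsent(groupId, k -> new ArrayList<>())
                  .add(new int[] {start, end});
        }
        return groups;
    }

    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        for (Map.Entry<String, List<int[]>> e : groupExons(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " exon(s)");
        }
    }
}
```

Each key in the returned map then stands for one gene structure, and the list holds its exons.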
> 3. Why don't I get back the ADAM18 gene?
Don't know. I presume NCBI don't predict it (or, possibly,
put it somewhere else).

The big issue here is actually that DAS servers don't *have*
to provide the annotation you want in chromosomal coordinates.
DAS was designed this way so that annotation could potentially
survive across assembly changes. The Ensembl DAS server actually
chooses to serve gene structures in either contig coordinates
(if the whole gene fits) or else supercontig coordinates. The
forthcoming version drops the supercontigs and just has clone,
contig, and chromosomal coordinates, so this will make life
slightly easier.
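
If you do end up mapping contig positions onto the chromosome by hand, the arithmetic is roughly as follows. This is a sketch under the assumption that you already know where the contig sits in the assembly (its start and end on the chromosome, the first contig base used, and its orientation). The class and method names are hypothetical, and the numbers in main are invented, not real assembly data.

```java
class ContigToChromosome {
    /**
     * Converts a 1-based position on a contig to a 1-based chromosomal position.
     *
     * @param local        position on the contig
     * @param chrStart     first chromosomal base covered by the contig piece
     * @param chrEnd       last chromosomal base covered by the contig piece
     * @param contigStart  first contig base used in the assembly
     * @param forward      true if the contig is placed in forward orientation
     */
    static int toChromosome(int local, int chrStart, int chrEnd,
                            int contigStart, boolean forward) {
        int offset = local - contigStart;
        // On the reverse strand the contig's first base maps to chrEnd.
        return forward ? chrStart + offset : chrEnd - offset;
    }

    public static void main(String[] args) {
        // Invented placement: contig bases 1..1000 cover chromosome 1001..2000.
        System.out.println(toChromosome(250, 1001, 2000, 1, true));
        System.out.println(toChromosome(250, 1001, 2000, 1, false));
    }
}
```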

Secondary issue: the Ensembl DAS server will call the gene
structure ENST00000265708, rather than ADAM2. This is because
the DAS protocol doesn't (to the best of my knowledge) support
synonyms. The Ensembl server uses ENST numbers as the primary
ID, on the basis that these are something consistent which every
single prediction has.
If you actually want to see it directly from Ensembl, try:
http://servlet.sanger.ac.uk:8080/das/ensembl930/features?segment=NT_034911;type=exon
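
For anyone building such requests in code, here is a small sketch of how the URL is put together. The segment, range, and type parts follow the two URLs quoted in this thread. The class and method names are made up for illustration, and nothing here checks that the server actually knows the segment.

```java
class DasUrlSketch {
    /** Features request for a whole segment; pass null as type to get every type. */
    static String featuresUrl(String dataSource, String segment, String type) {
        String base = dataSource.endsWith("/") ? dataSource : dataSource + "/";
        String url = base + "features?segment=" + segment;
        // DAS separates request arguments with ';'.
        return type == null ? url : url + ";type=" + type;
    }

    /** Features request restricted to start..stop on the segment. */
    static String featuresUrl(String dataSource, String segment,
                              int start, int stop, String type) {
        return featuresUrl(dataSource, segment + ":" + start + "," + stop, type);
    }

    public static void main(String[] args) {
        String source = "http://servlet.sanger.ac.uk:8080/das/ensembl930/";
        System.out.println(featuresUrl(source, "NT_034911", "exon"));
        System.out.println(featuresUrl(source, "8", 38800000, 39190000, null));
    }
}
```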
A better bet would be to use some dedicated DAS client code, such
as that included in the BioJava library, to access this data.
This will handle all the sequence assembly issues for you, so you
can do:
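    import java.net.URL;
    import org.biojava.bio.program.das.DASSequenceDB;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.seq.db.SequenceDB;
    import org.biojava.bio.symbol.RangeLocation;

    // Connect to the core Ensembl DAS datasource.
    SequenceDB ensemblDAS = new DASSequenceDB(
        new URL("http://servlet.sanger.ac.uk:8080/das/ensembl930/")
    );
    Sequence chr = ensemblDAS.getSequence("8");
    // Same region as the original query: 8:38800000,39190000.
    FeatureHolder someFeatures = chr.filter(
        new FeatureFilter.OverlapsLocation(
            new RangeLocation(38800000, 39190000)
        )
    );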
And get back what you expect.
Thomas.