[Biojava-dev] EnsemblApi use case for DNASequences

Thu May 13 12:21:47 UTC 2010

Perhaps if I describe our initial use case and how we hope to address it 
using Ibatis and the BioJava API, I can get some pointers on how much of this 
is already supported in BioJava, how much I will need to implement myself, 
and how best to do that so the resulting code is useful and reusable.

For each genome assembly build, Ensembl stores different levels of DNA 
sequence regions. It calls these coordinate_systems (eg clones, contigs, chromosomes etc).

For each genome assembly there is one 'TopLevel' coordinate_system (eg chromosomes)
and one 'SequenceLevel' coordinate_system (eg contigs).

Each sequence region in the database records its length and coordinate_system, 
BUT ONLY DNA regions which are at 'SequenceLevel' have actual DNA sequence 
recorded. All other regions must have their sequence recovered by 
'projecting' from their level onto DNA regions at 'SequenceLevel'.

so our initial use case is
__________________________ 

1. Retrieve Chromosome 25 for Chicken from the database.

What we get back are some properties (name, coordinateSystemID and length),
which Ibatis maps to an AssembledSequence Object with these properties.

2. Fetch the sequence level assembly details for this Chromosome.

We get back a table mapping from-to coordinates on the chromosome to from-to 
coordinates on the contigs that are at Sequence Level.

Diagrammatically this looks like

<--------------------------------------------------------------> chr25
<-->  <---> <----> <--> <--> <----->     <--> <--> <---->  <---> contigs
       <----->         <-->          <----> <-------> 

you will note that there are
	- overlaps
	- gaps 
	- potentially mismatches (I am ignoring these for the moment)

3. To get the DNA sequence, the ensembl perl api stitches together the contigs into
one 'Sequence', filling gaps with gap sequences of the correct length. It thus generates 
an ordered list of mappings between the chromosome coordinate system and the coordinates 
of contigs and gaps.

<-->  <--->   ---> <-->    > <----->       ->        --->  <---> contigs
           -->         <-->          <---->  -------> 
    nn            n         n       n                    nn      gaps

The perl api can then fetch the actual DNA sequence for any region of the chromosome
by looking up in this projection map which contig regions it needs, and fetching
the sequence for those.
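
To make the stitching step concrete, here is a minimal Java sketch. It is not
the Ensembl perl code and not BioJava API: Mapping and Stitcher are hypothetical
names, coordinates are 1-based and inclusive, only the forward strand is handled,
mismatches are ignored, and where contigs overlap the earlier contig is assumed
to win (the later one is trimmed at its start, as in the diagram above).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical mapping of a chromosome interval onto a contig interval. */
final class Mapping {
    final int srcStart, srcEnd;   // chromosome coordinates
    final String target;          // contig name, or null for a gap
    final int tgtStart, tgtEnd;   // contig coordinates (0,0 for a gap)

    Mapping(int srcStart, int srcEnd, String target, int tgtStart, int tgtEnd) {
        this.srcStart = srcStart;
        this.srcEnd = srcEnd;
        this.target = target;
        this.tgtStart = tgtStart;
        this.tgtEnd = tgtEnd;
    }

    boolean isGap() { return target == null; }
}

final class Stitcher {
    /** Returns a non-overlapping, gap-filled tiling of [1, chrLength]. */
    static List<Mapping> stitch(List<Mapping> raw, int chrLength) {
        List<Mapping> sorted = new ArrayList<>(raw);
        sorted.sort(Comparator.comparingInt((Mapping m) -> m.srcStart));
        List<Mapping> out = new ArrayList<>();
        int next = 1; // first chromosome position not yet covered
        for (Mapping m : sorted) {
            if (m.srcEnd < next) continue;      // fully covered by earlier contigs
            if (m.srcStart > next) {            // uncovered stretch: insert a gap
                out.add(new Mapping(next, m.srcStart - 1, null, 0, 0));
                next = m.srcStart;
            }
            int trim = next - m.srcStart;       // bases already covered (>= 0)
            out.add(new Mapping(next, m.srcEnd, m.target, m.tgtStart + trim, m.tgtEnd));
            next = m.srcEnd + 1;
        }
        if (next <= chrLength) {                // trailing gap up to the chromosome end
            out.add(new Mapping(next, chrLength, null, 0, 0));
        }
        return out;
    }
}
```

Because the result tiles the chromosome without overlaps, fetching a region
only needs the mappings whose chromosome interval intersects that region.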

Remember that chromosomes, contigs and gaps can all be very long, or very short!

Our Java API
____________

I have mirrored what the perl api does:

Fetching a chromosome object - Ibatis instantiates this as an AssembledSequence object, 
which extends the BioJava DNASequence Object, but at this point it only has a few new 
properties set (length, name, coord_system).

Fetching an Assembly Object for this Chromosome Object - this contains an ordered List of Mapping 
Objects, each of which holds a Source (ie the Chromosome), SourceCoordinates, a Target (a new 
DNASequence Object for each contig) and TargetCoordinates.

This Assembly Object can stitch together the Mapping Projection for all or some of the 
Chromosome, just like the perl API, creating a new ordered List of Mapping Objects where 
the TargetCoordinates are altered to remove overlaps, and new GapSequence objects have been
inserted. [Gaps are problematic. Do I really want DNASequence Objects that contain N of 
length x, allowing me to use the Gaps just like any other DNASequence but with all the overhead 
that involves? Should I just omit these mappings? Or do I set the Target to Null in a Mapping,
in which case I will need code to handle these wherever I use sequences that contain null spacers? 
PERHAPS there is some representation to handle Gaps generically in the BioJava API.]
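
One way to picture a gap that behaves like any other sequence without the
storage overhead is an object that records only its length and answers 'N'
for every position. The sketch below uses plain java.lang.CharSequence rather
than any BioJava type; the class name GapSequence is borrowed from the text
above, but this implementation is only an illustration, not existing API.

```java
/** Hypothetical gap: behaves like a run of 'N' but stores only its length. */
final class GapSequence implements CharSequence {
    private final int length;

    GapSequence(int length) {
        if (length < 0) throw new IllegalArgumentException("negative gap length");
        this.length = length;
    }

    @Override
    public int length() { return length; }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return 'N'; // every base in a gap is unknown
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        return new GapSequence(end - start); // still no bases materialised
    }

    /** Materialises the Ns; only call this on short gaps or sub-gaps. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) sb.append('N');
        return sb.toString();
    }
}
```

With this shape a megabase gap costs one int until someone asks for its text,
and code that walks the stitched Mapping list never has to test for null.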

So now I am at the point of fetching actual DNA Sequence for regions of interest on the 
Chromosome. This will involve a look up of the stitched Mapping List for the contig regions 
to retrieve from Ensembl, and then setting the actual DNA sequence in these.

This is where my simplistic extension of DNASequence in the above scenario falls over: Ibatis
requires Bean-style setters to set properties directly on Objects, which I can't work around if 
the DNASequence objects don't allow for setters.

I'm playing with lots of different ideas - possibly the simplest is just to forget about 
extending BioJava DNASequence for my ensembl objects (chromosomes, contigs) 
- and just create DNASequences for the 'real' Sequences that I get back as base strings 
from ensembl, which would then be contained or referenced in my chromosomes/contig objects etc. 
I am sure however that this would mean that I end up having to 
reimplement much of the BioJava functionality in the new model Objects, whereas I was hoping 
to leverage this transparently by simply extending DNASequence.
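
The containment alternative might look roughly like the sketch below: a plain
bean that Ibatis can populate through setters, holding a reference to the
sequence instead of extending it. ChromosomeBean and its property names are
hypothetical, and CharSequence merely stands in for whatever BioJava sequence
type would really be referenced.

```java
/** Hypothetical bean populated by Ibatis through its setters. */
class ChromosomeBean {
    private String name;
    private int coordSystemId;
    private int length;
    private CharSequence bases; // stand-in for a BioJava sequence; null until loaded

    public ChromosomeBean() { }  // no-arg constructor for bean-style instantiation

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getCoordSystemId() { return coordSystemId; }
    public void setCoordSystemId(int coordSystemId) { this.coordSystemId = coordSystemId; }

    public int getLength() { return length; }
    public void setLength(int length) { this.length = length; }

    public CharSequence getBases() { return bases; }
    public void setBases(CharSequence bases) { this.bases = bases; }

    /** True once real sequence has been attached to this region. */
    public boolean isSequenceLoaded() { return bases != null; }
}
```

The cost is the one described above: any sequence behaviour wanted on the bean
has to be delegated to the contained object by hand.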

I guess one of my biggest concerns about extending BioJava to represent very big sequences is 
the potential overhead if I have to instantiate them with backing stores containing the 'real' 
sequences - we are obviously hoping to lazy load (sub)sequences from ensembl when they are actually 
needed. We would have to be very careful to override all the methods that call back to the backing 
store whenever we already have the information we need or can lazy load it, without grabbing the whole sequence
(e.g. the simple case of the chromosome - we have the length from the initial query, so wouldn't want 
to retrieve it from the backing store).

So probably the correct way of doing things is to implement our own SequenceProxyReader for EnsemblAware 
Sequences to handle lazy loads, which also provides all of the required backing store functionality. As
usual the correct way will turn out to be the most work!
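
The lazy-loading idea can be sketched in plain Java as follows. This does not
implement the BioJava reader interface; RegionFetcher and LazyRegionSequence
are hypothetical names, with RegionFetcher standing in for the Ibatis query
that returns bases for a 1-based, inclusive region. The points illustrated are
that the length comes from the initial query and never touches the store, and
that only the requested subregion is fetched.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the database query returning bases start..end. */
interface RegionFetcher {
    String fetch(String regionName, int start, int end);
}

/** Hypothetical sequence that knows its length but loads bases on demand. */
final class LazyRegionSequence {
    private final String name;
    private final int length;                 // known from the initial query
    private final RegionFetcher fetcher;
    private final Map<String, String> cache = new HashMap<>();

    LazyRegionSequence(String name, int length, RegionFetcher fetcher) {
        this.name = name;
        this.length = length;
        this.fetcher = fetcher;
    }

    /** Answered from the stored property; no call to the backing store. */
    int getLength() { return length; }

    /** Fetches only the requested region, caching it by its coordinates. */
    String getSubSequence(int start, int end) {
        if (start < 1 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        String key = start + ":" + end;
        String bases = cache.get(key);
        if (bases == null) {
            bases = fetcher.fetch(name, start, end);
            cache.put(key, bases);
        }
        return bases;
    }
}
```

The cache here is deliberately naive (exact coordinates only); a real reader
would more likely cache per contig, following the stitched Mapping list.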

Cheers Trevor