[Biojava-dev] EnsemblApi use case for DNASequences

Thu May 13 12:38:10 UTC 2010

I have to say that, after working with Ensembl for the past 2 years, hearing that this is how it stores sequence scares **** out of me; you've really hit on the hardest part of the schema there.

As you said at the end of your email, the best way to accomplish this is by creating a SequenceProxyReader which can do all this logic; it lets you work with the "right" objects without having to re-implement that code. This leaves a few alternatives for how you represent the sequence in memory. We already have a 2bit implementation (it will be called TwoBitSequenceReader) for storing very large pieces of Sequence, but it only supports ACGT and has no support for gaps or Ns. This could be extended to bring in support for these as features, or you could materialise that sequence and then push it into another Sequence object I have been working on (not checked in at the moment) which lets you join Sequences together. This, combined with a Sequence which returns Compounds of a particular type (e.g. Ns) for any given length, would let you represent massive amounts of Sequence in a very small amount of space. All of these updates will be in place soon, but I cannot say exactly when.
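
To make the joined-Sequence idea a bit more concrete, here is a rough sketch of how it might look. None of these names (SimpleSequence, StringSequence, SingleCompoundSequence, JoinedSequence) are BioJava classes; they are made up for illustration, and the real objects would work with Compounds rather than chars. The point is only that a run of Ns costs a char and an int, however long it is.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal read-only view of a sequence; NOT the BioJava interface. */
interface SimpleSequence {
    int length();
    char baseAt(int position); // 1-based, as in Ensembl coordinates
}

/** A materialised sequence backed by a String. */
final class StringSequence implements SimpleSequence {
    private final String bases;
    StringSequence(String bases) { this.bases = bases; }
    public int length() { return bases.length(); }
    public char baseAt(int position) { return bases.charAt(position - 1); }
}

/** A run of one repeated compound (e.g. N) that stores only the compound and its length. */
final class SingleCompoundSequence implements SimpleSequence {
    private final char compound;
    private final int length;
    SingleCompoundSequence(char compound, int length) {
        this.compound = compound;
        this.length = length;
    }
    public int length() { return length; }
    public char baseAt(int position) {
        if (position < 1 || position > length) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        return compound;
    }
}

/** Joins several sequences end to end without copying their contents. */
final class JoinedSequence implements SimpleSequence {
    private final List<SimpleSequence> parts = new ArrayList<>();
    private final List<Integer> starts = new ArrayList<>(); // 1-based start of each part
    private int total = 0;

    JoinedSequence(List<? extends SimpleSequence> sequences) {
        for (SimpleSequence s : sequences) {
            starts.add(total + 1);
            parts.add(s);
            total += s.length();
        }
    }
    public int length() { return total; }
    public char baseAt(int position) {
        if (position < 1 || position > total) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        // Binary search for the last part whose start is <= position.
        int lo = 0, hi = parts.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts.get(mid) <= position) lo = mid; else hi = mid - 1;
        }
        return parts.get(lo).baseAt(position - starts.get(lo) + 1);
    }
}
```

Joining "ACGT", a million Ns and "GG" this way gives a sequence of length 1000006 while holding only six real bases in memory.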

The other option would be to cache chunks of the DNA indexed by the seq_region_id. Pushing these into an LRU cache with soft references (so they'll be cleaned up when you'd otherwise run out of memory) could be a good way to go.
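
Something along these lines is what I have in mind for the cache; treat it as a sketch rather than tested production code. SoftLruChunkCache and ChunkLoader are hypothetical names, the chunks are plain Strings, and the loader stands in for whatever query pulls a slice of DNA for a seq_region_id. It uses a java.util.LinkedHashMap in access order for the LRU part and java.lang.ref.SoftReference for the memory-sensitive part.

```java
import java.lang.ref.SoftReference;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache of DNA chunks keyed by seq_region_id and chunk index. */
final class SoftLruChunkCache {

    /** Stand-in for the database call that loads one chunk of sequence. */
    interface ChunkLoader {
        String load(long seqRegionId, int chunkIndex);
    }

    private final ChunkLoader loader;
    private final LinkedHashMap<String, SoftReference<String>> cache;

    SoftLruChunkCache(final int maxEntries, ChunkLoader loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, SoftReference<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, SoftReference<String>> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
    }

    synchronized String getChunk(long seqRegionId, int chunkIndex) {
        String key = seqRegionId + ":" + chunkIndex;
        SoftReference<String> ref = cache.get(key);
        String chunk = (ref == null) ? null : ref.get();
        if (chunk == null) {
            // Either never loaded, evicted by LRU, or cleared by the garbage collector.
            chunk = loader.load(seqRegionId, chunkIndex);
            cache.put(key, new SoftReference<>(chunk));
        }
        return chunk;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

The LRU bound limits how many chunks you hold in the normal case, and the soft references let the JVM reclaim even those if memory gets tight; a cleared chunk is simply reloaded on the next request.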

Either way, the simple way really isn't the way to go IMHO; on the flip side, it would get you to a prototype quicker. Of course this depends on what type of code you are writing. If it is prototype code then great; if it's what normally happens in Bioinformatics (we claim it's a prototype but in reality it's the real deal), then go with the proper solution.

Andy

On 13 May 2010, at 13:21, PATERSON Trevor wrote:

> Perhaps if I describe our initial use case and how we hope to address it 
> using Ibatis and BioJava API, I can get some pointers on how much of this 
> is already supported in BioJava, how much I am going to need to implement 
> and how I would be best doing this to generate useful reusable code.
> 
> For each genome assembly build, Ensembl stores different levels of DNA 
> sequence regions; it calls these coordinate_systems (eg clones, contigs, chromosomes etc).
> 
> For each genome assembly there is one 'TopLevel' coordinate_system (eg chromosomes).
> And one 'SequenceLevel' coordinate_system  (eg contigs). 
> 
> Each sequence region in the database records its length and coordinate_system 
> BUT ONLY DNA regions which are at 'SequenceLevel' have actual DNA Sequence 
> recorded, all other regions must have their actual sequence recovered by 
> 'projecting' from their level to DNA regions at  'SequenceLevel'.
> 
> 
> so our initial use case is
> __________________________ 
> 
> 1. retrieve  Chromosome 25  for Chicken from the database.
> 
> What we get back are some properties (Name, coordinateSystemID and length) 
> - and what we map this to in ibatis is an AssembledSequence Object - with these properties
> 
> 2. fetch the sequence level assembly details for this Chromosome.
> 
> We get back a table mapping from-to coordinates of the chromosome versus from-to 
> coordinates of the contigs that are at Sequence Level
> 
> diagrammatically this looks like
> 
> 
> <--------------------------------------------------------------> chr25
> <-->  <---> <----> <--> <--> <----->     <--> <--> <---->  <---> contigs
>       <----->         <-->          <----> <-------> 
> 
> you will note that there are
> 	- overlaps
> 	- gaps 
> 	- potentially mismatches ( I am ignoring these for the moment)
> 
> 3. to get the DNA sequence, the ensembl perl api stitches together the contigs into
> one 'Sequence' - filling gaps with gap sequences of the correct length, so it generates 
> an ordered list of mappings between the chromosome coordinate system  and coordinates of 
> contigs and gaps
> 
> <-->  <--->   ---> <-->    > <----->       ->        --->  <---> contigs
>           -->         <-->          <---->  -------> 
>    nn            n         n       n                    nn      gaps
> 
> the perl api can then fetch the actual DNA sequence for any region of the chromosome
> by looking up, in this projection map, the contig regions whose sequence it needs 
> to fetch.
> 
> Remember that chromosomes, contigs and gaps can all be very long, or very short!
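
The stitching described in step 3 might be sketched roughly as below. This is only an illustration under some loud assumptions: Segment and Stitcher are hypothetical names, the input mappings are already sorted by chromosome start, everything is on the forward strand, mismatches are ignored, and an overlap is resolved by trimming the start of the later contig (as in the diagram). It is not the Ensembl perl API's actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;

/** One projected segment of the chromosome; contigId == null marks a gap. Hypothetical. */
final class Segment {
    final int chrStart, chrEnd;       // 1-based inclusive chromosome coordinates
    final String contigId;            // null for a gap
    final int contigStart, contigEnd; // 1-based inclusive contig coordinates (0 for a gap)

    Segment(int chrStart, int chrEnd, String contigId, int contigStart, int contigEnd) {
        this.chrStart = chrStart;
        this.chrEnd = chrEnd;
        this.contigId = contigId;
        this.contigStart = contigStart;
        this.contigEnd = contigEnd;
    }

    static Segment gap(int chrStart, int chrEnd) {
        return new Segment(chrStart, chrEnd, null, 0, 0);
    }

    boolean isGap() { return contigId == null; }
}

final class Stitcher {
    /** Returns a gap-filled, non-overlapping projection covering 1..chrLength. */
    static List<Segment> stitch(List<Segment> mappings, int chrLength) {
        List<Segment> out = new ArrayList<>();
        int cursor = 1; // next chromosome base not yet covered
        for (Segment m : mappings) {
            if (m.chrEnd < cursor) {
                continue; // lies entirely within what is already covered
            }
            if (m.chrStart > cursor) {
                out.add(Segment.gap(cursor, m.chrStart - 1)); // fill the hole with a gap
            }
            int trim = Math.max(0, cursor - m.chrStart); // bases overlapping earlier contigs
            out.add(new Segment(m.chrStart + trim, m.chrEnd,
                    m.contigId, m.contigStart + trim, m.contigEnd));
            cursor = m.chrEnd + 1;
        }
        if (cursor <= chrLength) {
            out.add(Segment.gap(cursor, chrLength)); // trailing gap up to the chromosome end
        }
        return out;
    }
}
```

For contigs A (chr 1-10), B (chr 8-15) and C (chr 20-25) on a chromosome of length 30, this yields A 1-10, B trimmed to 11-15, a gap 16-19, C 20-25 and a gap 26-30. Gaps carry only coordinates, so a long gap costs no more than a short one.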
> 
> Our Java API
> ____________
> 
> I have mirrored what the perl api does
> 
> fetching a chromosome object - which Ibatis instantiates as an AssembledSequence object, 
> which extends BioJava DNASequence Object - but obviously just has a couple of new properties 
> set at this time (length, name, coord_system).
> 
> fetching an Assembly Object for this Chromosome Object - this contains an ordered List of Mapping 
> Objects which contain Source (ie the Chromosome), SourceCoordinates, Target (a new DNASequence Object 
> for each contig), TargetCoordinates
> 
> This Assembly Object can stitch together the Mapping Projection for all or some of the 
> Chromosome, just like the perl API, creating a new ordered List of Mapping Objects where 
> the TargetCoordinates are altered to remove overlaps, and new GapSequence objects have been
> inserted. [Gaps are problematic - do I really want DNASequence Objects that contain N of 
> length x, allowing me to use the Gaps just like any other DNASequence but with all the overhead 
> that involves, or should I just omit these mappings, or do I set the Target to Null in a Mapping
> - and then I will need code to handle these wherever I use sequences that contain null spacers - 
> PERHAPS there is some representation to handle Gaps generically in the BioJava API.]
> 
> So now I am at the point of fetching actual DNA Sequence for regions of interest on the 
> Chromosome. This will involve a look up of the stitched Mapping List for the contig regions 
> to retrieve from Ensembl, and then setting the actual DNA sequence in these.
> 
> Hence my simplistic extension of DNA Sequences in the above scenario falls over because of the
> Ibatis Bean requirement for setting properties directly on Objects, which I can't work around if 
> the DNASequence objects don't allow for setters.
> 
> I'm playing with lots of different ideas - possibly the simplest is just to forget about 
> extending BioJava DNASequence for my ensembl objects (chromosomes, contigs) 
> - and just create DNASequences for the 'real' Sequences that I get back as base strings 
> from ensembl, which would then be contained or referenced in my chromosomes/contig objects etc. 
> I am sure however that this would mean that I end up having to 
> reimplement much of the BioJava functionality in the new model Objects, whereas I was hoping 
> to leverage this transparently by simply extending DNASequence.
> 
> I guess one of my biggest concerns about extending BioJava to represent very big sequences is 
> the potential overhead if I have to instantiate them with backing stores containing the 'real' 
> sequences - we are obviously hoping to lazy load (sub)sequences from ensembl when they are actually 
> needed. We would have to be very careful to override all the methods that called back to the backing 
> store if we already had the information we needed or could lazy load it, without grabbing the whole sequence.
> (e.g. the simple case of the chromosome - we have the length from the initial query - so wouldn't want 
> to retrieve it from the backing store).
> 
> So probably the correct way of doing things is to implement our own SequenceProxyReader for EnsemblAware 
> Sequences to handle lazy loads, which also provides all of the required backing store functionality. As
> usual the correct way will turn out to be the most work!
> 
> Cheers Trevor
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/