[Biojava-dev] EnsemblApi use case for DNASequences

Thu May 13 13:35:15 UTC 2010

Trevor

I agree that going the ProxySequenceReader sounds like the best approach. If you look at SequenceFileProxyLoader in the source this is simple example I put in place where it is used to load large fasta files where you can look at the main method in FastaReader for example of how it is used. I parse the file to get the accession id and corresponding sequence offset in the file. This creates a sequence proxy that has all the details to get the actual sequence data if a call is ever made. This is a lazy load example where the sequence will get pulled into memory when a request is made for actual data. This way you can load very large Fasta files without a memory penalty. The other use case as you describe is to never load the full sequence into memory. You can come up with your own optimization steps where you can keep various sub-sequences as they are requested in memory and clear them from the cache based on a buffer size or a length of time since that region was last accessed. It all depends on your use case. If you are going to iterate through the sequence one position at a time then you would implement a read buffer for sequence data to avoid the overhead of multiple calls to the database. 

What we haven't had a chance to do yet is put examples in place that pull down features for a sequence. The following is what we have been kicking around as a design concept but lots of details to work out. You should be able to create sequences with the following examples.

DNASequence ncbiDNASequence = new DNASequence(new NCBISequenceReader("NC_000012.9")); 

with the ability to do this with no memory penalty or load time problem.

new DNASequence(new NCBISequenceReader("NC_000013.9"));
new DNASequence(new NCBISequenceReader("NC_000014.9"));
new DNASequence(new NCBISequenceReader("NC_000015.9"));
new DNASequence(new NCBISequenceReader("NC_000016.9"));

This would also be a use case

ProteinSequence rxraProteinSequence = new ProteinSequence(new UniprotSequenceReader("RXRA_HUMAN") );

The calls to the SequenceReader would know how to query the necessary details about the sequence as it relates to features etc. In the NCBISequenceReader example for the DNASequence set all the GeneSequences, TranscriptSequences, CDSSequences etc. where you delay loading sequence data as long as possible. If I wanted to work with promotor regions I could get the corresponding GeneSequence and then retrieve 5000 nucleotides in the 5' and 3' flanking region of the gene or grab all the sequence data located in the intron regions of the gene. Each request for sequence data would send the request to NCBI via the appropriate api(webservice,rest,url,toolkit) or in your example to the database. 

In the UniProtSequenceReader example I would want the following call to work

GeneSequence rxraGeneSequence = rxraProteinSequence.getGeneSequence() where UniprotSequenceReader based on knowing everything about RXRA_HUMAN would create a GeneSequence with a call to GeneSequence(new  NCBISequenceReader(rxra_gene_index)) and return a GeneSequence.

In the biojava3-genome package you can take a look at GeneFeatureHelper as the code I am using to take GFF/GTF/GFF3 features and create GeneSequences. Welcome any input or suggestion on method/class names etc. Andy and I bounced around how to model the relationships and would benefit from input from others. 

For your IBATIS example would the following work 

public class SequenceCreator{

	public SequenceCreator(){

	}
	public DNASequence getDNASequence(SequenceReaderProxy sequenceReaderProxy){
		return new DNASequence(sequenceReaderProxy);
	}

}

Thanks

Scooter

On May 13, 2010, at 8:44 AM, PATERSON Trevor wrote:

> That all sounds very useful - I'll get started now :)
> 
>> -----Original Message-----
>> From: Andy Yates [mailto:andyyatz at gmail.com] On Behalf Of Andy Yates
>> Sent: 13 May 2010 13:38
>> To: PATERSON Trevor
>> Cc: biojava-dev at lists.open-bio.org
>> Subject: Re: [Biojava-dev] EnsemblApi use case for DNASequences
>> 
>> I have to say that from working with Ensembl for the past 2
>> years hearing this is what it does to store sequence scares
>> **** out of me; you've really hit onto the hardest part of
>> the schema there.
>> 
>> As you said at the end of your email the best way to
>> accomplish this is by creating a SeqeunceProxyReader which
>> can do all this logic and lets you work with the "right"
>> objects and not have to re-implement that code. Now this
>> leaves a few alternatives to how you can represent this in
>> memory. We already have a 2bit implementation (will be called
>> TwoBitSequenceReader) for storing very large pieces of
>> Sequence but that only has support for ACGT and no support
>> for gaps or Ns. This could be extended to bring in support
>> for these as features or you could materialise that sequence
>> and then push it into another Sequence object I have been
>> working with (unchecked in atmo) which lets you join
>> Sequences together. This combined with a Sequence which
>> returns Compounds of a particular type e.g. Ns for any given
>> length would let you represent massive amounts of Sequence in
>> a very small amount of space. All of these updates will be in
>> place soon but I cannot say exactly when
>> 
>> The other option would be to cache chunks of the DNA indexed
>> by the seq_region_id. Pushing this into a LRU cache with soft
>> references (so they'll be cleaned up when you'd run out of
>> memory) could be a good way to go.
>> 
>> Either way the simple way really isn't the way to go IMHO; on
>> the flip side it would get you to a prototype quicker. Of
>> course this depends on what type of code you are writing. If
>> it is prototype code then great or if it's what normally
>> happens in Bioinformatics (we claim it's a prototype but in
>> reality it's the real deal) then go with the proper solution
>> 
>> Andy
>> 
>> On 13 May 2010, at 13:21, PATERSON Trevor wrote:
>> 
>>> Perhaps if I describe our initial use case and how we hope
>> to address
>>> it using Ibatis and BioJava API, I can get some pointers on
>> how much
>>> of this is already supported in BioJava, how much I am
>> going to need
>>> to implement and how I would be best doing this to generate
>> useful reusable code.
>>> 
>>> For each genome assembly build Ensembl stores different
>> levels of DNA
>>> sequence regions, it calls these coordinate_sytems (eg
>> clones, contigs, chromosomes etc).
>>> 
>>> For each genome assembly there is one 'TopLevel'
>> coordinate_system (eg chromosomes).
>>> And one 'SequenceLevel' coordinate_system  (eg contigs).
>>> 
>>> Each sequence region in the database records its length and
>>> coordinate_system BUT ONLY DNA regions which are at 'SequenceLevel'
>>> have actual DNA Sequence recorded, all other regions must
>> have their
>>> actual sequence recovered by 'projecting' from their level
>> to DNA regions at  'SequenceLevel'.
>>> 
>>> 
>>> so our initial use case is
>>> __________________________
>>> 
>>> 1. retrieve  Chromosome 25  for Chicken from the database.
>>> 
>>> What we get back are some properties (Name, coordinateSystemID and
>>> length)
>>> - and what we map this to in ibatis is an AssembledSequence
>> Object -
>>> with these properties
>>> 
>>> 2. fetch the sequence level assmbly details for this Chromosome.
>>> 
>>> We get back a table mapping from-to coordinates of the chromosome
>>> versus from to coordinates of the contigs that are at Sequence Level
>>> 
>>> diagramatically this looks like
>>> 
>>> 
>>> 
>> <--------------------------------------------------------------> chr25
>>> <-->  <---> <----> <--> <--> <----->     <--> <--> <---->
>> <---> contigs
>>>      <----->         <-->          <----> <------->
>>> 
>>> you will note that there are
>>>    - overlaps
>>>    - gaps
>>>    - potentially mismatches ( I am ignoring these for the moment)
>>> 
>>> 3. to get the DNA sequence, the ensembl perl api stitches
>> together the
>>> contigs into one 'Sequence' - filling gaps with gap
>> sequences of the
>>> correct length, so it generates an ordered list of mappings between
>>> the chromosome coordinate system  and coordinates of
>> contigs and gaps
>>> 
>>> <-->  <--->   ---> <-->    > <----->       ->        --->
>> <---> contigs
>>>          -->         <-->          <---->  ------->
>>>   nn            n         n       n                    nn      gaps
>>> 
>>> the perl api can then fetch the actual DNA sequence for any
>> region of
>>> the chromosome by looking up the contig regions it needs to
>> fetch the
>>> projected sequence of from this projection map.
>>> 
>>> Remember that chromosomes, contigs and gaps can all be very
>> long, or very short!
>>> 
>>> Our Java API
>>> ____________
>>> 
>>> I have mirrored what the perl api does
>>> 
>>> fetching a chromosome object - which Ibatis instantiates as an
>>> AssembledSequence object, which extends BioJava DNASequence
>> Object -
>>> but obviously just has a couple of new properties set at
>> this time (length, name, coord_system).
>>> 
>>> fetching an Assembly Object for this Chromosome Object -
>> this contains
>>> an ordered List of Mapping Objects which contain Source (ie the
>>> Chromosome), SourceCoordinates, Target (a new DNASequence
>> Object for
>>> each contig), TargetCoordinates
>>> 
>>> This Assembly Object can stitch together the Mapping Projection for
>>> all or some of the Chromosome, just like the perl API,
>> creating a new
>>> ordered List of Mapping Objects where the TargetCoordinates are
>>> alterred to remove overlaps, and new GapSequence objects have been
>>> inserted. [Gaps are problematic - do I really want
>> DNASequence Objects
>>> that contain N of length x, allowing me to use the Gaps
>> just like any
>>> other DNASequence but with all the overhead that invloves,
>> or should I
>>> just omit these mappings, or do i set the Target to Null in
>> a Mapping
>>> - and then I will need code to handle these wherever I use
>> sequences
>>> that contain null spacers - PERHAPS there is some
>> representation to handle Gaps generically in the BioJava API).
>>> 
>>> So now I am at the point of fetching actual DNA Sequence
>> for regions
>>> of interest on the Chromosome. This will invlove a look up of the
>>> stitched Mapping List for the contig regions to retieve
>> from Ensembl, and then setting the actual DNA sequence in these.
>>> 
>>> Hence my simplistic extension of DNA Sequences in the above
>> scenario
>>> falls over because of the Ibatis Bean requirement for setting
>>> properties directly on Objects, whivh i cant work around if
>> the DNASequence objects don't allow for setters.
>>> 
>>> I'm playing with lots of different ideas - possibly the simplest is
>>> just to forget about extending BioJava DNASequence for my ensembl
>>> objects (chromosomes, contigs)
>>> - and just create DNASequences for the 'real' Sequences that I get
>>> back as base strings from ensembl, which would then be
>> contained or referenced in my chromosomes/contig objects etc.
>>> I am sure however that this would mean that I end up having to
>>> reimplement much of the BioJava functionality in the new model
>>> Objects, whereas I was hoping to leverage this
>> transparently by simply extending DNASequence.
>>> 
>>> I guess one of my biggest concerns about extending BioJava to
>>> represent very big sequences is the potential overhead if i
>> have to instantiate them with backing stores containing the 'real'
>>> sequences - we are obvioulsy hoping to lazy load
>> (sub)sequences from
>>> ensembl when they are actually needed. We would have to be very
>>> careful to override all the methods that called back to the
>> backing store if we already had the information we needed or
>> could lazy load it, without grabbing the whole sequence.
>>> (e.g. the simple case of the chromosome - we have the
>> length from the
>>> initial query - so wouldn't want retrieve it from the
>> backing store).
>>> 
>>> So probably the correct way of doing things is to Implement
>> our own
>>> SequenceProxyReader for EnsemblAware Sequences to handle
>> lazy loads,
>>> which also provides all of the required backing store
>> functionality. As usual the correct way will turn out to be
>> the most work!
>>> 
>>> Cheers Trevor
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>> 
>>> 
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
>> 
>> 
>> 
>> 
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev