[Biojava-dev] DNASequence not being a bean

Andy Yates ayates at ebi.ac.uk
Tue May 11 16:16:25 UTC 2010


Hi guys,

Just in the middle of a reply & I saw this response :)

Firstly can I say I'm glad that you're trying to use this new code Trevor; it'll provide a very useful & driving use-case for the new API and I'm also glad that the Ensembl API is still being developed. It's bound to help us out in the long run.

So I see two ways around this. As Scooter said, you can implement a SequenceProxyReader to do the work of getting & setting data to and from the Ensembl schema. Bulk loading the data from Ensembl probably isn't a great idea, as the sequence can be quite large depending on who has built the DB. If I remember correctly, the Perl API grabs 250KB at a time and puts the chunks in an LRU cache. This would be my first port of call; however, it does still mean you need a way of doing bean-style construction.
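
A minimal sketch of that chunk-and-cache idea, in plain Java rather than against the BioJava API. ChunkSource, InMemoryChunkSource and ChunkedSequenceCache are hypothetical names, and the chunk size and cache capacity are only illustrative; a real SequenceProxyReader would presumably wrap something like this and fetch from the Ensembl schema rather than from a String.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for whatever actually fetches bases from the database.
interface ChunkSource {
    String fetch(int start, int length); // start is 0-based
    int totalLength();
}

// In-memory source used only to exercise the cache; counts how often it is hit.
class InMemoryChunkSource implements ChunkSource {
    private final String bases;
    int fetchCount = 0;
    InMemoryChunkSource(String bases) { this.bases = bases; }
    public String fetch(int start, int length) {
        fetchCount++;
        return bases.substring(start, start + length);
    }
    public int totalLength() { return bases.length(); }
}

// Caches fixed-size chunks and evicts the least recently used one when full.
class ChunkedSequenceCache {
    private final ChunkSource source;
    private final int chunkSize;
    private final Map<Integer, String> chunks;

    ChunkedSequenceCache(ChunkSource source, int chunkSize, final int maxChunks) {
        this.source = source;
        this.chunkSize = chunkSize;
        // accessOrder = true keeps entries ordered from least to most recently used
        this.chunks = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > maxChunks;
            }
        };
    }

    // Returns the bases in [start, end), 0-based, loading only overlapping chunks.
    String subSequence(int start, int end) {
        if (start < 0 || end > source.totalLength() || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        StringBuilder sb = new StringBuilder(end - start);
        int pos = start;
        while (pos < end) {
            int index = pos / chunkSize;
            String chunk = chunk(index);
            int offset = pos - index * chunkSize;
            int take = Math.min(chunk.length() - offset, end - pos);
            sb.append(chunk, offset, offset + take);
            pos += take;
        }
        return sb.toString();
    }

    private String chunk(int index) {
        String chunk = chunks.get(index);
        if (chunk == null) {
            int chunkStart = index * chunkSize;
            int length = Math.min(chunkSize, source.totalLength() - chunkStart);
            chunk = source.fetch(chunkStart, length);
            chunks.put(index, chunk);
        }
        return chunk;
    }

    int cachedChunks() { return chunks.size(); }
}

class ChunkCacheDemo {
    public static void main(String[] args) {
        InMemoryChunkSource source = new InMemoryChunkSource("ACGTACGTTTGGCCAA");
        // Tiny numbers for the demo; the Perl API figure mentioned above is 250KB per chunk.
        ChunkedSequenceCache cache = new ChunkedSequenceCache(source, 4, 2);
        System.out.println(cache.subSequence(2, 10));
        System.out.println("fetches: " + source.fetchCount);
    }
}
```

The point of the design is that a request for a feature-sized region only touches the chunks it overlaps, and memory use is bounded by the chunk size times the cache capacity.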

The other option is not to use the existing hierarchy of objects and to implement your own Sequence<NucleotideCompound>. This does mean that, whatever we do for working with features in BioJava, you may only be able to leverage a portion of it. However, using that interface and generic type does give you complete freedom to do whatever you want to.

Andy

On 11 May 2010, at 17:03, Scooter Willis wrote:

> Trevor
> 
> Andy Yates and I are knee deep in this at the moment and about to do a code check in to help clarify some of the concepts with classes that have better descriptive names. We will send out an email when we have all the issues resolved. SequenceProxyLoader concept has a name change to SequenceProxyReader to help leverage the abstract concepts of InputStreams or File Readers in Java. We realize it is confusing at the moment and I am working on examples to make it a little clearer.  I had an internal deadline for a project where I needed the code so haven't had time to do test cases and give examples. That deadline was this morning so now I can get back to balanced programming and get this finalized. 
> 
> The backing store and Sequence Reader are interfaces to the same concept. The Sequence is stored either on disk, as objects, in a string, at uniprot, in a database etc where the AbstractSequence doesn't need to know those details. Andy Yates is working on the storage mechanism to allow edits etc so he owns that portion of the code. I agree having an empty constructor is important to be a proper Bean. It does however place an additional programming contract burden on developers that are just getting started that the object is not valid if you don't call setSequence(). Easy enough to make the change if we don't get any feedback arguing against. 
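
A minimal sketch of the contract burden described above, assuming a bean with an empty constructor and a setSequence() method. BeanStyleSequence is a hypothetical class, not part of BioJava; it only shows one way to fail loudly when the object is used before setSequence() has been called.

```java
// Hypothetical bean-style sequence holder; not a BioJava class.
// Declared package-private so the sketch fits in one file; a real bean would be public.
class BeanStyleSequence {
    private String sequence; // stays null until setSequence() is called

    // Empty constructor, as bean-aware mapping tools expect.
    public BeanStyleSequence() { }

    public void setSequence(String sequence) {
        if (sequence == null) {
            throw new IllegalArgumentException("sequence must not be null");
        }
        this.sequence = sequence;
    }

    // Lets callers check the contract without triggering an exception.
    public boolean isValid() {
        return sequence != null;
    }

    public String getSequenceAsString() {
        requireSequence();
        return sequence;
    }

    public int getLength() {
        requireSequence();
        return sequence.length();
    }

    // Single place where the "object is not valid yet" rule is enforced.
    private void requireSequence() {
        if (sequence == null) {
            throw new IllegalStateException("setSequence() has not been called");
        }
    }
}

class BeanStyleDemo {
    public static void main(String[] args) {
        BeanStyleSequence seq = new BeanStyleSequence();
        System.out.println("valid before set: " + seq.isValid());
        seq.setSequence("ACGT");
        System.out.println("length after set: " + seq.getLength());
    }
}
```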
> 
> I have added in parent child relationships where you can't create a TranscriptSequence on its own, as that must be a child of a GeneSequence, which must be a child of a DNASequence. Only the DNASequence has a constructor exposed to pass in the actual sequence data. Indexing for features is relative to the sequence storage. Instead of having Features as place holders we are trying to model the relationships of a Feature as a class with methods that correspond to the Feature. I don't want to expose the empty constructor on these types of sequences because I want to enforce the relationships; if not, lots of code checking has to occur in the base classes. For example, when you add a CDS feature to a TranscriptSequence you get back a CDSSequence; based on the parent child relationships, when you actually ask for the sequence as a string the underlying code will find the parent with the backing store/Sequence reader and get the sub-sequence. Of course you can pass in a null sequence as the parent, but I can also defend against that in one spot to enforce a contract for what is expected.
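
A rough sketch of that parent child lookup, using hypothetical classes (RegionSketch, DnaSketch, GeneSketch, TranscriptSketch, CdsSketch) rather than the real BioJava ones. It assumes 1-based, inclusive coordinates expressed relative to the backing store, and shows the null-parent check living in a single constructor.

```java
import java.util.Objects;

// All names are hypothetical. Coordinates are assumed to be 1-based, inclusive,
// and relative to the backing store (the DNA), not to the immediate parent.
abstract class RegionSketch {
    abstract String basesBetween(int start, int end);
    abstract String getSequenceAsString();
}

// The only class that holds sequence data and takes it in its constructor.
final class DnaSketch extends RegionSketch {
    private final String bases;
    DnaSketch(String bases) { this.bases = Objects.requireNonNull(bases, "bases"); }
    @Override String basesBetween(int start, int end) { return bases.substring(start - 1, end); }
    @Override String getSequenceAsString() { return bases; }
    GeneSketch addGene(int start, int end) { return new GeneSketch(this, start, end); }
}

// Children hold no bases; they forward requests up the chain to the backing store.
abstract class ChildSketch extends RegionSketch {
    private final RegionSketch parent;
    private final int start;
    private final int end;

    ChildSketch(RegionSketch parent, int start, int end) {
        // The one spot where a null parent is rejected.
        this.parent = Objects.requireNonNull(parent, "parent");
        this.start = start;
        this.end = end;
    }
    @Override String basesBetween(int s, int e) { return parent.basesBetween(s, e); }
    @Override String getSequenceAsString() { return parent.basesBetween(start, end); }
}

final class GeneSketch extends ChildSketch {
    GeneSketch(DnaSketch parent, int start, int end) { super(parent, start, end); }
    TranscriptSketch addTranscript(int start, int end) { return new TranscriptSketch(this, start, end); }
}

final class TranscriptSketch extends ChildSketch {
    TranscriptSketch(GeneSketch parent, int start, int end) { super(parent, start, end); }
    CdsSketch addCds(int start, int end) { return new CdsSketch(this, start, end); }
}

final class CdsSketch extends ChildSketch {
    CdsSketch(TranscriptSketch parent, int start, int end) { super(parent, start, end); }
}

class ParentChildDemo {
    public static void main(String[] args) {
        DnaSketch dna = new DnaSketch("AAACCCGGGTTT");
        CdsSketch cds = dna.addGene(4, 12).addTranscript(4, 9).addCds(7, 9);
        System.out.println(cds.getSequenceAsString());
    }
}
```

The constructor parameter types force the DNA to Gene to Transcript to CDS ordering at compile time; in a real library the child constructors would be hidden so that the add methods are the only way in.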
> 
> Can you describe the scenario where you would create Sequences and then at some point in the future give the actual SequenceSQLReader implementation to load the data? Your SequenceSQLReader could load the actual data at the time of instantiation, or all calls that actually consume data could make a check/call to the lazyload/init. Lots of ways to optimize based on your use case. In one example I use if(isInit() == false) init() in each of the methods, which makes the lazy loading fairly easy. If you are working with sequence data at the chromosome level then it is optimal to send all calls to the SQL server, because you are typically grabbing sub-sequence data that covers a feature, so you don't take the local memory hit or load time of pulling all the data. The goal was to make it flexible and I think it will work for almost any case except for the one you pointed out, where an empty constructor is needed for integration with bean-aware tools.
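
A minimal sketch of the if(isInit() == false) init() pattern mentioned above. LazySequenceReader is a hypothetical class, and the Supplier stands in for whatever query a SequenceSQLReader would actually run. It is not thread-safe as written.

```java
import java.util.function.Supplier;

// Hypothetical lazy reader: nothing is loaded until a method needs the data.
class LazySequenceReader {
    private final Supplier<String> loader; // stand-in for the real SQL fetch
    private String sequence;
    private boolean initialised = false;

    LazySequenceReader(Supplier<String> loader) {
        this.loader = loader;
    }

    boolean isInit() {
        return initialised;
    }

    // Runs the (possibly expensive) load once.
    private void init() {
        sequence = loader.get();
        initialised = true;
    }

    int getLength() {
        if (isInit() == false) init();
        return sequence.length();
    }

    // start and end are 1-based and inclusive here, which is an assumption.
    String getSubSequence(int start, int end) {
        if (isInit() == false) init();
        return sequence.substring(start - 1, end);
    }
}

class LazyReaderDemo {
    public static void main(String[] args) {
        LazySequenceReader reader = new LazySequenceReader(() -> {
            System.out.println("loading...");
            return "ACGTACGT";
        });
        System.out.println("initialised: " + reader.isInit());
        System.out.println(reader.getSubSequence(3, 6));
        System.out.println("length: " + reader.getLength());
    }
}
```

Creating the reader is cheap, so an object mapper can hand it to a sequence up front and the database is only hit on first use.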
> 
> Scooter
> 
> 
> On May 11, 2010, at 10:45 AM, PATERSON Trevor wrote:
> 
>> Hi
>> 
>> as you may be aware I am working with Andy Law at Roslin  to kick off development of a Java version of the ensembl-api.
>> 
>> It makes good sense for us to integrate with the new Bio-Java code, however we have a few fundamental issues with the immutability of Sequence objects.
>> 
>> Because the BioJava Sequence objects require initialization at construction time with the actual sequence, we will have problems using iBATIS mapping to create Sequences from SQL queries, and we will not easily be able to use LazyLoad to fetch the sequence only when we need it.
>> 
>> iBATIS uses bean setters to set properties on beans, which must have an empty constructor
>> 
>> - so  setSequence(String seqString, CompoundSet<C> compoundSet) 
>> and setSequence(SequenceProxyLoader<C> proxyLoader, CompoundSet<C> compoundSet) would be very useful
>> 
>> Alternatively, we could hack round this if we could access the backing store in our own subclasses of DNASequence, i.e. give the Sequence properties 'protected' rather than 'private' visibility.
>> 
>> 
>> In essence what we are wanting to do is implement DatasourceAware subclasses of BioJava DNASequence, which we can retrieve partially filled in from Ensembl, but only retrieve and set the actual DNASequence by LazyLoad when we want the DNA Sequence. 
>> 
>> It may be that we can implement this lazy loading using extensions of the SequenceProxyLoader interface (I am guessing that this is what it is for) but again we'd still benefit from accessing the backing store directly.
>> 
>> Obviously I am not up-to-speed with the design ideas behind the DNASequence Object and maybe I am barking  up the wrong tree in trying to subclass it to my own ends, so any hints and tips would be most welcome.
>> 
>> cheers for now
>> 
>> Trevor Paterson PhD
>> new email trevor.paterson at roslin.ed.ac.uk
>> 
>> Bioinformatics 
>> The Roslin Institute
>> The Royal (Dick) School of Veterinary Studies
>> University of Edinburgh
>> Scotland EH25 9PS
>> phone +44 (0)131 5274197
>> http://www.roslin.ed.ac.uk
>> http://www.resspecies.org
>> http://www.thearkdb.org
>> 
>> 
>> 
>> 
>> -- 
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>> 
>> 
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
