[Biojava-dev] AbstractSequence

Andy Yates ayates at ebi.ac.uk
Tue May 4 16:10:35 UTC 2010


Okay, I'll go ahead with the commit then and check that it works as it goes in (to make sure I'm not going to annoy Andreas).

If that kind of functionality is baked into the Sequence implementation then I agree there is no reason why we should need SequenceView. I think the only reason it existed originally was to provide a quick way of returning sub-Sequences with relative indexes. The only thing to be aware of is that I extended ComplementSequenceView and ReversedSequenceView from the abstract classes. These should still exist in some capacity; however, their implementation means it would be easy to build a version which is _just_ a Sequence (it requires nothing fancy WRT SequenceViews, just that it is a decorated sequence).
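
To make the "decorated sequence" idea concrete, here is a minimal sketch. Every type in it (SimpleSequence, StringSequence, ReversedSequence, ComplementSequence, SequenceText) is a hypothetical stand-in invented for illustration, not the BioJava classes, and positions are assumed to be 1-based:

```java
// Hypothetical, simplified stand-in for a sequence type; not the BioJava API.
interface SimpleSequence {
    int getLength();

    // Position is 1-based.
    char getCompoundAt(int position);
}

// Plain sequence backed by a String.
final class StringSequence implements SimpleSequence {
    private final String data;

    StringSequence(String data) {
        this.data = data;
    }

    @Override
    public int getLength() {
        return data.length();
    }

    @Override
    public char getCompoundAt(int position) {
        if (position < 1 || position > data.length()) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        return data.charAt(position - 1);
    }
}

// Decorator that reads the wrapped sequence back to front.
final class ReversedSequence implements SimpleSequence {
    private final SimpleSequence wrapped;

    ReversedSequence(SimpleSequence wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public int getLength() {
        return wrapped.getLength();
    }

    @Override
    public char getCompoundAt(int position) {
        return wrapped.getCompoundAt(wrapped.getLength() - position + 1);
    }
}

// Decorator that complements each DNA base; anything unknown becomes 'N'.
final class ComplementSequence implements SimpleSequence {
    private final SimpleSequence wrapped;

    ComplementSequence(SimpleSequence wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public int getLength() {
        return wrapped.getLength();
    }

    @Override
    public char getCompoundAt(int position) {
        switch (wrapped.getCompoundAt(position)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return 'N';
        }
    }
}

final class SequenceText {
    // Renders any SimpleSequence as a String by walking positions 1..length.
    static String render(SimpleSequence sequence) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= sequence.getLength(); i++) {
            sb.append(sequence.getCompoundAt(i));
        }
        return sb.toString();
    }
}
```

Because each decorator is itself a SimpleSequence, a reverse complement is just one wrapped in the other, e.g. new ReversedSequence(new ComplementSequence(new StringSequence("AACG"))).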

Wow okay, evil stuff, but yeah, great if you can do it. Don't forget that we've got to think about circular coordinates as well. So long as we can bake it in early it should be fine ... or we offer a CircularChromosome class (which would be used for mitochondria & plasmids amongst other things).
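
As a rough illustration of what circular coordinates involve, the sketch below wraps 1-based positions around the origin. CircularCoordinates and its methods are hypothetical names made up for this example, and the convention that end < begin means "the region crosses the origin" is an assumption, not an agreed design:

```java
// Hypothetical helper for 1-based coordinates on a circular sequence.
final class CircularCoordinates {

    private CircularCoordinates() {
    }

    // Maps any integer position onto 1..length, wrapping around the origin.
    static int wrap(int position, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        return Math.floorMod(position - 1, length) + 1;
    }

    // Returns begin..end (1-based, inclusive). Assumption: when end < begin
    // the region runs through the end of the sequence and continues at 1.
    static String subSequence(String circular, int begin, int end) {
        int length = circular.length();
        int b = wrap(begin, length);
        int e = wrap(end, length);
        int span = Math.floorMod(e - b, length) + 1;
        StringBuilder sb = new StringBuilder(span);
        for (int i = 0; i < span; i++) {
            sb.append(circular.charAt(wrap(b + i, length) - 1));
        }
        return sb.toString();
    }
}
```

With a 10-base plasmid "AACCGGTTAC", subSequence(plasmid, 9, 2) reads positions 9, 10, 1, 2, while subSequence(plasmid, 3, 5) behaves like an ordinary linear sub-sequence.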

Okay, I'm not a fan of that kind of naming of coordinate methods but I know why you're doing it :). It does make it clearer ... so long as there is nothing that lets you index in 0-based coordinates then I'm happy. If you're working with sequence then you should only ever be working in 1-based indexing.
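
A small sketch of how 1-based, inclusive coordinates could combine with the parent/child idea discussed below. Only the names getBioBegin() and getBioEnd() come from this thread; RegionSequence, root() and child() are hypothetical, and this is not the actual AbstractSequence code:

```java
// Hypothetical sketch: a sequence region that either owns its data (root)
// or resolves it by walking up to a parent that does.
final class RegionSequence {
    private final String store;           // non-null only at the root
    private final RegionSequence parent;  // null at the root
    private final int bioBegin;           // 1-based, inclusive, in parent coordinates
    private final int bioEnd;             // 1-based, inclusive, in parent coordinates

    private RegionSequence(String store, RegionSequence parent, int bioBegin, int bioEnd) {
        this.store = store;
        this.parent = parent;
        this.bioBegin = bioBegin;
        this.bioEnd = bioEnd;
    }

    // Creates a root sequence that holds the actual data.
    static RegionSequence root(String data) {
        return new RegionSequence(data, null, 1, data.length());
    }

    // Creates a child covering bioBegin..bioEnd of this sequence (local coordinates).
    RegionSequence child(int bioBegin, int bioEnd) {
        if (bioBegin < 1 || bioEnd > getLength() || bioBegin > bioEnd) {
            throw new IllegalArgumentException("region " + bioBegin + ".." + bioEnd);
        }
        return new RegionSequence(null, this, bioBegin, bioEnd);
    }

    int getBioBegin() {
        return bioBegin;
    }

    int getBioEnd() {
        return bioEnd;
    }

    int getLength() {
        return bioEnd - bioBegin + 1;
    }

    // begin and end are 1-based, inclusive, and relative to this sequence.
    String getSubSequence(int begin, int end) {
        if (begin < 1 || end > getLength() || begin > end) {
            throw new IndexOutOfBoundsException("range " + begin + ".." + end);
        }
        if (parent == null) {
            // The 0-based conversion happens only here, next to the store.
            return store.substring(begin - 1, end);
        }
        // Translate local coordinates into the parent's and delegate upwards.
        int offset = bioBegin - 1;
        return parent.getSubSequence(begin + offset, end + offset);
    }
}
```

For example, with RegionSequence.root("AACCGGTTAC"), a child(3, 8) standing in for a gene and a further child(2, 4) of that standing in for a CDS, asking the innermost region for getSubSequence(1, 3) walks up two parents before touching the data. Negative strand handling is deliberately left out.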

Andy

On 4 May 2010, at 16:38, Scooter Willis wrote:

> Andy
> 
> I am knee deep in testing the changes, so don't worry about checking in your code; we can sort that out after I have it settled. I am trying to work through whether SequenceView is needed as an abstract concept, since I think I have now built that into the default behavior based on parent-child relationships in the sequences. You can get a sub-sequence of a DNA sequence of any type, where it will have a start and end based on the parent sequence. It won't work currently if you want to use relative indexing on a CDS region that is defined as an index relative to some high-level parent DNA scaffold sequence. Easy enough to add relative indexing. Not sure if SequenceView will be required.
> 
> I would like to be in a position where you can have a CDS sequence and ask for intron sequence data relative to the start of the CDS sequence with local coordinates. It gets complicated with the negative strand, where you want 3' or 5' data, but if I close my eyes and put the code in the correct spot it should work. If we get all the relationships correct and name the method calls correctly it should be straightforward. I also changed the getBegin and getEnd methods to getBioBegin() and getBioEnd() to be very clear we are using 1-based indexing for all APIs.
> 
> Scooter
> 
> 
> On May 4, 2010, at 11:25 AM, Andy Yates wrote:
> 
>> Hi Scooter,
>> 
>> So by the sounds of things, what we're really saying here is that we need a way of decorating the sequences to make them behave accordingly, whilst still letting someone instantiate an instance of a GeneSequence (for example) if required. That sounds a reasonable thing to do and the way that you've plotted it out sounds good as well.
>> 
>> The walking code shouldn't be too hard since each level should be able to delegate if & when required (to all intents & purposes each sequence is a backing store to another sequence).
>> 
>> So from my side of things I've got some things to check in. I've relaxed the typing rules on Fasta parsing/writing because it wouldn't let you write a Sequence<NucleotideCompound> object back out, which is poor (and happened only because of the changes I made). The only other code I have is an implementation of a 2bit sequence storage engine. Mostly because my group was trying their best to decipher a 2bit encoded sequence (as in encoded but not in the UCSC .2bit file format) and I decided to turn our efforts into an example SequenceBackingStore. That said, I'm a bit wary of committing it in since it is not something you _need_ and therefore would be better going into an extensions library (but we don't have one yet). What do you think?
>> 
>> Andy
>> 
>> On 4 May 2010, at 02:08, Scooter Willis wrote:
>> 
>>> Andy
>>> 
>>> Trying to finish up the code for the GFF parser, where we start with a scaffold/DNA sequence and, by mapping all the various CDS regions, can extract the encoded protein sequence, handling negative strand and phase shift.
>>> 
>>> Each DNASequence can have a collection of genes.
>>> Each gene will have a collection of TranscriptionSequences
>>> Each TranscriptionSequence will have a collection of CDSSequences regions with strand and phase attributes.
>>> From the CDSSequences owned by the parent GeneSequence we can pull out Intron/Exon sequences by superimposing all CDS regions, which will then form the exon regions. Whatever is not an exon region is then an intron region.
>>> 
>>> As it currently stands, DNASequence would actually contain the sequence data, and you can't create a GeneSequence without passing in a parent DNA sequence. GeneSequence, TranscriptSequence and CDSSequence all extend DNASequence but do not have a reference to a backend store, though for all modeling purposes they are DNASequences. When I call getSubSequence(begin,end) for the CDS sequence we don't handle the case where we need to walk up the parents to find a valid backend store. I should be able to fix it with some minor changes in AbstractSequence and by giving AbstractSequence a reference to a possible ParentSequence.
>>> 
>>> Before making any changes I wanted to make sure you are all checked in so I don't run into major architectural changes on your end. I will be working on this genome-related code for the next 30 days, which will help me allocate time to getting the core architecture fully functional.
>>> 
>>> Thanks
>>> 
>>> Scooter
>> 
>> -- 
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/







