[Biojava-dev] AbstractSequence

Tue May 4 15:25:44 UTC 2010

Hi Scooter,

So by the sounds of things, what we're really saying is that we need a way of decorating the sequences so that they behave appropriately, whilst still letting someone instantiate a GeneSequence (for example) if required. That sounds like a reasonable thing to do, and the way that you've plotted it out sounds good as well.

The walking code shouldn't be too hard since each level should be able to delegate if & when required (to all intents & purposes each sequence is a backing store to another sequence).
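
As a rough sketch of that delegation (not the real BioJava classes; Store, StringStore, Seq and the offset field are all hypothetical names made up for illustration), a sequence without its own store could shift the coordinates onto its parent and ask again until something with a store answers:

```java
import java.util.Objects;

// Hypothetical, simplified stand-ins; these are NOT the BioJava classes.
public class ParentWalkSketch {

    /** Minimal backing store; coordinates are 1-based and inclusive. */
    interface Store {
        String sub(int begin, int end);
    }

    /** Assumed store implementation that simply wraps a String. */
    static class StringStore implements Store {
        private final String data;

        StringStore(String data) {
            this.data = Objects.requireNonNull(data);
        }

        @Override
        public String sub(int begin, int end) {
            return data.substring(begin - 1, end);
        }
    }

    /** A sequence that either owns a store or is a view onto a parent. */
    static class Seq {
        private final Store store; // null when this sequence is only a view
        private final Seq parent;  // null for the top-level sequence
        private final int offset;  // 1-based start of this view on its parent

        Seq(Store store) {
            this(store, null, 1);
        }

        Seq(Seq parent, int offset) {
            this(null, parent, offset);
        }

        private Seq(Store store, Seq parent, int offset) {
            this.store = store;
            this.parent = parent;
            this.offset = offset;
        }

        String getSubSequence(int begin, int end) {
            if (store != null) {
                return store.sub(begin, end); // we hold the data ourselves
            }
            if (parent == null) {
                throw new IllegalStateException("No backing store found");
            }
            // Translate into the parent's coordinates and delegate upwards.
            int shift = offset - 1;
            return parent.getSubSequence(begin + shift, end + shift);
        }
    }

    public static void main(String[] args) {
        Seq dna = new Seq(new StringStore("AACCGGTTAACC"));
        Seq gene = new Seq(dna, 3); // gene starts at base 3 of the DNA
        Seq cds = new Seq(gene, 3); // CDS starts at base 3 of the gene
        System.out.println(cds.getSubSequence(1, 4)); // prints GGTT
    }
}
```

Strand and phase are deliberately left out here; a negative-strand view would also need to reverse-complement on the way back down.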

So from my side of things I've got some changes to check in. I've relaxed the typing rules on Fasta parsing/writing because they wouldn't let you write a Sequence<NucleotideCompound> object back out, which is poor (and happened only because of the changes I made). The only other code I have is an implementation of a 2bit sequence storage engine, mostly because my group was trying their best to decipher a 2bit encoded sequence (as in 2bit encoded, but not in the UCSC .2bit file format) and I decided to turn our efforts into an example SequenceBackingStore. That said, I'm a bit wary of committing it since it is not something you _need_, and it would therefore be better off in an extensions library (but we don't have one yet). What do you think?
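
To illustrate what 2bit storage means in general (this is a minimal sketch, not the code described above and not the UCSC .2bit layout; the class name TwoBitSketch and the A=0, C=1, G=2, T=3 mapping are assumptions), four bases can be packed into each byte like this:

```java
// Hypothetical illustration of 2-bit packing; the base-to-code mapping is
// an arbitrary assumption and ambiguity codes such as N are not handled.
public class TwoBitSketch {

    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    private final byte[] packed; // four bases per byte
    private final int length;    // number of bases stored

    public TwoBitSketch(String dna) {
        length = dna.length();
        packed = new byte[(length + 3) / 4];
        for (int i = 0; i < length; i++) {
            int code = encode(dna.charAt(i));
            // Base i occupies bits (i % 4) * 2 and the one above it.
            packed[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
    }

    private static int encode(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:
                throw new IllegalArgumentException("Unsupported base: " + base);
        }
    }

    /** Returns the base at a 1-based position. */
    public char baseAt(int position) {
        if (position < 1 || position > length) {
            throw new IndexOutOfBoundsException("Position: " + position);
        }
        int i = position - 1;
        int code = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
        return BASES[code];
    }

    /** Returns the bases from begin to end, 1-based and inclusive. */
    public String subSequence(int begin, int end) {
        StringBuilder sb = new StringBuilder(Math.max(0, end - begin + 1));
        for (int p = begin; p <= end; p++) {
            sb.append(baseAt(p));
        }
        return sb.toString();
    }

    public int bytesUsed() {
        return packed.length;
    }

    public static void main(String[] args) {
        TwoBitSketch seq = new TwoBitSketch("ACGTTGCA");
        System.out.println(seq.subSequence(3, 6)); // prints GTTG
        System.out.println(seq.bytesUsed());       // prints 2
    }
}
```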

Andy

On 4 May 2010, at 02:08, Scooter Willis wrote:

> Andy
> 
> I'm trying to finish up the code for the GFF parser, where we start with a scaffold/DNA sequence and, by mapping all the various CDS regions, can extract the encoded protein sequence, handling negative strand and phase shift.
> 
> Each DNASequence can have a collection of genes.
> Each gene will have a collection of TranscriptionSequences.
> Each TranscriptionSequence will have a collection of CDSSequence regions with strand and phase attributes.
> From the CDSSequences owned by the parent GeneSequence we can pull out intron/exon sequences by superimposing all CDS regions, which together form the exon regions. Whatever is not in an exon region is intron.
> 
> As it currently stands, DNASequence actually contains the sequence data, and you can't create a GeneSequence without passing in a parent DNA sequence. GeneSequence, TranscriptSequence and CDSSequence all extend DNASequence but do not have a reference to a backend store, although for all modeling purposes they are DNASequences. When I call getSubSequence(begin,end) on a CDS sequence, we don't handle the case where we need to walk up the parents to find a valid backend store. I should be able to fix it with some minor changes in AbstractSequence, giving AbstractSequence a reference to a possible ParentSequence.
> 
> Before making any changes I wanted to make sure you are all checked in, so I don't run into major architectural changes on your end. I will be working on this genome-related code for the next 30 days, which will help me allocate time to getting the core architecture fully functional.
> 
> Thanks
> 
> Scooter

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/