[Biojava-dev] Using DNASequence reverseComplement

Andy Yates ayates at ebi.ac.uk
Tue May 18 09:23:14 UTC 2010


At the moment if you ask for these they are returned as views on the original sequence; the idea of this was that Sequence is immutable. If mutability is brought in then that does raise some serious issues for the code but under the current situation then this is a safe & memory efficient way of getting this working. 

Andy

On 18 May 2010, at 10:10, LAW Andrew wrote:

> Another philosophical question from the sidelines.
> 
> Does a BJ3 object return a *new* object when asked for a reverseComplement/subSequence/whatever (I suspect from what I have read here and heard from Trevor that it does not)?
> 
> Should it?
> 
> 
> If I have DNA sequence object A and I ask it for the reverseComplement of the last 16 bases (call it sequence B) and then modify sequence A to change the last base (if that is possible?), what would happen to the sequence B? (And what do we think *should* happen to sequence B)?
> 
> 
> 
> On 18 May 2010, at 10:00, PATERSON Trevor wrote:
> 
>> Hi andy
>> 
>> I know this is all work in progress so I'm not too hung up on everything not working out the box..
>> 
>> What I am doing is stitching together contigs into an assembly, so I have an AssembledDNASequence Object, 
>> that extends DNASequence but defers its method to an Assembly Object which contains  a Map of DNASequences 
>> with ranges and orientations. This map is created from a single SQL call, and is populated with DNASequence 
>> Objects that lazy load their actual sequence data from the database when required
>> 
>> The only way I can check that the map is giving my the correct assembly is to print out 
>> the sequence as string of each component in order, 
>> ie 
>> Frag 1, from 1 to 200
>> Gap of 150 N
>> Frag 2, from 1 to 5000 REVERSED
>> Frag 3, from 150 to 6000
>> Gap of 2000 N
>> 
>> etc
>> 
>> I think your suggestion actually demonstrates the 'bug' ( or 'feature' ;)
>> 
>> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString() 
>> returns the whole reverse complement - not just 1-4: AACCCGGGGTTTTT
>> Because the getSequenceAsString is looking at the parent of the view :
>> 
>> Something even more horrible happens if you try to bound that query
>> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString(1,4,Strand.POSITIVE)
>> Returns AAAA ( ie this is 1-4 of the parent, not even the reverse complement.)
>> And perhaps more interestingly
>> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString(1,4,Strand.NEGATIVE))
>> Also returns AAAA (which I just plain can't understand..)
>> 
>> ___________________________________________________________________________________
>> 
>> Anyway - I am probably at a stage where I don't want to do any more development on this at the moment...
>> I am reasonably happy that an EnsemblAPI will be able to use the BioJava Sequence objects down the road.
>> 
>> I have made an EnsemblDNASequenceReader that extends and implements ProxySequenceReader
>> This  is used to create a DataSourceAwareDNASequence and responsible for lazy loading the actual sequence data 
>> from Ensembl when required
>> 
>> To do this I needed to give the AbstractSequences's (SequenceReader) sequenceStorage property and its getter protected rather 
>> than private access, because there needs to be exchange of info from one to the other
>> 
>> I will need to implement some sort of chunking and caching for retrieval of larger sequences, and maybe even pass
>> all the methods to a buffered reader for large sequences, but I'm not going to worry about this at the moment.
>> 
>> There are plenty of other things for me to protoype in ensembl at the moment!
>> 
>> Cheers Trevor
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> -----Original Message-----
>>> From: Andy Yates [mailto:andyyatz at gmail.com] On Behalf Of Andy Yates
>>> Sent: 17 May 2010 17:08
>>> To: PATERSON Trevor
>>> Cc: biojava-dev at lists.open-bio.org
>>> Subject: Re: [Biojava-dev] Using DNASequence reverseComplement
>>> 
>>> Hi Trevor,
>>> 
>>> You've stumbled right into something myself & Scooter are 
>>> trying to clean-up now. The assumption I had originally made 
>>> is that all operations on things like getSequenceAsString() 
>>> would go via the view since that's where the logic is located 
>>> for both the reversing & the complementing code. That call 
>>> now delegates onto the backing store and not the view which 
>>> means you get these very odd results happening.
>>> 
>>> For the moment I think the following code would do what 
>>> you're expecting:
>>> 
>>> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString()
>>> 
>>> It's annoying because you're in the reverse coordinate system 
>>> you've got to reverse the original coordinates so asking for 
>>> position 11,14 just isn't going to work. The other way of 
>>> working with this would be to construct the views yourself 
>>> and pass in a subsequence of the original sequence i.e.
>>> 
>>> new ReversedSequenceView<NucleotideCompound>(new 
>>> ComplementSequenceView<NucleotideCompound>(seq.getSubSequence(11,14));
>>> 
>>> This is really a problem with the erasure of the Sequence 
>>> types from DNASequence. If DNASequence returned the same type 
>>> from its subsequence method then you would just call revComp 
>>> on that and it would have been fine.
>>> 
>>> The thing to take away from are:
>>> 
>>> * getSequenceAsString(Integer, Integer, Strand) is not well 
>>> supported atmo
>>> 
>>> * So long as we are sure it should remain it will be
>>> 
>>> * There should be no reason to materialise the Sequence into 
>>> a String to get a part of the API working. If there is then 
>>> we've messed up
>>> 
>>> Andy
>>> 
>>> p.s. The strand stuff is confusing; originally it was meant 
>>> to be +ve & -ve strands but assumed that the Sequence you had 
>>> was always on the +ve strand. Eventually the meaning will 
>>> come back but will require the methods to be more aware of 
>>> the strand DNA is on to make the right call about what you 
>>> want to do. This all ties in with circular genomes support 
>>> and locations 
>>> 
>>> On 17 May 2010, at 16:24, PATERSON Trevor wrote:
>>> 
>>>> Sorry for raising that behemoth earlier..
>>>> 
>>>> I have a separate problem with the DNASequence API -
>>>> 
>>>> Probably I just don't understand how to use the View objects
>>>> 
>>>> 
>>>> If I make a DNASequence
>>>> 
>>>> DNASequence seq = new DNASequence("AAAAACCCCGGGTT");
>>>> 
>>>> i.e. length = 14,
>>>> 
>>>> I might reasonably want to get the ReverseComplement of 
>>> bases 11-14, which should 'be' "AACC"
>>>> 
>>>> But I cannot manage to get this in one easy step....
>>>> 
>>>> seq.toString(): AAAAACCCCGGGTT --> FINE
>>>> 
>>>> seq.getReverseComplement().getSequenceAsString(): 
>>> AACCCGGGGTTTTT --> 
>>>> FINE
>>>> 
>>>> But when I try to use bounds on this complement - methods 
>>> refer back 
>>>> to the original seq's iterator, not the complement
>>>> 
>>>> 
>>> seq.getReverseComplement().getSequenceAsString(11,14,Strand.PO
>>> SITIVE): GGTT 
>>>> 	i.e the same as seq.getSequenceAsString(11,14,Strand.POSITIVE)
>>>> 
>>> seq.getReverseComplement().getSequenceAsString(11,14,Strand.NE
>>> GATIVE): TTGG
>>>> 	i.e the same as seq.getSequenceAsString(11,14,Strand.NEGATIVE)
>>>> 
>>>> Is this the desired behaviour? How would I get the desired 
>>> reverseComplement fragment?
>>>> 
>>>> The only obvious way that I can see is
>>>> 
>>>>              DNASequence subseq = new 
>>> DNASequence(seq.getSequenceAsString(11, 14, Strand.POSITIVE));
>>>>              System.out.println(""+ 
>>>> subseq.getReverseComplement().getSequenceAsString());
>>>> 
>>>> 
>>> ______________________________________________________________________
>>>> _______________________
>>>> 
>>>> On a related point I was mightily confused by the 
>>>> Strand.POSITIVE/Strand.NEGATIVE enumeration
>>>> 
>>>> I was naively interpreting them to refer to the strand of the DNA: 
>>>> Whereas they infact refer to the directionality of the Iterator *on 
>>>> the same Strand*
>>>> 
>>>> A better name might be Direction:FORWARDS/Direction.BACKWARDS?
>>>> Positive and negative strand has loaded biological meaning 
>>> for newbies 
>>>> like me ( sense versus antisense ) So I made the assumption that a 
>>>> Strand.NEGATIVE call would itself reverseComplement
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in 
>>>> Scotland, with registration number SC005336.
>>>> 
>>>> 
>>>> _______________________________________________
>>>> biojava-dev mailing list
>>>> biojava-dev at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>> 
>>> -- 
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>> 
>>> 
>>> 
>>> 
>>> 
>> -- 
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>> 
>> 
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> 
> Later,
> 
> Andy
> --------
> Yada, yada, yada...
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336
> Disclaimer: This e-mail and any attachments are confidential and intended solely for the use of the recipient(s) to whom they are addressed. If you have received it in error, please destroy all copies and inform the sender.
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/








More information about the biojava-dev mailing list