[Biojava-dev] Using DNASequence reverseComplement

PATERSON Trevor trevor.paterson at roslin.ed.ac.uk
Tue May 18 09:00:22 UTC 2010


Hi andy

I know this is all work in progress so I'm not too hung up on everything not working out the box..

What I am doing is stitching together contigs into an assembly, so I have an AssembledDNASequence Object, 
that extends DNASequence but defers its method to an Assembly Object which contains  a Map of DNASequences 
with ranges and orientations. This map is created from a single SQL call, and is populated with DNASequence 
Objects that lazy load their actual sequence data from the database when required

The only way I can check that the map is giving my the correct assembly is to print out 
the sequence as string of each component in order, 
ie 
Frag 1, from 1 to 200
Gap of 150 N
Frag 2, from 1 to 5000 REVERSED
Frag 3, from 150 to 6000
Gap of 2000 N

etc

I think your suggestion actually demonstrates the 'bug' ( or 'feature' ;)

seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString() 
returns the whole reverse complement - not just 1-4: AACCCGGGGTTTTT
Because the getSequenceAsString is looking at the parent of the view :

Something even more horrible happens if you try to bound that query
seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString(1,4,Strand.POSITIVE)
Returns AAAA ( ie this is 1-4 of the parent, not even the reverse complement.)
And perhaps more interestingly
seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString(1,4,Strand.NEGATIVE))
Also returns AAAA (which I just plain can't understand..)

___________________________________________________________________________________

Anyway - I am probably at a stage where I don't want to do any more development on this at the moment...
I am reasonably happy that an EnsemblAPI will be able to use the BioJava Sequence objects down the road.

I have made an EnsemblDNASequenceReader that extends and implements ProxySequenceReader
This  is used to create a DataSourceAwareDNASequence and responsible for lazy loading the actual sequence data 
from Ensembl when required

To do this I needed to give the AbstractSequences's (SequenceReader) sequenceStorage property and its getter protected rather 
than private access, because there needs to be exchange of info from one to the other

I will need to implement some sort of chunking and caching for retrieval of larger sequences, and maybe even pass
all the methods to a buffered reader for large sequences, but I'm not going to worry about this at the moment.

There are plenty of other things for me to protoype in ensembl at the moment!

Cheers Trevor








> -----Original Message-----
> From: Andy Yates [mailto:andyyatz at gmail.com] On Behalf Of Andy Yates
> Sent: 17 May 2010 17:08
> To: PATERSON Trevor
> Cc: biojava-dev at lists.open-bio.org
> Subject: Re: [Biojava-dev] Using DNASequence reverseComplement
> 
> Hi Trevor,
> 
> You've stumbled right into something myself & Scooter are 
> trying to clean-up now. The assumption I had originally made 
> is that all operations on things like getSequenceAsString() 
> would go via the view since that's where the logic is located 
> for both the reversing & the complementing code. That call 
> now delegates onto the backing store and not the view which 
> means you get these very odd results happening.
> 
> For the moment I think the following code would do what 
> you're expecting:
> 
> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString()
> 
> It's annoying because you're in the reverse coordinate system 
> you've got to reverse the original coordinates so asking for 
> position 11,14 just isn't going to work. The other way of 
> working with this would be to construct the views yourself 
> and pass in a subsequence of the original sequence i.e.
> 
> new ReversedSequenceView<NucleotideCompound>(new 
> ComplementSequenceView<NucleotideCompound>(seq.getSubSequence(11,14));
> 
> This is really a problem with the erasure of the Sequence 
> types from DNASequence. If DNASequence returned the same type 
> from its subsequence method then you would just call revComp 
> on that and it would have been fine.
> 
> The thing to take away from are:
> 
> * getSequenceAsString(Integer, Integer, Strand) is not well 
> supported atmo
> 
> * So long as we are sure it should remain it will be
> 
> * There should be no reason to materialise the Sequence into 
> a String to get a part of the API working. If there is then 
> we've messed up
> 
> Andy
> 
> p.s. The strand stuff is confusing; originally it was meant 
> to be +ve & -ve strands but assumed that the Sequence you had 
> was always on the +ve strand. Eventually the meaning will 
> come back but will require the methods to be more aware of 
> the strand DNA is on to make the right call about what you 
> want to do. This all ties in with circular genomes support 
> and locations 
> 
> On 17 May 2010, at 16:24, PATERSON Trevor wrote:
> 
> > Sorry for raising that behemoth earlier..
> > 
> > I have a separate problem with the DNASequence API -
> > 
> > Probably I just don't understand how to use the View objects
> > 
> > 
> > If I make a DNASequence
> > 
> > DNASequence seq = new DNASequence("AAAAACCCCGGGTT");
> > 
> > i.e. length = 14,
> > 
> > I might reasonably want to get the ReverseComplement of 
> bases 11-14, which should 'be' "AACC"
> > 
> > But I cannot manage to get this in one easy step....
> > 
> > seq.toString(): AAAAACCCCGGGTT --> FINE
> > 
> > seq.getReverseComplement().getSequenceAsString(): 
> AACCCGGGGTTTTT --> 
> > FINE
> > 
> > But when I try to use bounds on this complement - methods 
> refer back 
> > to the original seq's iterator, not the complement
> > 
> > 
> seq.getReverseComplement().getSequenceAsString(11,14,Strand.PO
> SITIVE): GGTT 
> > 	i.e the same as seq.getSequenceAsString(11,14,Strand.POSITIVE)
> > 
> seq.getReverseComplement().getSequenceAsString(11,14,Strand.NE
> GATIVE): TTGG
> > 	i.e the same as seq.getSequenceAsString(11,14,Strand.NEGATIVE)
> > 
> > Is this the desired behaviour? How would I get the desired 
> reverseComplement fragment?
> > 
> > The only obvious way that I can see is
> > 
> >                DNASequence subseq = new 
> DNASequence(seq.getSequenceAsString(11, 14, Strand.POSITIVE));
> >                System.out.println(""+ 
> > subseq.getReverseComplement().getSequenceAsString());
> > 
> > 
> ______________________________________________________________________
> > _______________________
> > 
> > On a related point I was mightily confused by the 
> > Strand.POSITIVE/Strand.NEGATIVE enumeration
> > 
> > I was naively interpreting them to refer to the strand of the DNA: 
> > Whereas they infact refer to the directionality of the Iterator *on 
> > the same Strand*
> > 
> > A better name might be Direction:FORWARDS/Direction.BACKWARDS?
> > Positive and negative strand has loaded biological meaning 
> for newbies 
> > like me ( sense versus antisense ) So I made the assumption that a 
> > Strand.NEGATIVE call would itself reverseComplement
> > --
> > The University of Edinburgh is a charitable body, registered in 
> > Scotland, with registration number SC005336.
> > 
> > 
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
> 
> -- 
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.





More information about the biojava-dev mailing list