[Biojava-l] K-mers
jitesh dundas
jbdundas at gmail.com
Fri Oct 29 10:04:35 UTC 2010
You are right again, my friend. That would definitely hang my machine
during the XML file parsing.
This is about sequence alignment and related modules.
I will look at this today and send a fix. I hope that you can help.
PS: What about pattern matching in sequences? Would it be interesting to have in
BioJava 3?
Regards,
JD
On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> Okay couple of points here:
>
> 1). Which biojava3 module? This sounds like something for the genomic module
> rather than core
>
> 2). It'll need some more work. I'm not happy about using the
> WindowedSequenceView in its current state. I think an alteration to avoid it
> making Lists would be a good idea (plus recent developments in the API as to
> its main use means this is a viable change). Also it should return the
> overlapping ones in base order i.e. 1->3, 2->4 not 1->3, 4->6
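
To make the ordering in point 2 concrete, here is a minimal sketch in plain Java with no BioJava dependency. The class KmerWindows and its windows method are hypothetical names used only for illustration. It assumes the sequence is available as a String and drops any trailing fragment shorter than k. It still builds a List, so it illustrates only the base-order stepping, not the memory concern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava: lists k-mer windows over a plain String.
class KmerWindows {

    // Returns the k-mers of seq whose start positions advance by `step`.
    // step = 1 gives overlapping windows in base order (1->3, 2->4, ...);
    // step = k gives non-overlapping windows (1->3, 4->6, ...).
    // A trailing fragment shorter than k is dropped.
    static List<String> windows(String seq, int k, int step) {
        if (k <= 0 || step <= 0) {
            throw new IllegalArgumentException("k and step must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + k <= seq.length(); start += step) {
            out.add(seq.substring(start, start + k));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(windows("ATGATC", 3, 1)); // prints [ATG, TGA, GAT, ATC]
        System.out.println(windows("ATGATC", 3, 3)); // prints [ATG, ATC]
    }
}
```

With step = 1 the windows come out as 1->3, 2->4, 3->5, 4->6, which is the base order asked for above.
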
>
> Comments?
>
> Andy
>
> On 29 Oct 2010, at 10:12, jitesh dundas wrote:
>
>> Dear Friends,
>>
>> Thanks to Vishal & Andy for this. I actually needed this code too.
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to BioJava 3?
>>
>> Thanks again.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in BioJava at
>>> the moment. However, it is possible to do it with BioJava3:
>>>
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>       i, d.getLength());
>>>     WindowedSequence<NucleotideCompound> w =
>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>     l.add(w);
>>>   }
>>>
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>     for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>     }
>>>   }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>     System.out.println(subList);
>>>   }
>>> }
>>>
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds, so k-mer generation can/will be a memory-intensive operation. This
>>> does not mean it has to be, since sub-sequences are thin wrappers around an
>>> underlying sequence. Also, the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through, delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA).
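
The "thin wrapper" idea can be sketched outside BioJava as follows. This is only an illustration under stated assumptions: LazyKmers is a hypothetical class, the sequence is assumed to be available as a CharSequence, and java.nio.CharBuffer stands in for a sequence view because its subSequence is documented to share the underlying content rather than copy it. Each k-mer is produced on demand, stepping one base at a time, so no list of windows is built.

```java
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of BioJava: walks overlapping k-mers lazily.
class LazyKmers implements Iterable<CharSequence> {
    private final CharBuffer bases; // read-only view over the sequence text
    private final int k;

    LazyKmers(CharSequence seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.bases = CharBuffer.wrap(seq);
        this.k = k;
    }

    @Override
    public Iterator<CharSequence> iterator() {
        return new Iterator<CharSequence>() {
            private int start = 0; // 0-based start of the next window

            @Override
            public boolean hasNext() {
                return start + k <= bases.length();
            }

            @Override
            public CharSequence next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // A CharBuffer sub-sequence shares content with its parent buffer.
                CharSequence window = bases.subSequence(start, start + k);
                start++; // step one base at a time, so windows come out in base order
                return window;
            }
        };
    }

    public static void main(String[] args) {
        for (CharSequence kmer : new LazyKmers("ATGATC", 3)) {
            System.out.println(kmer); // ATG, TGA, GAT, ATC on separate lines
        }
    }
}
```
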
>>>
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
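
As a rough illustration of the Trie suggestion, the sketch below counts unique k-mers in a prefix tree. It is plain Java, not BioJava API: KmerTrie, addSequence and counts are hypothetical names, and the input is assumed to be a plain String per sequence. Each k-mer is a path of k nodes from the root, and the count is kept on the node where the path ends.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not BioJava API: counts k-mers in a prefix tree.
class KmerTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // occurrences of the k-mer that ends at this node
    }

    private final Node root = new Node();
    private final int k;

    KmerTrie(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    // Inserts every overlapping k-mer of seq, stepping one base at a time.
    void addSequence(String seq) {
        for (int start = 0; start + k <= seq.length(); start++) {
            Node node = root;
            for (int i = start; i < start + k; i++) {
                node = node.children.computeIfAbsent(seq.charAt(i), c -> new Node());
            }
            node.count++;
        }
    }

    // Returns each distinct k-mer with its count, in alphabetical order.
    Map<String, Integer> counts() {
        Map<String, Integer> out = new LinkedHashMap<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, Map<String, Integer> out) {
        if (node.count > 0) {
            out.put(prefix.toString(), node.count);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie(3);
        trie.addSequence("ATGATGA");
        System.out.println(trie.counts()); // prints {ATG=2, GAT=1, TGA=2}
    }
}
```

For per-sequence counts from a FASTA file, one trie per record would be one option; a shared trie would give counts across the whole file instead.
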
>>>
>>> Hope this helps,
>>>
>>> Andy
>>>
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had a quick question: Does BioJava have a method to generate k-mers or
>>>> k-mer counting in a given FASTA sequence / file? Basically, I want k-mer
>>>> counts for every sequence in a FASTA file. If something like this exists it
>>>> would save me some time to write the code.
>>>>
>>>> Thanks,
>>>>
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates Ensembl Genomes Engineer
>>> EMBL-EBI Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/