[Biojava-l] K-mers

jitesh dundas jbdundas at gmail.com
Fri Oct 29 10:04:35 UTC 2010


You are right again, my friend. That would definitely hang my machine
during the XML file parsing.

This is about sequence alignment and related modules.

I will look at this today and send a fix for it. I hope you can help.

PS: What about pattern matching in sequences? Would that be interesting to
have in BioJava 3?

Regards,
JD

On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> Okay couple of points here:
>
> 1). Which biojava3 module? This sounds like something for the genomic module
> rather than core
>
> 2). It'll need some more work. I'm not happy about using the
> WindowedSequenceView in its current state. I think an alteration to avoid it
> making Lists would be a good idea (plus recent developments in the API as to
> its main use means this is a viable change). Also it should return the
> overlapping ones in base order i.e. 1->3, 2->4 not 1->3, 4->6
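
The ordering Andy asks for can be illustrated without any BioJava classes. The following is only a sketch on plain integers: the class name WindowOrder and the method windows are hypothetical, and coordinates are taken to be 1-based and inclusive, as in the "1->3, 2->4" notation above.

```java
import java.util.ArrayList;
import java.util.List;

class WindowOrder {
    /**
     * Returns 1-based inclusive {start, end} pairs for windows of size k,
     * advancing the start position by step each time.
     */
    static List<int[]> windows(int length, int k, int step) {
        List<int[]> out = new ArrayList<>();
        for (int start = 1; start + k - 1 <= length; start += step) {
            out.add(new int[] {start, start + k - 1});
        }
        return out;
    }

    public static void main(String[] args) {
        // step = 1: overlapping windows in base order (1->3, 2->4, 3->5, 4->6)
        for (int[] w : windows(6, 3, 1)) {
            System.out.println(w[0] + "->" + w[1]);
        }
        // step = k: non-overlapping windows (1->3, 4->6)
        for (int[] w : windows(6, 3, 3)) {
            System.out.println(w[0] + "->" + w[1]);
        }
    }
}
```

With a step of 1 the windows come out in base order; a step equal to the window size gives the non-overlapping case.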
>
> Comments?
>
> Andy
>
> On 29 Oct 2010, at 10:12, jitesh dundas wrote:
>
>> Dear Friends,
>>
>> Thanks to Vishal & Andy for this. I actually needed this code too.
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to BioJava 3?
>>
>> Thanks again.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at the moment. However it is possible to do it with BioJava3:
>>>
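>>> // Imports assume the BioJava 3.0 package layout (org.biojava3.*)
>>> import java.util.ArrayList;
>>> import java.util.List;
>>> import org.biojava3.core.sequence.DNASequence;
>>> import org.biojava3.core.sequence.compound.NucleotideCompound;
>>> import org.biojava3.core.sequence.template.Sequence;
>>> import org.biojava3.core.sequence.template.SequenceView;
>>> import org.biojava3.core.sequence.views.WindowedSequence;
>>>
>>> public static void main(String[] args) {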
>>>    DNASequence d = new DNASequence("ATGATC");
>>>    System.out.println("Non-Overlap");
>>>    nonOverlap(d);
>>>    System.out.println("Overlap");
>>>    overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>    List<WindowedSequence<NucleotideCompound>> l =
>>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>    for(int i=1; i<=KMER; i++) {
>>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>                i, d.getLength());
>>>        WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>        l.add(w);
>>>    }
>>>
>>>    //Will return ATG, ATC, TGA & GAT
>>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>>        for(List<NucleotideCompound> subList: w) {
>>>            System.out.println(subList);
>>>        }
>>>    }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>    WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>>    //Will return ATG & ATC
>>>    for(List<NucleotideCompound> subList: w) {
>>>        System.out.println(subList);
>>>    }
>>> }
>>>
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds, so k-mer generation can/will be a memory-intensive operation.
>>> This does not mean it has to be, since sub sequences are thin wrappers
>>> around an underlying sequence. Also the overlap solution is non-optimal
>>> since it iterates through each window in turn rather than stepping through
>>> the sequence one base at a time (hence why we get ATG & ATC before TGA).
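
A minimal sketch of the alternative described above: step through the sequence one base at a time and hand out thin views instead of copied lists. It works on a plain CharSequence rather than a BioJava Sequence, and the class name KmerViews is hypothetical. java.nio.CharBuffer.wrap(CharSequence, int, int) is used because it wraps the underlying characters rather than copying them.

```java
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

class KmerViews implements Iterable<CharSequence> {
    private final CharSequence seq;
    private final int k;

    KmerViews(CharSequence seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.seq = seq;
        this.k = k;
    }

    @Override
    public Iterator<CharSequence> iterator() {
        return new Iterator<CharSequence>() {
            private int start = 0; // 0-based offset of the next window

            @Override
            public boolean hasNext() {
                return start + k <= seq.length();
            }

            @Override
            public CharSequence next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Read-only view over [start, start + k); no characters are copied
                CharSequence view = CharBuffer.wrap(seq, start, start + k);
                start++; // advance by one base, so windows overlap in base order
                return view;
            }
        };
    }

    public static void main(String[] args) {
        // Prints ATG, TGA, GAT, ATC (one per line)
        for (CharSequence kmer : new KmerViews("ATGATC", 3)) {
            System.out.println(kmer);
        }
    }
}
```

Only one window exists at a time, so memory use stays flat regardless of sequence length; a caller that needs to keep a k-mer can call toString() on the view.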
>>>
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
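
As a rough illustration of the Trie idea, the sketch below counts overlapping k-mers in a prefix tree over the upper-case alphabet A, C, G, T. It is not BioJava code: the class KmerTrie and its methods are hypothetical, it works on plain Strings, and windows containing any other character are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

class KmerTrie {
    private static final String ALPHABET = "ACGT";

    private static final class Node {
        final Node[] next = new Node[ALPHABET.length()];
        int count; // how many times the k-mer ending at this node was seen
    }

    private final Node root = new Node();

    /** Adds every overlapping k-mer of seq; windows with a non-ACGT character are skipped. */
    void addAll(String seq, int k) {
        for (int start = 0; start + k <= seq.length(); start++) {
            Node node = root;
            for (int i = start; i < start + k; i++) {
                int idx = ALPHABET.indexOf(seq.charAt(i));
                if (idx < 0) {
                    node = null; // unknown character: abandon this window
                    break;
                }
                if (node.next[idx] == null) {
                    node.next[idx] = new Node();
                }
                node = node.next[idx];
            }
            if (node != null) {
                node.count++;
            }
        }
    }

    /** Returns how often kmer was added, or 0 if it was never seen. */
    int count(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            int idx = ALPHABET.indexOf(kmer.charAt(i));
            if (idx < 0 || node.next[idx] == null) {
                return 0;
            }
            node = node.next[idx];
        }
        return node.count;
    }

    /** Returns the distinct k-mers seen so far, in A < C < G < T order. */
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix.toString());
        }
        for (int i = 0; i < node.next.length; i++) {
            if (node.next[i] != null) {
                prefix.append(ALPHABET.charAt(i));
                collect(node.next[i], prefix, out);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }
}
```

For example, after addAll("ATGATGA", 3) the trie holds ATG and TGA with a count of 2 each and GAT with a count of 1. K-mers that share a prefix share nodes, which is where the memory saving over a flat list comes from.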
>>>
>>> Hope this helps,
>>>
>>> Andy
>>>
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file. If something like this exists
>>>> it would save me some time to write the code.
>>>>
>>>> Thanks,
>>>>
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


