[Biojava-l] K-mers

jitesh dundas jbdundas at gmail.com
Fri Oct 29 09:12:53 UTC 2010


Dear Friends,

Thanks to Vishal & Andy for this. I actually needed this code too.
Vishal, I think Andy's suggestions may be a good option to include in
BioJava 3. Would you like to add this to BioJava 3?

Thanks again.

Regards,
Jitesh Dundas

On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Vishal,
>
> As far as I am aware there is nothing which will generate them in BioJava at
> the moment. However it is possible to do it with BioJava3:
>
> public static void main(String[] args) {
>     DNASequence d = new DNASequence("ATGATC");
>     System.out.println("Non-Overlap");
>     nonOverlap(d);
>     System.out.println("Overlap");
>     overlap(d);
> }
>
> public static final int KMER = 3;
>
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>     List<WindowedSequence<NucleotideCompound>> l =
>             new ArrayList<WindowedSequence<NucleotideCompound>>();
>     for(int i=1; i<=KMER; i++) {
>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>                 i, d.getLength());
>         WindowedSequence<NucleotideCompound> w =
>             new WindowedSequence<NucleotideCompound>(sub, KMER);
>         l.add(w);
>     }
>
>     //Will print ATG, ATC, TGA & GAT (all windows at one offset, then the next)
>     for(WindowedSequence<NucleotideCompound> w: l) {
>         for(List<NucleotideCompound> subList: w) {
>             System.out.println(subList);
>         }
>     }
> }
>
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>     WindowedSequence<NucleotideCompound> w =
>             new WindowedSequence<NucleotideCompound>(d, KMER);
>     //Will print ATG & ATC
>     for(List<NucleotideCompound> subList: w) {
>         System.out.println(subList);
>     }
> }
>
> The disadvantage of all of these solutions is that they generate lists of
> Compounds so kmer generation can/will be a memory intensive operation. This
> does not mean it has to be, since sub sequences are thin wrappers around an
> underlying sequence. Also the overlap solution is non-optimal since it
> iterates through each windowed sequence in full rather than stepping through
> the sequence one base at a time (hence why we get ATG & ATC before TGA).
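For comparison, here is a minimal sketch of the "step through one base at a time" idea. It works on a plain String rather than the BioJava3 Sequence classes, and the class and method names (KmerSketch, overlappingKmers, countKmers) are hypothetical, not part of BioJava. It yields overlapping k-mers in sequence order and tallies them, which is roughly what Vishal asked for:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of BioJava.
public class KmerSketch {

    // Returns every overlapping k-mer of seq, in sequence order.
    public static List<String> overlappingKmers(String seq, int k) {
        List<String> kmers = new ArrayList<String>();
        // Slide a window of width k along the sequence one base at a time.
        for (int i = 0; i + k <= seq.length(); i++) {
            kmers.add(seq.substring(i, i + k));
        }
        return kmers;
    }

    // Counts how often each overlapping k-mer occurs; keys keep first-seen order.
    public static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String kmer : overlappingKmers(seq, k)) {
            Integer seen = counts.get(kmer);
            counts.put(kmer, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints [ATG, TGA, GAT, ATC]
        System.out.println(overlappingKmers("ATGATC", 3));
        // prints {ATG=1, TGA=1, GAT=1, ATC=1}
        System.out.println(countKmers("ATGATC", 3));
    }
}
```

Note that this still copies each k-mer into its own String, so it does not address the memory concern above; it only fixes the ordering.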
>
> As for unique k-mers that's something which would require a bit more
> engineering & would be better suited to a solution built around a Trie
> (prefix tree).
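A rough idea of what the Trie approach might look like, again on plain Strings and with hypothetical names (KmerTrie, fromSequence and so on are my own, nothing here is BioJava API). Each k-mer is inserted base by base, and the node where it ends keeps a count, so shared prefixes are stored only once:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical prefix tree for k-mers, not part of BioJava.
public class KmerTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        int count; // how many times a k-mer ending at this node was added
    }

    private final Node root = new Node();
    private int unique; // number of distinct k-mers added so far

    // Inserts one k-mer, creating child nodes as needed.
    public void add(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            char base = kmer.charAt(i);
            Node next = node.children.get(base);
            if (next == null) {
                next = new Node();
                node.children.put(base, next);
            }
            node = next;
        }
        if (node.count == 0) {
            unique++;
        }
        node.count++;
    }

    // Returns how many times the k-mer was added, or 0 if never seen.
    public int count(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length() && node != null; i++) {
            node = node.children.get(kmer.charAt(i));
        }
        return node == null ? 0 : node.count;
    }

    public int uniqueCount() {
        return unique;
    }

    // Adds every overlapping k-mer of seq to a new trie.
    public static KmerTrie fromSequence(String seq, int k) {
        KmerTrie trie = new KmerTrie();
        for (int i = 0; i + k <= seq.length(); i++) {
            trie.add(seq.substring(i, i + k));
        }
        return trie;
    }

    public static void main(String[] args) {
        KmerTrie trie = fromSequence("ATGATG", 3);
        System.out.println(trie.uniqueCount()); // prints 3 (ATG, TGA, GAT)
        System.out.println(trie.count("ATG"));  // prints 2
    }
}
```

A real implementation would probably use a fixed four-slot child array per node instead of a HashMap, but that is a design choice I have not measured.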
>
> Hope this helps,
>
> Andy
>
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>
>> Hi All,
>>
>> I had a quick question: Does Biojava have a method to generate k-mers or
>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>> counts for every sequence in a fasta file. If something like this exists it
>> would save me some time to write the code.
>>
>> Thanks,
>>
>> Vishal
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/