[Biojava-l] K-mers

Andy Yates ayates at ebi.ac.uk
Fri Oct 29 08:12:09 UTC 2010

Hi Vishal,

As far as I am aware, nothing in BioJava will generate them at the moment. However, it is possible to do it with BioJava3:

// Needs java.util.ArrayList and java.util.List, plus the BioJava3 core
// sequence classes (DNASequence, Sequence, SequenceView, WindowedSequence,
// NucleotideCompound).

public static void main(String[] args) {
    DNASequence d = new DNASequence("ATGATC");
    overlap(d);
    nonOverlap(d);
}

public static final int KMER = 3;

//Generate triplets overlapping
public static void overlap(Sequence<NucleotideCompound> d) {
    List<WindowedSequence<NucleotideCompound>> l =
            new ArrayList<WindowedSequence<NucleotideCompound>>();
    //One windowed view per starting offset (sequence positions are 1-based)
    for(int i=1; i<=KMER; i++) {
        SequenceView<NucleotideCompound> sub = d.getSubSequence(
                i, d.getLength());
        WindowedSequence<NucleotideCompound> w =
            new WindowedSequence<NucleotideCompound>(sub, KMER);
        l.add(w);
    }

    //Will return ATG, ATC, TGA & GAT
    for(WindowedSequence<NucleotideCompound> w: l) {
        for(List<NucleotideCompound> subList: w) {
            //Loop body assumed: print each window
            System.out.println(subList);
        }
    }
}

//Generate triplet Compound lists non-overlapping
public static void nonOverlap(Sequence<NucleotideCompound> d) {
    WindowedSequence<NucleotideCompound> w =
            new WindowedSequence<NucleotideCompound>(d, KMER);
    //Will return ATG & ATC
    for(List<NucleotideCompound> subList: w) {
        //Loop body assumed: print each window
        System.out.println(subList);
    }
}

The disadvantage of all of these solutions is that they generate lists of Compounds, so k-mer generation can be a memory-intensive operation. It does not have to be, since sub-sequences are thin wrappers around an underlying sequence. Also, the overlap solution is non-optimal: it iterates through each window in turn (one per starting offset) rather than stepping along the sequence one base at a time, which is why we get ATG & ATC before TGA.
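
To illustrate what stepping through one base at a time would look like, here is a minimal sketch in plain Java. It works on a String rather than a BioJava Sequence, and SlidingKmers and kmers are hypothetical names of my own, not BioJava API. It also builds one String per k-mer, so it does not address the memory point above; it only shows the ordering.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava: operates on a plain String.
final class SlidingKmers {

    // Returns every overlapping k-mer of seq in positional order.
    static List<String> kmers(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<String> out = new ArrayList<String>();
        // Advance the window start one base at a time and stop once
        // fewer than k bases remain.
        for (int start = 0; start + k <= seq.length(); start++) {
            out.add(seq.substring(start, start + k));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [ATG, TGA, GAT, ATC]
        System.out.println(kmers("ATGATC", 3));
    }
}
```

With this ordering TGA comes straight after ATG, because the window moves along the sequence instead of exhausting one starting offset before moving to the next.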

As for unique k-mers, that would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree).
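
A rough sketch of the Trie idea, again in plain Java on Strings and not using any BioJava classes. KmerTrie, addAll and counts are hypothetical names; each node maps a base to a child, and the node reached after k bases holds the count for that k-mer.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not a BioJava class.
final class KmerTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<Character, Node>();
        int count; // times the k-mer ending at this node was seen
    }

    private final Node root = new Node();
    private final int k;

    KmerTrie(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    // Inserts every overlapping k-mer of seq into the trie.
    void addAll(String seq) {
        for (int start = 0; start + k <= seq.length(); start++) {
            Node node = root;
            for (int i = start; i < start + k; i++) {
                char base = seq.charAt(i);
                Node next = node.children.get(base);
                if (next == null) {
                    next = new Node();
                    node.children.put(base, next);
                }
                node = next;
            }
            node.count++;
        }
    }

    // Walks the trie and returns each distinct k-mer with its count,
    // sorted alphabetically.
    Map<String, Integer> counts() {
        Map<String, Integer> out = new TreeMap<String, Integer>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, Map<String, Integer> out) {
        if (node.count > 0) {
            out.put(prefix.toString(), node.count);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie(3);
        trie.addAll("ATGATG");
        // Prints {ATG=2, GAT=1, TGA=1}
        System.out.println(trie.counts());
    }
}
```

For per-sequence counts from a FASTA file you would create one KmerTrie per record; k-mers sharing a prefix share nodes, which is where the saving over storing every k-mer separately would come from.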

Hope this helps,


On 28 Oct 2010, at 18:40, Vishal Thapar wrote:

> Hi All,
> I had a quick question: Does Biojava have a method to generate k-mers or
> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> counts for every sequence in a fasta file. If something like this exists it
> would save me some time to write the code.
> Thanks,
> Vishal

Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
