[Biojava-l] K-mers
Andy Yates
ayates at ebi.ac.uk
Fri Oct 29 08:12:09 UTC 2010
Hi Vishal,
As far as I am aware there is nothing in BioJava that generates k-mers directly at the moment. However, it is possible to do it with BioJava3:
import java.util.ArrayList;
import java.util.List;

import org.biojava3.core.sequence.DNASequence;
import org.biojava3.core.sequence.compound.NucleotideCompound;
import org.biojava3.core.sequence.template.Sequence;
import org.biojava3.core.sequence.template.SequenceView;
import org.biojava3.core.sequence.views.WindowedSequence;

public class KmerExample {

    public static void main(String[] args) {
        DNASequence d = new DNASequence("ATGATC");
        System.out.println("Non-Overlap");
        nonOverlap(d);
        System.out.println("Overlap");
        overlap(d);
    }

    public static final int KMER = 3;

    //Generate triplets overlapping
    public static void overlap(Sequence<NucleotideCompound> d) {
        List<WindowedSequence<NucleotideCompound>> l =
            new ArrayList<WindowedSequence<NucleotideCompound>>();
        //One windowed view per starting offset (positions are 1-based)
        for (int i = 1; i <= KMER; i++) {
            SequenceView<NucleotideCompound> sub = d.getSubSequence(
                i, d.getLength());
            WindowedSequence<NucleotideCompound> w =
                new WindowedSequence<NucleotideCompound>(sub, KMER);
            l.add(w);
        }

        //Prints ATG, ATC, TGA & GAT (in that order)
        for (WindowedSequence<NucleotideCompound> w : l) {
            for (List<NucleotideCompound> subList : w) {
                System.out.println(subList);
            }
        }
    }

    //Generate triplet Compound lists non-overlapping
    public static void nonOverlap(Sequence<NucleotideCompound> d) {
        WindowedSequence<NucleotideCompound> w =
            new WindowedSequence<NucleotideCompound>(d, KMER);

        //Prints ATG & ATC
        for (List<NucleotideCompound> subList : w) {
            System.out.println(subList);
        }
    }
}
The disadvantage of these solutions is that they generate lists of Compounds, so k-mer generation can be a memory-intensive operation. This does not mean it has to be, since sub-sequences are thin wrappers around an underlying sequence. Also, the overlap solution is non-optimal: it iterates through each offset's windows in turn rather than stepping through the sequence one base at a time, which is why we get ATG & ATC before TGA.
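For comparison, here is a minimal sketch of stepping through one base at a time so the k-mers come out in positional order. It deliberately works on a plain String rather than the BioJava3 classes, so SlidingKmers and kmers() are made-up names for illustration and not part of any BioJava API:

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingKmers {

    /** Returns every overlapping k-mer of seq, in positional order. */
    public static List<String> kmers(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<String> out = new ArrayList<>();
        // Advance the window start by one base per step; stop when the
        // window would run past the end of the sequence.
        for (int i = 0; i + k <= seq.length(); i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [ATG, TGA, GAT, ATC]
        System.out.println(kmers("ATGATC", 3));
    }
}
```

Whether this is cheaper on memory than the Compound lists depends on the JVM's substring implementation, so treat it as an ordering illustration rather than a tuned solution.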
As for unique k-mers that's something which would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree).
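A rough sketch of what the Trie idea could look like for counting k-mers, again on plain Strings. KmerTrie and its methods are hypothetical names invented for this example, nothing here exists in BioJava, and a real implementation would probably want a more compact node representation:

```java
import java.util.HashMap;
import java.util.Map;

public class KmerTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // how many times a k-mer ending at this node was added
    }

    private final Node root = new Node();
    private int unique; // number of distinct k-mers added so far

    /** Adds one k-mer, creating trie nodes along its path as needed. */
    public void add(String kmer) {
        Node n = root;
        for (int i = 0; i < kmer.length(); i++) {
            n = n.children.computeIfAbsent(kmer.charAt(i), c -> new Node());
        }
        if (n.count++ == 0) {
            unique++;
        }
    }

    /** Returns how often the k-mer was added, or 0 if never seen. */
    public int count(String kmer) {
        Node n = root;
        for (int i = 0; i < kmer.length() && n != null; i++) {
            n = n.children.get(kmer.charAt(i));
        }
        return n == null ? 0 : n.count;
    }

    public int uniqueCount() {
        return unique;
    }

    /** Builds a trie from every overlapping k-mer of seq. */
    public static KmerTrie of(String seq, int k) {
        KmerTrie t = new KmerTrie();
        for (int i = 0; i + k <= seq.length(); i++) {
            t.add(seq.substring(i, i + k));
        }
        return t;
    }

    public static void main(String[] args) {
        KmerTrie t = KmerTrie.of("ATGATGC", 3); // ATG, TGA, GAT, ATG, TGC
        System.out.println(t.uniqueCount());    // 4
        System.out.println(t.count("ATG"));     // 2
    }
}
```

For per-sequence counts from a FASTA file you would build one such structure per record; shared prefixes between k-mers are stored only once, which is the main attraction over a flat list.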
Hope this helps,
Andy
On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> Hi All,
>
> I had a quick question: Does Biojava have a method to generate k-mers or
> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> counts for every sequence in a fasta file. If something like this exists it
> would save me some time to write the code.
>
> Thanks,
>
> Vishal
--
Andrew Yates Ensembl Genomes Engineer
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/