[Biojava-l] K-mers

Mark Fortner phidias51 at gmail.com
Fri Oct 29 16:56:45 UTC 2010


It might be useful to make the K-mer storage mechanism pluggable.  This
would allow a developer to use anything from a simple MultiMap to a NoSQL
key-value database to store K-mers.  You could plug in custom map
implementations to keep a count of the number of instances of each
K-mer that was found.  It might also be useful to be able to do set
operations on those K-mer collections.  You could use them to determine
which K-mers were present in a pathogen and not in a host.
http://www.ncbi.nlm.nih.gov/pubmed/20428334
http://www.ncbi.nlm.nih.gov/pubmed/16403026
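
As a rough sketch of what I mean by pluggable storage (plain Java, nothing
from BioJava; KmerStore, MapKmerStore and KmerSets are names made up purely
for illustration), a small interface could hide the backend, with an
in-memory map as the simplest implementation and set difference built on top:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical storage abstraction; not an existing BioJava type. */
interface KmerStore {
    void add(String kmer);      // record one occurrence of a k-mer
    int count(String kmer);     // occurrences recorded so far (0 if absent)
    Set<String> kmers();        // distinct k-mers currently stored
}

/** Simplest possible backend: an in-memory map from k-mer to count. */
class MapKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void add(String kmer) {
        counts.merge(kmer, 1, Integer::sum);
    }

    public int count(String kmer) {
        Integer c = counts.get(kmer);
        return c == null ? 0 : c;
    }

    public Set<String> kmers() {
        return Collections.unmodifiableSet(counts.keySet());
    }
}

class KmerSets {
    /** Adds every overlapping k-mer of seq to the given store. */
    static void index(String seq, int k, KmerStore store) {
        for (int i = 0; i + k <= seq.length(); i++) {
            store.add(seq.substring(i, i + k));
        }
    }

    /** K-mers present in a but absent from b (set difference). */
    static Set<String> onlyIn(KmerStore a, KmerStore b) {
        Set<String> result = new HashSet<String>(a.kmers());
        result.removeAll(b.kmers());
        return result;
    }

    public static void main(String[] args) {
        KmerStore pathogen = new MapKmerStore();
        KmerStore host = new MapKmerStore();
        index("ATGATC", 3, pathogen);   // toy sequences, not real genomes
        index("ATGAAA", 3, host);
        // Prints the 3-mers found only in the "pathogen": GAT and ATC
        // (a HashSet, so the printed order is not guaranteed).
        System.out.println(onlyIn(pathogen, host));
    }
}
```

A NoSQL-backed store would implement the same three methods, so the
indexing and set-operation code would not need to change.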

Cheers,

Mark

card.ly: <http://card.ly/phidias51>


On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com> wrote:

> Hi Andy,
>
> This is good to have. I feel that including it as part of core may not be
> necessary, but having it as part of the Genomic module in biojava3 would be
> nice. There is a project Bioinformatica
>
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>
> which does something similar, although not exactly. It counts the k-mers in
> a given fasta file, but it does not count k-mers for each sequence within
> the file, just all within a file. This is a good feature to have, especially
> if one is trying to find patterns within sequences, which is what I am
> trying to do. It would most certainly be helpful to have a k-mer counting
> algorithm that counts k-mer frequency for each sequence. The way to go would
> be to use suffix trees. Again, I don't know if biojava has a suffix tree API
> or not, since I haven't used Java in a while and am just switching back to
> it. A paper on using suffix trees to generate genome-wide k-mer frequencies
> is: http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.;
> the software is Tallymer). It would be some work to implement this in Java
> as a module for biojava3, but I can see that this will be helpful. Again,
> for small fasta files it might not be efficient to create a suffix tree, but
> for bigger files I think that might be the way to go.
>
> That's just my two cents. What do you think?
>
> -vishal
>
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>
> > Hi Vishal,
> >
> > As far as I am aware there is nothing which will generate them in BioJava
> > at the moment. However it is possible to do it with BioJava3:
> >
> > public static void main(String[] args) {
> >    DNASequence d = new DNASequence("ATGATC");
> >    System.out.println("Non-Overlap");
> >    nonOverlap(d);
> >    System.out.println("Overlap");
> >    overlap(d);
> > }
> >
> > public static final int KMER = 3;
> >
> > //Generate triplets overlapping
> > public static void overlap(Sequence<NucleotideCompound> d) {
> >    List<WindowedSequence<NucleotideCompound>> l =
> >            new ArrayList<WindowedSequence<NucleotideCompound>>();
> >    for(int i=1; i<=KMER; i++) {
> >        SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >                i, d.getLength());
> >        WindowedSequence<NucleotideCompound> w =
> >            new WindowedSequence<NucleotideCompound>(sub, KMER);
> >        l.add(w);
> >    }
> >
> >    //Will return ATG, ATC, TGA & GAT
> >    for(WindowedSequence<NucleotideCompound> w: l) {
> >        for(List<NucleotideCompound> subList: w) {
> >            System.out.println(subList);
> >        }
> >    }
> > }
> >
> > //Generate triplet Compound lists non-overlapping
> > public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >    WindowedSequence<NucleotideCompound> w =
> >            new WindowedSequence<NucleotideCompound>(d, KMER);
> >    //Will return ATG & ATC
> >    for(List<NucleotideCompound> subList: w) {
> >        System.out.println(subList);
> >    }
> > }
> >
> > The disadvantage of all of these solutions is that they generate lists of
> > Compounds, so k-mer generation can/will be a memory-intensive operation.
> > This does not mean it has to be, since sub sequences are thin wrappers
> > around an underlying sequence. Also, the overlap solution is non-optimal
> > since it iterates through each window in full rather than stepping through
> > the sequence, delegating onto each base in turn (hence why we get ATG &
> > ATC before TGA).
> >
> > As for unique k-mers that's something which would require a bit more
> > engineering & would be better suited to a solution built around a Trie
> > (prefix tree).
> >
> > Hope this helps,
> >
> > Andy
> >
> > On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >
> > > Hi All,
> > >
> > > I had a quick question: Does Biojava have a method to generate k-mers
> > > or K-mer counting in a given Fasta Sequence / File? Basically, I want
> > > k-mer counts for every sequence in a fasta file. If something like this
> > > exists it would save me some time to write the code.
> > >
> > > Thanks,
> > >
> > > Vishal
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> > --
> > Andrew Yates                   Ensembl Genomes Engineer
> > EMBL-EBI                       Tel: +44-(0)1223-492538
> > Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> > Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >
>
>
> --
> *Vishal Thapar, Ph.D.*
> *Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724*
>
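
A footnote on the per-sequence counting discussed in the thread above: the
following is only a minimal sketch, assuming plain Java strings rather than
BioJava sequences and a very simple FASTA reader that treats every line
starting with '>' as a header. The class and method names are hypothetical.
It steps along each sequence one base at a time, so every overlapping k-mer
is visited in positional order, and it keeps a separate count table for each
record in the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class PerSequenceKmerCounter {

    /** Counts overlapping k-mers in one sequence, advancing one base at a time. */
    static Map<String, Integer> count(CharSequence seq, int k) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.subSequence(i, i + k).toString(), 1, Integer::sum);
        }
        return counts;
    }

    /** Reads FASTA text and returns header -> (k-mer -> count), in file order. */
    static Map<String, Map<String, Integer>> countFasta(BufferedReader in, int k)
            throws IOException {
        Map<String, Map<String, Integer>> result =
                new LinkedHashMap<String, Map<String, Integer>>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.startsWith(">")) {
                // A new record starts: flush the previous one, if any.
                if (header != null) {
                    result.put(header, count(seq, k));
                }
                header = line.substring(1);
                seq.setLength(0);
            } else {
                seq.append(line.toUpperCase());
            }
        }
        if (header != null) {
            result.put(header, count(seq, k));   // flush the last record
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">seq1\nATGATC\n>seq2\nAAAA\n";
        // Prints {seq1={ATC=1, ATG=1, GAT=1, TGA=1}, seq2={AAA=2}}
        System.out.println(countFasta(new BufferedReader(new StringReader(fasta)), 3));
    }
}
```

This keeps one sequence in memory at a time, which is fine for small files;
for genome-scale input the suffix tree approach from the paper is a
different technique and is not what is shown here.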

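For the unique K-mer question, here is an equally rough sketch of the trie
(prefix tree) idea mentioned in the thread, assuming a fixed ACGT alphabet
(any other character is rejected). KmerTrie is a made-up name, and this is
the textbook prefix tree, not the suffix-tree approach used by Tallymer.

```java
class KmerTrie {
    private static final String ALPHABET = "ACGT";

    private static final class Node {
        final Node[] next = new Node[ALPHABET.length()];
        int count;   // how many times a k-mer ending at this node was inserted
    }

    private final Node root = new Node();
    private int distinct;

    /** Inserts one k-mer; returns true if it had not been seen before. */
    boolean insert(CharSequence kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            int idx = ALPHABET.indexOf(kmer.charAt(i));
            if (idx < 0) {
                throw new IllegalArgumentException("Unsupported base: " + kmer.charAt(i));
            }
            if (node.next[idx] == null) {
                node.next[idx] = new Node();
            }
            node = node.next[idx];
        }
        node.count++;
        if (node.count == 1) {
            distinct++;
            return true;
        }
        return false;
    }

    /** Number of times the k-mer was inserted (0 if never). */
    int count(CharSequence kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length() && node != null; i++) {
            int idx = ALPHABET.indexOf(kmer.charAt(i));
            node = idx < 0 ? null : node.next[idx];
        }
        return node == null ? 0 : node.count;
    }

    /** Number of distinct k-mers inserted so far. */
    int distinctKmers() {
        return distinct;
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie();
        String seq = "ATGATGA";
        int k = 3;
        for (int i = 0; i + k <= seq.length(); i++) {
            trie.insert(seq.substring(i, i + k));
        }
        // ATG, TGA, GAT, ATG, TGA -> 3 distinct 3-mers, ATG seen twice
        System.out.println(trie.distinctKmers() + " distinct, ATG x" + trie.count("ATG"));
    }
}
```

K-mers that share a prefix share nodes, which is where the memory saving
over a flat list of Compound lists would come from.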

