[Biojava-l] K-mers

Fri Oct 29 18:48:53 UTC 2010

I was thinking more along the lines of using something that implements the
Map interface.  This would allow a developer to easily unit test the code
without having to load the data for a genome.  You would also be able to
provide different implementations to suit your needs.  If you wanted to use
a suffix tree as the underlying implementation, that would be OK, but you
would have other options as well.
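
As a rough sketch of what I mean, assuming plain Strings rather than BioJava
sequences: KmerCounter and count() are made-up names for illustration, not
BioJava API, and the code uses Java 8 syntax. The caller hands in the Map, so
the storage can be swapped without touching the counting loop.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

// Hypothetical helper, not part of BioJava.
public class KmerCounter {

    // Counts overlapping k-mers in one sequence. The caller supplies the Map,
    // so the storage (HashMap, TreeMap, a database-backed Map, ...) is pluggable.
    public static Map<String, Integer> count(String sequence, int k,
                                             Supplier<Map<String, Integer>> storage) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = storage.get();
        // Step through the sequence one base at a time.
        for (int i = 0; i + k <= sequence.length(); i++) {
            counts.merge(sequence.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sorted storage: prints {ATC=1, ATG=1, GAT=1, TGA=1}
        System.out.println(count("ATGATC", 3, TreeMap::new));
        // Hash storage: ATG occurs twice in ATGATG, so this prints 2
        System.out.println(count("ATGATG", 3, HashMap::new).get("ATG"));
    }
}
```

A unit test can pass in a small in-memory Map, and calling count() once per
FASTA record would give per-sequence counts.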

Cheers,

Mark

card.ly: <http://card.ly/phidias51>

On Fri, Oct 29, 2010 at 11:35 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> So if it's a suffix tree, that's quite a fixed data structure, so
> developing a pluggable mechanism there would be hard. I think there also
> has to be a limit as to what we can sensibly do. If people want to
> contribute this kind of work, though, then it would all be very well
> received (with the corresponding test environment/cases, of course).
>
> Cheers,
>
> Andy
>
> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>
> > It might be useful to make the K-mer storage mechanism pluggable.  This
> > would allow a developer to use anything from a simple MultiMap to a NoSQL
> > key-value database to store K-mers.  You could plug in custom map
> > implementations to keep a count of the number of instances of particular
> > K-mers that were found.  It might also be useful to be able to do set
> > operations on those K-mer collections.  You could use it to determine
> > which K-mers were present in a pathogen and not in a host.
> > http://www.ncbi.nlm.nih.gov/pubmed/20428334
> > http://www.ncbi.nlm.nih.gov/pubmed/16403026
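
A minimal sketch of the pathogen-versus-host set operation described above,
assuming plain Strings rather than BioJava sequences. KmerSets, kmers() and
onlyIn() are hypothetical names, and the two input strings are toy data.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of BioJava.
public class KmerSets {

    // Collects the distinct overlapping k-mers of a sequence.
    public static Set<String> kmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Set<String> result = new TreeSet<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            result.add(sequence.substring(i, i + k));
        }
        return result;
    }

    // K-mers present in the first sequence but absent from the second.
    public static Set<String> onlyIn(String first, String second, int k) {
        Set<String> difference = kmers(first, k);
        difference.removeAll(kmers(second, k));
        return difference;
    }

    public static void main(String[] args) {
        // Toy strings standing in for pathogen and host sequences.
        // Prints [ATC, GAT]
        System.out.println(onlyIn("ATGATC", "ATGAAA", 3));
    }
}
```

Because the collections are ordinary java.util.Set instances, intersection
(retainAll) and union (addAll) come for free.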
> >
> > Cheers,
> >
> > Mark
> >
> > card.ly: <http://card.ly/phidias51>
> >
> >
> > On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com> wrote:
> >
> >> Hi Andy,
> >>
> >> This is good to have. I feel that including it as part of core may not
> >> be necessary, but having it as part of the Genomic module in biojava3
> >> would be nice. There is a project, Bioinformatica
> >> (http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence),
> >> which does something similar, although not exactly. It counts the k-mers
> >> in a given fasta file, but only across the whole file, not for each
> >> sequence within it. Per-sequence counting is a good feature to have,
> >> especially if one is trying to find patterns within sequences, which is
> >> what I am trying to do. It would most certainly be helpful to have a
> >> k-mer counting algorithm that counts k-mer frequency for each sequence.
> >> The way to go would be to use suffix trees. Again, I don't know whether
> >> biojava has a suffix tree API, since I haven't used Java in a while and
> >> am just switching back to it. A paper on using suffix trees to generate
> >> genome-wide k-mer frequencies is
> >> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the
> >> software is Tallymer). It would be some work to implement this in Java
> >> as a module for biojava3, but I can see that it would be helpful. Again,
> >> for small fasta files it might not be efficient to create a suffix tree,
> >> but for bigger files I think that might be the way to go.
> >>
> >> That's just my two cents. What do you think?
> >>
> >> -vishal
> >>
> >> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>
> >>> Hi Vishal,
> >>>
> >>> As far as I am aware there is nothing which will generate them in
> >>> BioJava at the moment. However, it is possible to do it with BioJava3:
> >>>
> >>> public static void main(String[] args) {
> >>>   DNASequence d = new DNASequence("ATGATC");
> >>>   System.out.println("Non-Overlap");
> >>>   nonOverlap(d);
> >>>   System.out.println("Overlap");
> >>>   overlap(d);
> >>> }
> >>>
> >>> public static final int KMER = 3;
> >>>
> >>> //Generate triplets overlapping
> >>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>   List<WindowedSequence<NucleotideCompound>> l =
> >>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>   for(int i=1; i<=KMER; i++) {
> >>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>               i, d.getLength());
> >>>       WindowedSequence<NucleotideCompound> w =
> >>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>       l.add(w);
> >>>   }
> >>>
> >>>   //Will return ATG, ATC, TGA & GAT
> >>>   for(WindowedSequence<NucleotideCompound> w: l) {
> >>>       for(List<NucleotideCompound> subList: w) {
> >>>           System.out.println(subList);
> >>>       }
> >>>   }
> >>> }
> >>>
> >>> //Generate triplet Compound lists non-overlapping
> >>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>   WindowedSequence<NucleotideCompound> w =
> >>>           new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>   //Will return ATG & ATC
> >>>   for(List<NucleotideCompound> subList: w) {
> >>>       System.out.println(subList);
> >>>   }
> >>> }
> >>>
> >>> The disadvantage of all of these solutions is that they generate lists
> >>> of Compounds, so k-mer generation can/will be a memory-intensive
> >>> operation. This does not mean it has to be, since sub sequences are thin
> >>> wrappers around an underlying sequence. Also, the overlap solution is
> >>> non-optimal since it iterates through each window rather than stepping
> >>> through the sequence and delegating onto each base in turn (hence why
> >>> we get ATG & ATC before TGA).
> >>>
> >>> As for unique k-mers that's something which would require a bit more
> >>> engineering & would be better suited to a solution built around a Trie
> >>> (prefix tree).
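
A minimal sketch of the trie idea mentioned above, assuming plain Strings
rather than BioJava Compounds. KmerTrie and its methods are hypothetical
names, and reading "unique" as "seen exactly once" is an assumption; if
distinct k-mers are wanted instead, every node with a count above zero
qualifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical prefix tree for k-mers, not part of BioJava.
public class KmerTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // occurrences of the k-mer that ends at this node
    }

    private final Node root = new Node();

    // Inserts every overlapping k-mer of the sequence into the trie.
    public void addAll(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        for (int i = 0; i + k <= sequence.length(); i++) {
            Node node = root;
            for (int j = i; j < i + k; j++) {
                node = node.children.computeIfAbsent(sequence.charAt(j), c -> new Node());
            }
            node.count++;
        }
    }

    // Returns the k-mers that were seen exactly once, in alphabetical order.
    public List<String> uniqueKmers() {
        List<String> result = new ArrayList<>();
        collect(root, new StringBuilder(), result);
        return result;
    }

    // Depth-first walk that rebuilds each k-mer from the path to its node.
    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count == 1) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie();
        trie.addAll("ATGATG", 3);
        // ATG occurs twice, so this prints [GAT, TGA]
        System.out.println(trie.uniqueKmers());
    }
}
```

K-mers that share a prefix share nodes, which is where the memory saving
over a flat list of k-mers comes from.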
> >>>
> >>> Hope this helps,
> >>>
> >>> Andy
> >>>
> >>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I had a quick question: Does Biojava have a method to generate k-mers
> >>>> or do k-mer counting for a given Fasta sequence / file? Basically, I
> >>>> want k-mer counts for every sequence in a fasta file. If something like
> >>>> this exists, it would save me the time of writing the code.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Vishal
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>
> >>> --
> >>> Andrew Yates                   Ensembl Genomes Engineer
> >>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>>
> >>
> >>
> >> --
> >> *Vishal Thapar, Ph.D.*
> >> *Scientific informatics Analyst
> >> Cold Spring Harbor Lab
> >> Quick Bldg, Lowe Lab
> >> 1 Bungtown Road
> >> Cold Spring Harbor, NY - 11724*
> >>