[Biojava-l] K-mers

Fri Oct 29 19:34:11 UTC 2010

So we've got some basic k-mer work now in SVN. If you look in the class SequenceMixin there are two static methods for generating the two types of k-mers (overlapping & non-overlapping). It's not developed with Map storage in mind & I'll leave the door open there for anyone else to come in & develop it. The k-mers are also not unique across the sequence, but it's a start :)

Share & enjoy!

Andy

On 29 Oct 2010, at 19:50, jitesh dundas wrote:

> I agree, Andy. This has become standard functionality that
> scientists use these days. I am all for implementing it in BioJava3.
> Java isn't that efficient for such functionality, so we will surely
> need more effort compared to the same in Python/Perl.
> 
> Regards,
> Jitesh Dundas
> 
> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> So if it's a suffix tree, that's quite a fixed data structure, so
>> developing a pluggable mechanism there would be hard. I think there also
>> has to be a limit as to what we can sensibly do. If people want to
>> contribute this kind of work though then it will all be very well received
>> (with the corresponding test environment/cases of course).
>> 
>> Cheers,
>> 
>> Andy
>> 
>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>> 
>>> It might be useful to make the K-mer storage mechanism pluggable.  This
>>> would allow a developer to use anything from a simple MultiMap to a NoSQL
>>> key-value database to store K-mers.  You could plug in custom map
>>> implementations to allow you to keep a count of the number of instances of
>>> particular K-mers that were found.  It might also be useful to be able to do
>>> set operations on those K-mer collections.  You could use it to determine
>>> which K-mers were present in a pathogen and not in a host.
>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
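
To make the pluggable-storage idea concrete, here is a minimal sketch in plain Java. It works on Strings rather than BioJava sequences, and the names KmerStore, MapKmerStore and KmerSetDemo are hypothetical (nothing like them exists in BioJava as far as this thread indicates). A NoSQL-backed store would simply be another implementation of the same interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical storage abstraction; not part of BioJava. */
interface KmerStore {
    void add(String kmer);
    int count(String kmer);
    Set<String> kmers();
}

/** Simplest backing store: an in-memory map from k-mer to count. */
class MapKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<>();

    public void add(String kmer) {
        counts.merge(kmer, 1, Integer::sum);
    }

    public int count(String kmer) {
        return counts.getOrDefault(kmer, 0);
    }

    public Set<String> kmers() {
        return counts.keySet();
    }
}

class KmerSetDemo {
    /** Loads every overlapping k-mer of seq into the given store. */
    static KmerStore load(String seq, int k, KmerStore store) {
        for (int i = 0; i + k <= seq.length(); i++) {
            store.add(seq.substring(i, i + k));
        }
        return store;
    }

    /** K-mers present in a but absent from b (a set difference). */
    static Set<String> onlyIn(KmerStore a, KmerStore b) {
        Set<String> result = new TreeSet<>(a.kmers());
        result.removeAll(b.kmers());
        return result;
    }

    public static void main(String[] args) {
        // Toy stand-ins for a pathogen and a host sequence.
        KmerStore pathogen = load("ATGATC", 3, new MapKmerStore());
        KmerStore host = load("ATGAAA", 3, new MapKmerStore());
        System.out.println(onlyIn(pathogen, host)); // prints [ATC, GAT]
    }
}
```

The set difference only needs the key set, so any store that can enumerate its k-mers would work with onlyIn unchanged.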
>>> 
>>> Cheers,
>>> 
>>> Mark
>>> 
>>> card.ly: <http://card.ly/phidias51>
>>> 
>>> 
>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>> <vishalthapar at gmail.com> wrote:
>>> 
>>>> Hi Andy,
>>>> 
>>>> This is good to have. I feel that including it as part of core may not be
>>>> necessary, but having it as part of the Genomic module in biojava3 would be
>>>> nice. There is a project, Bioinformatica
>>>>
>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>
>>>> which does something similar although not exactly. It counts the k-mers in
>>>> a given fasta file, but it does not count k-mers for each sequence within
>>>> the file, just all within a file. This is a good feature to have,
>>>> especially if one is trying to find patterns within sequences, which is
>>>> what I am trying to do. It would most certainly be helpful to have a k-mer
>>>> counting algorithm that counts k-mer frequency for each sequence. The way
>>>> to go would be to use suffix trees. Again, I don't know if biojava has a
>>>> suffix tree API or not since I haven't used Java in a while and am just
>>>> switching back to it. A paper on using suffix trees to generate genome-wide
>>>> k-mer frequencies is:
>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.,
>>>> software is Tallymer). It would be some work to implement this in Java as a
>>>> module for biojava3, but I can see that this will be helpful. Again, for
>>>> small fasta files it might not be efficient to create a suffix tree, but
>>>> for bigger files I think that might be the way to go.
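
For small files, per-sequence counting can be sketched without a suffix tree at all. The following is a plain-Java illustration that swaps in an ordinary sorted map per record; it does not use BioJava's FASTA readers, and the class name FastaKmerCounter and its methods are hypothetical. It assumes plain FASTA text where each record starts with a ">" header line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class FastaKmerCounter {
    /** Counts overlapping k-mers in a single sequence. */
    static Map<String, Integer> count(String seq, int k) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns header -> (k-mer -> count), one entry per FASTA record. */
    static Map<String, Map<String, Integer>> countPerRecord(String fasta, int k) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        for (String raw : fasta.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith(">")) {
                // A new header closes the previous record, if any.
                if (header != null) {
                    result.put(header, count(seq.toString(), k));
                }
                header = line.substring(1);
                seq.setLength(0);
            } else if (!line.isEmpty()) {
                seq.append(line.toUpperCase());
            }
        }
        if (header != null) {
            result.put(header, count(seq.toString(), k));
        }
        return result;
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nATGATG\n>seq2\nAAAA\n";
        // prints {seq1={ATG=2, GAT=1, TGA=1}, seq2={AAA=2}}
        System.out.println(countPerRecord(fasta, 3));
    }
}
```

A map per record keeps memory proportional to the number of distinct k-mers in each sequence, which is where a suffix-tree approach such as the one in the paper above would be expected to pay off for genome-scale input.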
>>>> 
>>>> That's just my two cents. What do you think?
>>>> 
>>>> -vishal
>>>> 
>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> 
>>>>> Hi Vishal,
>>>>> 
>>>>> As far as I am aware there is nothing which will generate them in BioJava
>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>> 
>>>>> public static void main(String[] args) {
>>>>>  DNASequence d = new DNASequence("ATGATC");
>>>>>  System.out.println("Non-Overlap");
>>>>>  nonOverlap(d);
>>>>>  System.out.println("Overlap");
>>>>>  overlap(d);
>>>>> }
>>>>> 
>>>>> public static final int KMER = 3;
>>>>> 
>>>>> //Generate triplets overlapping
>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>  List<WindowedSequence<NucleotideCompound>> l =
>>>>>          new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>  for(int i=1; i<=KMER; i++) {
>>>>>      SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>              i, d.getLength());
>>>>>      WindowedSequence<NucleotideCompound> w =
>>>>>          new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>      l.add(w);
>>>>>  }
>>>>> 
>>>>>  //Will return ATG, ATC, TGA & GAT
>>>>>  for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>      for(List<NucleotideCompound> subList: w) {
>>>>>          System.out.println(subList);
>>>>>      }
>>>>>  }
>>>>> }
>>>>> 
>>>>> //Generate triplet Compound lists non-overlapping
>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>  WindowedSequence<NucleotideCompound> w =
>>>>>          new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>  //Will return ATG & ATC
>>>>>  for(List<NucleotideCompound> subList: w) {
>>>>>      System.out.println(subList);
>>>>>  }
>>>>> }
>>>>> 
>>>>> The disadvantage of all of these solutions is that they generate lists of
>>>>> Compounds, so k-mer generation can/will be a memory-intensive operation.
>>>>> This does not mean it has to be, since sub sequences are thin wrappers
>>>>> around an underlying sequence. Also the overlap solution is non-optimal
>>>>> since it iterates through each window rather than stepping through,
>>>>> delegating onto each base in turn (hence why we get ATG & ATC before TGA).
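
The "stepping through each base in turn" alternative can be sketched with plain Strings. This is only an illustration of the iteration order, not the BioJava implementation; the class SlidingKmers and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingKmers {
    /** Overlapping k-mers in sequence order: advance one base per step. */
    static List<String> overlapping(String seq, int k) {
        List<String> kmers = new ArrayList<>();
        for (int start = 0; start + k <= seq.length(); start++) {
            kmers.add(seq.substring(start, start + k));
        }
        return kmers;
    }

    /** Non-overlapping k-mers: advance k bases per step, drop any remainder. */
    static List<String> nonOverlapping(String seq, int k) {
        List<String> kmers = new ArrayList<>();
        for (int start = 0; start + k <= seq.length(); start += k) {
            kmers.add(seq.substring(start, start + k));
        }
        return kmers;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("ATGATC", 3));    // prints [ATG, TGA, GAT, ATC]
        System.out.println(nonOverlapping("ATGATC", 3)); // prints [ATG, ATC]
    }
}
```

Here the overlapping k-mers come out in positional order (ATG, TGA, GAT, ATC) rather than grouped by window offset.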
>>>>> 
>>>>> As for unique k-mers that's something which would require a bit more
>>>>> engineering & would be better suited to a solution built around a Trie
>>>>> (prefix tree).
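
As a rough illustration of the Trie idea, here is a small plain-Java prefix tree that stores each distinct k-mer once along with how often it was seen. The class KmerTrie is hypothetical and works on Strings rather than on BioJava Compounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KmerTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // number of inserted k-mers that end at this node
    }

    private final Node root = new Node();

    /** Inserts one k-mer, sharing nodes with any k-mer that has the same prefix. */
    void insert(String kmer) {
        Node node = root;
        for (char c : kmer.toCharArray()) {
            node = node.children.computeIfAbsent(c, key -> new Node());
        }
        node.count++;
    }

    /** Inserts every overlapping k-mer of the sequence. */
    void insertAll(String seq, int k) {
        for (int i = 0; i + k <= seq.length(); i++) {
            insert(seq.substring(i, i + k));
        }
    }

    /** Returns how many times the k-mer was inserted (0 if never). */
    int count(String kmer) {
        Node node = root;
        for (char c : kmer.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    /** Lists each distinct k-mer once, in lexicographic order. */
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie();
        trie.insertAll("ATGATG", 3);
        System.out.println(trie.uniqueKmers()); // prints [ATG, GAT, TGA]
        System.out.println(trie.count("ATG"));  // prints 2
    }
}
```

Because k-mers with a common prefix share nodes, the structure grows with the number of distinct prefixes rather than with the number of windows read.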
>>>>> 
>>>>> Hope this helps,
>>>>> 
>>>>> Andy
>>>>> 
>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>>>> counts for every sequence in a fasta file. If something like this exists
>>>>>> it would save me some time to write the code.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Vishal
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>> 
>>>>> --
>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>
>>>>
>>>> --
>>>> *Vishal Thapar, Ph.D.*
>>>> *Scientific informatics Analyst
>>>> Cold Spring Harbor Lab
>>>> Quick Bldg, Lowe Lab
>>>> 1 Bungtown Road
>>>> Cold Spring Harbor, NY - 11724*
>> 
