[Biojava-l] K-mers

Fri Oct 29 19:43:38 UTC 2010

That is good news. Thanks for the directions, Andy.

I have already started on this. Let me analyze and write the code now.

A deadline of next month may well be reachable in this case.

Here we go!
JD

On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> So we've got some basic kmer work now in SVN. If you look in the class
> SequenceMixin there are two static methods there for generating the two
> types of k-mers. It's not developed with Map storage in mind & I'll leave
> the door open there for anyone else to come in & develop it. The k-mers are
> also not unique across the sequence but it's a start :)
>
> Share & enjoy!
>
> Andy
>
> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>
>> I agree, Andy. These have become standard operations that
>> scientists perform these days. I am all for implementing that in BioJava3.
>> Java isn't that efficient for such functionality, so we will surely
>> need more effort compared to doing the same in Python/Perl.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> So if it's a suffix tree, that's quite a fixed data structure, so
>>> developing a pluggable mechanism there would be hard. I think there also
>>> has to be a limit as to what we can sensibly do. If people want to
>>> contribute this kind of work, though, then it'll all be very well received
>>> (with the corresponding test environment/cases of course).
>>>
>>> Cheers,
>>>
>>> Andy
>>>
>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>
>>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>>> would allow a developer to use anything from a simple MultiMap to a NoSQL
>>>> key-value database to store K-mers. You could plug in custom map
>>>> implementations to keep a count of the number of instances of particular
>>>> K-mers that were found. It might also be useful to be able to do set
>>>> operations on those K-mer collections. You could use it to determine
>>>> which K-mers were present in a pathogen and not in a host.
>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>
>>>> Cheers,
>>>>
>>>> Mark
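
A minimal sketch of the pluggable storage idea Mark describes above, assuming k-mers are handled as plain strings. The names KmerStore, MapKmerStore and KmerSets are hypothetical and are not part of BioJava; a MultiMap or key-value database backend would be another implementation of the same interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical storage abstraction: any backend that can count k-mers.
interface KmerStore {
    void add(String kmer);
    int count(String kmer);
    Set<String> kmers();
}

// Simplest possible backend: an in-memory HashMap of k-mer -> count.
class MapKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<>();

    public void add(String kmer) { counts.merge(kmer, 1, Integer::sum); }
    public int count(String kmer) { return counts.getOrDefault(kmer, 0); }
    public Set<String> kmers() { return counts.keySet(); }
}

class KmerSets {
    // Adds every overlapping k-mer of seq to the given store.
    static void load(String seq, int k, KmerStore store) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        for (int i = 0; i + k <= seq.length(); i++) {
            store.add(seq.substring(i, i + k));
        }
    }

    // Set difference: k-mers found in 'present' but not in 'absent'.
    static Set<String> difference(KmerStore present, KmerStore absent) {
        Set<String> result = new TreeSet<>(present.kmers());
        result.removeAll(absent.kmers());
        return result;
    }

    public static void main(String[] args) {
        KmerStore pathogen = new MapKmerStore();
        KmerStore host = new MapKmerStore();
        load("ATGATC", 3, pathogen); // made-up example sequences
        load("ATGAAA", 3, host);
        System.out.println(difference(pathogen, host)); // [ATC, GAT]
    }
}
```

Because the loader only sees the interface, swapping the HashMap for another backend would not change the calling code.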
>>>>
>>>> card.ly: <http://card.ly/phidias51>
>>>>
>>>>
>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>> <vishalthapar at gmail.com> wrote:
>>>>
>>>>> Hi Andy,
>>>>>
>>>>> This is good to have. I feel that including it as part of core may not
>>>>> be necessary, but having it as part of the Genomic module in biojava3
>>>>> would be nice. There is a project, Bioinformatica
>>>>>
>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>>
>>>>> which does something similar, although not exactly. It counts the k-mers
>>>>> in a given fasta file, but it does not count k-mers for each sequence
>>>>> within the file, just all within a file. This is a good feature to have,
>>>>> especially if one is trying to find patterns within sequences, which is
>>>>> what I am trying to do. It would most certainly be helpful to have a
>>>>> k-mer counting algorithm that counts k-mer frequency for each sequence.
>>>>> The way to go would be to use suffix trees. Again, I don't know if
>>>>> biojava has a suffix tree API or not, since I haven't used Java in a
>>>>> while and am just switching back to it. A paper on using suffix trees to
>>>>> generate genome-wide k-mer frequencies is:
>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the
>>>>> software is Tallymer). It would be some work to implement this in Java
>>>>> as a module for biojava3, but I can see that this will be helpful.
>>>>> Again, for small fasta files it might not be efficient to create a
>>>>> suffix tree, but for bigger files I think that might be the way to go.
>>>>>
>>>>> That's just my two cents. What do you think?
>>>>>
>>>>> -vishal
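
As a much simpler relative of the suffix-based approach Vishal mentions, the sketch below sorts the start positions of all k-mers (a naive stand-in for a suffix array truncated to length k) and counts runs of equal k-mers. It is not a suffix tree and not the Tallymer algorithm; the class name SortedKmerCounter is hypothetical, and the sequence is assumed to be a plain string.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class SortedKmerCounter {
    // Returns k-mer -> frequency, with keys in lexicographic order.
    static Map<String, Integer> count(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int n = seq.length() - k + 1; // number of k-mer start positions
        if (k <= 0 || n <= 0) return counts;

        Integer[] starts = new Integer[n];
        for (int i = 0; i < n; i++) starts[i] = i;

        // Sort positions by the k-mer that begins at each position.
        Arrays.sort(starts, (a, b) -> compare(seq, a, b, k));

        // Equal k-mers are now adjacent, so each run gives one frequency.
        int runStart = 0;
        for (int i = 1; i <= n; i++) {
            if (i == n || compare(seq, starts[runStart], starts[i], k) != 0) {
                int pos = starts[runStart];
                counts.put(seq.substring(pos, pos + k), i - runStart);
                runStart = i;
            }
        }
        return counts;
    }

    // Compares the k-mers starting at positions a and b without copying them.
    private static int compare(String seq, int a, int b, int k) {
        for (int i = 0; i < k; i++) {
            int diff = seq.charAt(a + i) - seq.charAt(b + i);
            if (diff != 0) return diff;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(count("ATGATG", 3)); // {ATG=2, GAT=1, TGA=1}
    }
}
```

The sort only keeps an array of positions, so no k-mer strings are materialised until a run is reported.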
>>>>>
>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>
>>>>>> Hi Vishal,
>>>>>>
>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>> BioJava
>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>
>>>>>> public static void main(String[] args) {
>>>>>>  DNASequence d = new DNASequence("ATGATC");
>>>>>>  System.out.println("Non-Overlap");
>>>>>>  nonOverlap(d);
>>>>>>  System.out.println("Overlap");
>>>>>>  overlap(d);
>>>>>> }
>>>>>>
>>>>>> public static final int KMER = 3;
>>>>>>
>>>>>> //Generate triplets overlapping
>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>  List<WindowedSequence<NucleotideCompound>> l =
>>>>>>          new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>  for(int i=1; i<=KMER; i++) {
>>>>>>      SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>              i, d.getLength());
>>>>>>      WindowedSequence<NucleotideCompound> w =
>>>>>>          new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>      l.add(w);
>>>>>>  }
>>>>>>
>>>>>>  //Will return ATG, ATC, TGA & GAT
>>>>>>  for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>      for(List<NucleotideCompound> subList: w) {
>>>>>>          System.out.println(subList);
>>>>>>      }
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>  WindowedSequence<NucleotideCompound> w =
>>>>>>          new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>  //Will return ATG & ATC
>>>>>>  for(List<NucleotideCompound> subList: w) {
>>>>>>      System.out.println(subList);
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>> of Compounds, so k-mer generation can/will be a memory-intensive
>>>>>> operation. This doesn't mean it has to be, since sub sequences are thin
>>>>>> wrappers around an underlying sequence. Also, the overlap solution is
>>>>>> non-optimal since it iterates through each window rather than stepping
>>>>>> through, delegating onto each base in turn (hence why we get ATG & ATC
>>>>>> before TGA).
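
To illustrate the alternative of stepping through each base in turn, here is a minimal sketch that works on a plain String rather than on BioJava's Sequence and WindowedSequence types. The class and method names are hypothetical. Advancing the window one base at a time yields the overlapping k-mers in positional order.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingKmers {
    // Returns every overlapping k-mer of seq, ordered by start position.
    static List<String> overlapping(String seq, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        List<String> kmers = new ArrayList<>();
        // Advance the window by one base per iteration.
        for (int start = 0; start + k <= seq.length(); start++) {
            kmers.add(seq.substring(start, start + k));
        }
        return kmers;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("ATGATC", 3)); // [ATG, TGA, GAT, ATC]
    }
}
```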
>>>>>>
>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>> (prefix tree).
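
A rough sketch of how a Trie could hold unique k-mers, again assuming plain strings rather than BioJava compounds. KmerTrie and its methods are hypothetical names. Each k-mer is a path from the root, so repeated k-mers share one path and only bump a counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KmerTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // how many times the k-mer ending at this node was added
    }

    private final Node root = new Node();

    // Inserts one k-mer, creating nodes along its path as needed.
    void add(String kmer) {
        Node node = root;
        for (char base : kmer.toCharArray()) {
            node = node.children.computeIfAbsent(base, b -> new Node());
        }
        node.count++;
    }

    // Returns how often the k-mer was added, or 0 if it was never seen.
    int count(String kmer) {
        Node node = root;
        for (char base : kmer.toCharArray()) {
            node = node.children.get(base);
            if (node == null) return 0;
        }
        return node.count;
    }

    // Returns the distinct k-mers in lexicographic order.
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) out.add(prefix.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    // Builds a trie from all overlapping k-mers of seq.
    static KmerTrie fromSequence(String seq, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        KmerTrie trie = new KmerTrie();
        for (int i = 0; i + k <= seq.length(); i++) {
            trie.add(seq.substring(i, i + k));
        }
        return trie;
    }

    public static void main(String[] args) {
        KmerTrie trie = fromSequence("ATGATG", 3);
        System.out.println(trie.uniqueKmers()); // [ATG, GAT, TGA]
        System.out.println(trie.count("ATG"));  // 2
    }
}
```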
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>>>> or do K-mer counting for a given Fasta sequence/file? Basically, I
>>>>>>> want k-mer counts for every sequence in a fasta file. If something
>>>>>>> like this exists, it would save me some time writing the code.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Vishal
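
For reference, a minimal sketch of what is being asked for here: k-mer counts kept separately for every record in a FASTA file. It uses only the JDK rather than any BioJava API, handles only the basic ">header" plus sequence-lines layout, and the name FastaKmerCounter is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class FastaKmerCounter {
    // Returns header -> (k-mer -> count), with records kept in file order.
    // Sequence lines that appear before the first header are ignored.
    static Map<String, Map<String, Integer>> countPerSequence(Reader in, int k) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                if (line.startsWith(">")) {
                    // A new record starts: flush the previous one, if any.
                    if (header != null) result.put(header, countKmers(seq, k));
                    header = line.substring(1);
                    seq.setLength(0);
                } else {
                    seq.append(line.toUpperCase());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (header != null) result.put(header, countKmers(seq, k));
        return result;
    }

    // Counts every overlapping k-mer in one sequence.
    static Map<String, Integer> countKmers(CharSequence seq, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.subSequence(i, i + k).toString(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nATGA\nTG\n>seq2\nAAAA\n"; // made-up input
        System.out.println(countPerSequence(new StringReader(fasta), 3));
        // {seq1={ATG=2, GAT=1, TGA=1}, seq2={AAA=2}}
    }
}
```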
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>> --
>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>
>>>>>
>>>>> --
>>>>> *Vishal Thapar, Ph.D.*
>>>>> *Scientific informatics Analyst
>>>>> Cold Spring Harbor Lab
>>>>> Quick Bldg, Lowe Lab
>>>>> 1 Bungtown Road
>>>>> Cold Spring Harbor, NY - 11724*
>>>
>