[Biojava-l] K-mers

jitesh dundas jbdundas at gmail.com
Fri Oct 29 18:50:11 UTC 2010


I agree, Andy. K-mer generation and counting have become standard tasks
for scientists these days, so I am all for implementing them in BioJava3.
Java isn't that efficient for this kind of work, so we will surely
need more effort than the same would take in Python/Perl.

Regards,
Jitesh Dundas

On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> So if it's a suffix tree, that's quite a fixed data structure, so developing
> a pluggable mechanism there would be hard. I think there also
> has to be a limit as to what we can sensibly do. If people want to
> contribute this kind of work, though, then it would all be very well received
> (with the corresponding test environment/cases, of course).
>
> Cheers,
>
> Andy
>
> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>
>> It might be useful to make the K-mer storage mechanism pluggable.  This
>> would allow a developer to use anything from a simple MultiMap to a NoSQL
>> key-value database to store K-mers.  You could plug in custom map
>> implementations to keep a count of the number of instances of each
>> K-mer found.  It might also be useful to be able to do set operations on
>> those K-mer collections.  You could use them to determine
>> which K-mers were present in a pathogen and not in a host.
>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
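
To make the pluggable-storage idea concrete, here is a minimal sketch in plain Java. Nothing in it is BioJava API: the KmerStore interface, MapKmerStore and KmerStoreDemo are hypothetical names invented for illustration, and sequences are plain Strings rather than BioJava Sequence objects. It only shows the shape of the idea: one small interface, a map-backed implementation that keeps counts, and a set difference between two collections.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical storage interface (not part of BioJava). A NoSQL-backed
// implementation could be swapped in behind the same three methods.
interface KmerStore {
    void add(String kmer);
    int count(String kmer);
    Set<String> kmers();
}

// Simplest backend: an in-memory map from k-mer to occurrence count.
class MapKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void add(String kmer) {
        Integer old = counts.get(kmer);
        counts.put(kmer, old == null ? 1 : old + 1);
    }

    public int count(String kmer) {
        Integer c = counts.get(kmer);
        return c == null ? 0 : c;
    }

    public Set<String> kmers() {
        // Return a copy so callers can modify the result freely.
        return new HashSet<String>(counts.keySet());
    }
}

class KmerStoreDemo {
    // Add every overlapping k-mer of the sequence to the given store.
    static void index(String seq, int k, KmerStore store) {
        for (int i = 0; i + k <= seq.length(); i++) {
            store.add(seq.substring(i, i + k));
        }
    }

    // Set difference: k-mers seen in 'a' but never in 'b'.
    static Set<String> onlyIn(KmerStore a, KmerStore b) {
        Set<String> result = a.kmers();
        result.removeAll(b.kmers());
        return result;
    }

    public static void main(String[] args) {
        KmerStore pathogen = new MapKmerStore();
        KmerStore host = new MapKmerStore();
        index("ATGATC", 3, pathogen);
        index("GATCGA", 3, host);
        // ATG and TGA occur only in the first sequence.
        System.out.println(onlyIn(pathogen, host));
    }
}
```

Because callers only ever see KmerStore, the backend can change without touching the indexing or set-operation code, which is the point of making storage pluggable.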
>>
>> Cheers,
>>
>> Mark
>>
>> card.ly: <http://card.ly/phidias51>
>>
>>
>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>> <vishalthapar at gmail.com> wrote:
>>
>>> Hi Andy,
>>>
>>> This is good to have. I feel that including it as part of core may not be
>>> necessary, but having it as part of the Genomic module in biojava3 would be
>>> nice. There is a project, Bioformatica
>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>> which does something similar, although not exactly. It counts the k-mers in a
>>> given fasta file, but only across the whole file, not for each sequence
>>> within it. Per-sequence counting is a good feature to have, especially if
>>> one is trying to find patterns within sequences, which is what I am trying
>>> to do. It would most certainly be helpful to have a k-mer counting algorithm
>>> that counts k-mer frequency for each sequence. The way to go would be to use
>>> suffix trees. Again, I don't know whether biojava has a suffix tree API,
>>> since I haven't used Java in a while and am just switching back to it. A
>>> paper on using suffix trees to generate genome-wide k-mer frequencies is
>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the
>>> software is Tallymer). It would be some work to implement this in Java as a
>>> module for biojava3, but I can see that it would be helpful. Again, for small
>>> fasta files it might not be efficient to create a suffix tree, but for bigger
>>> files I think that might be the way to go.
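
For small inputs, per-sequence counting can be sketched without a suffix tree at all. The example below is plain Java with a hash-free sorted map per record; it is not BioJava API and not the Tallymer approach. The class name PerSequenceKmerCounter and its methods are made up for illustration, and the FASTA handling is deliberately naive (it takes the file contents as a String and assumes well-formed records).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava.
class PerSequenceKmerCounter {

    // Count overlapping k-mers in a single sequence.
    static Map<String, Integer> count(String seq, int k) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            Integer old = counts.get(kmer);
            counts.put(kmer, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Parse FASTA text and return, for each record header (in file order),
    // the k-mer counts of that record's sequence.
    static Map<String, Map<String, Integer>> countFasta(String fastaText, int k) {
        Map<String, Map<String, Integer>> result =
                new LinkedHashMap<String, Map<String, Integer>>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        for (String rawLine : fastaText.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.startsWith(">")) {
                // A new record starts: flush the previous one, if any.
                if (header != null) {
                    result.put(header, count(seq.toString(), k));
                }
                header = line.substring(1);
                seq.setLength(0);
            } else if (header != null) {
                // Sequence data may span several lines; join them.
                seq.append(line.toUpperCase());
            }
        }
        if (header != null) {
            result.put(header, count(seq.toString(), k));
        }
        return result;
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nATGATG\n>seq2\nAAAA\n";
        System.out.println(countFasta(fasta, 3));
    }
}
```

With k = 3 the first record yields ATG twice plus GAT and TGA once each, and the second yields AAA twice. Memory grows with the number of distinct k-mers per record, so this is only a baseline to compare a suffix-tree-based module against.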
>>>
>>> That's just my two cents. What do you think?
>>>
>>> -vishal
>>>
>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> As far as I am aware there is nothing which will generate them in BioJava
>>>> at the moment. However, it is possible to do it with BioJava3:
>>>>
>>>> public static void main(String[] args) {
>>>>   DNASequence d = new DNASequence("ATGATC");
>>>>   System.out.println("Non-Overlap");
>>>>   nonOverlap(d);
>>>>   System.out.println("Overlap");
>>>>   overlap(d);
>>>> }
>>>>
>>>> public static final int KMER = 3;
>>>>
>>>> //Generate triplets overlapping
>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>   for(int i=1; i<=KMER; i++) {
>>>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>               i, d.getLength());
>>>>       WindowedSequence<NucleotideCompound> w =
>>>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>       l.add(w);
>>>>   }
>>>>
>>>>   //Will return ATG, ATC, TGA & GAT
>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>>       for(List<NucleotideCompound> subList: w) {
>>>>           System.out.println(subList);
>>>>       }
>>>>   }
>>>> }
>>>>
>>>> //Generate triplet Compound lists non-overlapping
>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>   WindowedSequence<NucleotideCompound> w =
>>>>           new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>   //Will return ATG & ATC
>>>>   for(List<NucleotideCompound> subList: w) {
>>>>       System.out.println(subList);
>>>>   }
>>>> }
>>>>
>>>> The disadvantage of all of these solutions is that they generate lists of
>>>> Compounds, so kmer generation can/will be a memory-intensive operation. This
>>>> does not mean it has to be, since sub sequences are thin wrappers around an
>>>> underlying sequence. Also, the overlap solution is non-optimal since it
>>>> iterates through each window in turn rather than stepping through the
>>>> sequence one base at a time (hence why we get ATG & ATC before TGA).
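
The "step through one base at a time" alternative can be sketched as a lazy iterator. This is plain Java over a String, not the BioJava WindowedSequence API, and the class name SlidingKmers is hypothetical. It illustrates the ordering only: each k-mer is produced at its own start position, so the results come out in positional order.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of BioJava: yields overlapping k-mers
// lazily, advancing the window start by one base per step.
class SlidingKmers implements Iterable<String> {
    private final String seq;
    private final int k;

    SlidingKmers(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.seq = seq;
        this.k = k;
    }

    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int start = 0;

            public boolean hasNext() {
                // A full window must still fit inside the sequence.
                return start + k <= seq.length();
            }

            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String kmer = seq.substring(start, start + k);
                start++;
                return kmer;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    public static void main(String[] args) {
        // Prints ATG, TGA, GAT, ATC, one per line.
        for (String kmer : new SlidingKmers("ATGATC", 3)) {
            System.out.println(kmer);
        }
    }
}
```

Only one k-mer exists at a time unless the caller keeps them, although each substring is still a copy; a view-based version on top of the BioJava sequence types would avoid even that.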
>>>>
>>>> As for unique k-mers that's something which would require a bit more
>>>> engineering & would be better suited to a solution built around a Trie
>>>> (prefix tree).
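
As a rough illustration of the Trie idea, the sketch below stores k-mers in a prefix tree with a counter at the node where each k-mer ends. It is plain Java, not BioJava API; KmerTrie and its methods are hypothetical names. "Unique" is read here as "distinct": distinct() lists every stored k-mer once, and count() lets the caller pick out those seen exactly once if that is the meaning wanted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical prefix tree for k-mers, not part of BioJava.
class KmerTrie {
    private static final class Node {
        // Sorted children so that traversal order is alphabetical.
        final Map<Character, Node> children = new TreeMap<Character, Node>();
        int count; // how many inserted k-mers end at this node
    }

    private final Node root = new Node();

    // Insert one k-mer, creating nodes along its path as needed.
    void insert(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            char c = kmer.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.count++;
    }

    // Number of times this exact k-mer was inserted (0 if never).
    int count(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            node = node.children.get(kmer.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    // All distinct k-mers stored, in alphabetical order.
    List<String> distinct() {
        List<String> out = new ArrayList<String>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    // Depth-first walk that rebuilds each k-mer from the path to its node.
    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey().charValue());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie();
        String seq = "ATGATG";
        for (int i = 0; i + 3 <= seq.length(); i++) {
            trie.insert(seq.substring(i, i + 3));
        }
        System.out.println(trie.distinct());
        System.out.println(trie.count("ATG"));
    }
}
```

K-mers that share a prefix share nodes, so the structure never stores a repeated k-mer twice; whether it beats a plain hash map in practice depends on k and on the alphabet size.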
>>>>
>>>> Hope this helps,
>>>>
>>>> Andy
>>>>
>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>>> do K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>>> counts for every sequence in a fasta file. If something like this exists, it
>>>>> would save me some time writing the code.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Vishal
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>> --
>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>
>>>
>>> --
>>> *Vishal Thapar, Ph.D.*
>>> *Scientific informatics Analyst
>>> Cold Spring Harbor Lab
>>> Quick Bldg, Lowe Lab
>>> 1 Bungtown Road
>>> Cold Spring Harbor, NY - 11724*