[Biojava-l] K-mers

Andy Yates ayates at ebi.ac.uk
Fri Oct 29 18:35:43 UTC 2010


So if it's a suffix tree, that's quite a fixed data structure, so developing a pluggable mechanism there would be hard. I think there also has to be a limit as to what we can sensibly do. If people want to contribute this kind of work, though, it would all be very well received (with the corresponding test environment/cases, of course).

Cheers,

Andy

On 29 Oct 2010, at 17:56, Mark Fortner wrote:

> It might be useful to make the K-mer storage mechanism pluggable.  This
> would allow a developer to use anything from a simple MultiMap, to a NoSQL
> key-value database to store K-mers.  You could plug in custom map
> implementations to allow you to keep a count of the number of instances of
> particular K-mers that were found.  It might also be useful to be able to do
> set operations on those K-mer collections.  You could use it to determine
> which K-mers were present in a pathogen and not in a host.
> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> http://www.ncbi.nlm.nih.gov/pubmed/16403026
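A minimal sketch of the pluggable storage idea described above, under stated assumptions: the `KmerStore` interface, the `MapKmerStore` implementation and the `KmerSets` helper are hypothetical names invented for illustration and are not part of BioJava. K-mers are held as plain strings in an in-memory map; a NoSQL-backed store would implement the same interface.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical storage abstraction (not a BioJava type).
interface KmerStore {
    void add(String kmer);      // record one occurrence of a k-mer
    int count(String kmer);     // occurrences seen so far, 0 if none
    Set<String> kmers();        // snapshot of the distinct k-mers stored
}

// Simplest possible backing store: an in-memory map from k-mer to count.
final class MapKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void add(String kmer) {
        counts.merge(kmer, 1, Integer::sum);
    }

    @Override
    public int count(String kmer) {
        return counts.getOrDefault(kmer, 0);
    }

    @Override
    public Set<String> kmers() {
        // Return a copy so callers can modify the result freely.
        return new HashSet<>(counts.keySet());
    }
}

final class KmerSets {
    private KmerSets() {}

    // Adds every overlapping k-mer of seq to the given store.
    static void load(KmerStore store, String seq, int k) {
        for (int i = 0; i + k <= seq.length(); i++) {
            store.add(seq.substring(i, i + k));
        }
    }

    // Set difference: k-mers present in 'pathogen' but absent from 'host'.
    static Set<String> difference(KmerStore pathogen, KmerStore host) {
        Set<String> result = pathogen.kmers();
        result.removeAll(host.kmers());
        return result;
    }
}
```

Loading "ATGATC" as the pathogen and "ATGA" as the host with k = 3 leaves GAT and ATC in the difference. Pulling whole key sets into memory only suits small collections; a database-backed store would more likely push the set operation down to the database.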
> 
> Cheers,
> 
> Mark
> 
> card.ly: <http://card.ly/phidias51>
> 
> 
> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com> wrote:
> 
>> Hi Andy,
>> 
>> This is good to have. I feel that including it as a part of core may not be
>> necessary but having it as part of Genomic module in biojava3 will be nice.
>> There is a project, Bioinformatica
>> 
>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>> 
>> which does something similar, although not exactly. It counts the k-mers in a
>> given fasta file, but it does not count k-mers for each sequence within the
>> file, just all within the file. This is a good feature to have, especially if
>> one is trying to find patterns within sequences, which is what I am trying to
>> do. It would most certainly be helpful to have a k-mer counting algorithm
>> that counts k-mer frequency for each sequence. The way to go would be to use
>> suffix trees. Again, I don't know if biojava has a suffix tree api or not,
>> since I haven't used java in a while and am just switching back to it. A
>> paper on using suffix trees to generate genome-wide k-mer frequencies is
>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the
>> software is Tallymer). It would be some work to implement this in java as a
>> module for biojava3, but I can see that this will be helpful. Again, for small
>> fasta files it might not be efficient to create a suffix tree, but for bigger
>> files I think that might be the way to go.
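For the small-file case mentioned above, per-sequence counting does not need a suffix tree. The following is a rough sketch only, assuming plain strings and a hash-style map rather than any BioJava type; `FastaKmerCounter` and its methods are hypothetical names, and the FASTA handling is deliberately naive (header lines start with `>`, everything else is sequence).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava.
final class FastaKmerCounter {
    private FastaKmerCounter() {}

    // Counts overlapping k-mers within a single sequence.
    static Map<String, Integer> countKmers(CharSequence seq, int k) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.subSequence(i, i + k).toString(), 1, Integer::sum);
        }
        return counts;
    }

    // Takes the lines of a FASTA file and returns header -> k-mer counts,
    // keeping the records in file order. A repeated header overwrites the
    // earlier record of the same name.
    static Map<String, Map<String, Integer>> countPerSequence(Iterable<String> lines, int k) {
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                current = new StringBuilder();
                records.put(line.substring(1), current);
            } else if (current != null) {
                // Sequence data may span several lines; join them first so
                // k-mers crossing a line break are still counted.
                current.append(line.toUpperCase());
            }
        }
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : records.entrySet()) {
            result.put(e.getKey(), countKmers(e.getValue(), k));
        }
        return result;
    }
}
```

The lines can come from `java.nio.file.Files.readAllLines` for a file that fits in memory. Each record is buffered whole before counting, which is exactly the cost that a suffix-tree or streaming approach would try to avoid for genome-sized input.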
>> 
>> That's just my two cents. What do you think?
>> 
>> -vishal
>> 
>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> 
>>> Hi Vishal,
>>> 
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at the moment. However it is possible to do it with BioJava3:
>>> 
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>> 
>>> public static final int KMER = 3;
>>> 
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>               i, d.getLength());
>>>       WindowedSequence<NucleotideCompound> w =
>>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>       l.add(w);
>>>   }
>>> 
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>       for(List<NucleotideCompound> subList: w) {
>>>           System.out.println(subList);
>>>       }
>>>   }
>>> }
>>> 
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>           new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>   }
>>> }
>>> 
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation. This
>>> does not mean it has to be, since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through, delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA).
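To illustrate the ordering point, here is a minimal sketch of stepping through one base at a time. It assumes a plain `String` instead of a BioJava `Sequence`, and `SlidingKmers` is a hypothetical name, so it shows the iteration order only and says nothing about how BioJava would implement it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava.
final class SlidingKmers {
    private SlidingKmers() {}

    // Returns the overlapping k-mers of seq in positional order by moving
    // the window start forward one base at a time.
    static List<String> overlapping(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<String> kmers = new ArrayList<>();
        for (int start = 0; start + k <= seq.length(); start++) {
            kmers.add(seq.substring(start, start + k));
        }
        return kmers;
    }

    public static void main(String[] args) {
        // Prints [ATG, TGA, GAT, ATC]
        System.out.println(overlapping("ATGATC", 3));
    }
}
```

The same four triplets come out as in the windowed version, but in the order they occur along the sequence. Each `substring` call still copies its characters, so this does nothing for the memory concern.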
>>> 
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
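A rough sketch of the Trie idea, assuming a fixed ACGT alphabet and reading "unique" as "distinct". `KmerTrie` is a hypothetical class written for illustration, not an existing BioJava API. Each k-mer is a root-to-node path, and the node where it ends carries its count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical prefix tree over the DNA alphabet, not part of BioJava.
final class KmerTrie {
    private static final String ALPHABET = "ACGT";

    private static final class Node {
        final Node[] children = new Node[ALPHABET.length()];
        int count; // how many times a k-mer ending at this node was inserted
    }

    private final Node root = new Node();

    // Inserts one k-mer; rejects characters outside ACGT.
    void insert(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            int idx = ALPHABET.indexOf(kmer.charAt(i));
            if (idx < 0) {
                throw new IllegalArgumentException("Unsupported base: " + kmer.charAt(i));
            }
            if (node.children[idx] == null) {
                node.children[idx] = new Node();
            }
            node = node.children[idx];
        }
        node.count++;
    }

    // Number of times the k-mer was inserted, 0 if it was never seen.
    int count(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length() && node != null; i++) {
            int idx = ALPHABET.indexOf(kmer.charAt(i));
            node = idx < 0 ? null : node.children[idx];
        }
        return node == null ? 0 : node.count;
    }

    // Distinct k-mers in lexicographic (A < C < G < T) order.
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    // Depth-first walk that rebuilds each stored k-mer from its path.
    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix.toString());
        }
        for (int i = 0; i < node.children.length; i++) {
            if (node.children[i] != null) {
                prefix.append(ALPHABET.charAt(i));
                collect(node.children[i], prefix, out);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }
}
```

Inserting the triplets of "ATGATG" (ATG, TGA, GAT, ATG) gives a count of 2 for ATG and three distinct k-mers. K-mers that share a prefix share nodes, which is where any saving over a flat map would come from.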
>>> 
>>> Hope this helps,
>>> 
>>> Andy
>>> 
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file. If something like this exists it
>>>> would save me some time to write the code.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
>> 
>> --
>> *Vishal Thapar, Ph.D.*
>> *Scientific informatics Analyst
>> Cold Spring Harbor Lab
>> Quick Bldg, Lowe Lab
>> 1 Bungtown Road
>> Cold Spring Harbor, NY - 11724*

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/







