[Biojava-l] K-mers

Sat Oct 30 09:20:30 UTC 2010

You should be aware that I just found a bug in the code. It has been fixed, but the bug will still be present in the alpha3 release. I would recommend either building a version yourself or, if Andreas can post the continuous integration server address, using the release that will be built there tonight.

Just goes to show you should always do more testing than you think :).

Andy

On 29 Oct 2010, at 20:43, jitesh dundas wrote:

> That is good news. Thanks for the directions, Andy.
> 
> I have already started on this. Let me analyze and write the code now.
> 
> Maybe a deadline of next month is not out of reach in this case.
> 
> Here we go!
> JD
> 
> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> So we've got some basic kmer work now in SVN. If you look in the class
>> SequenceMixin there are two static methods there for generating the two
>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>> the door open there for anyone else to come in & develop it. The k-mers are
>> also not unique across the sequence but it's a start :)
>> 
>> Share & enjoy!
>> 
>> Andy
>> 
>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>> 
>>> I agree, Andy. These have become standard functions that scientists
>>> use these days. I am all for implementing that in BioJava3.
>>> Java isn't that efficient for such functionality, so we will surely
>>> need more effort compared to the same in Python/Perl.
>>> 
>>> Regards,
>>> Jitesh Dundas
>>> 
>>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> So if it's a suffix tree, that's quite a fixed data structure, so
>>>> developing a pluggable mechanism there would be hard. I think there
>>>> also has to be a limit as to what we can sensibly do. If people want to
>>>> contribute this kind of work, though, then it'll all be very well
>>>> received (with the corresponding test environment/cases of course).
>>>> 
>>>> Cheers,
>>>> 
>>>> Andy
>>>> 
>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>> 
>>>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>>>> would allow a developer to use anything from a simple MultiMap to a
>>>>> NoSQL key-value database to store K-mers. You could plug in custom map
>>>>> implementations to allow you to keep a count of the number of instances
>>>>> of particular K-mers that were found. It might also be useful to be able
>>>>> to do set operations on those K-mer collections. You could use it to
>>>>> determine which K-mers were present in a pathogen and not in a host.
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
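
As a rough illustration of the pluggable storage and set-operation idea above, here is a minimal sketch in plain Java. The KmerStore interface and the MapKmerStore and KmerSets classes are names made up for this example and are not part of BioJava; sequences are plain strings rather than BioJava Sequence objects, and the two input strings are toy data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical storage abstraction (an assumption, not a BioJava type).
// A NoSQL-backed implementation could be swapped in behind the same interface.
interface KmerStore {
    void add(String kmer);
    int count(String kmer);
    Set<String> kmers();
}

// Simplest backend: an in-memory map from k-mer to occurrence count.
class MapKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void add(String kmer) {
        counts.merge(kmer, 1, Integer::sum);
    }

    @Override
    public int count(String kmer) {
        return counts.getOrDefault(kmer, 0);
    }

    @Override
    public Set<String> kmers() {
        return counts.keySet();
    }
}

class KmerSets {
    // Fill any store with the overlapping k-mers of a plain string.
    static void load(String seq, int k, KmerStore store) {
        for (int i = 0; i + k <= seq.length(); i++) {
            store.add(seq.substring(i, i + k));
        }
    }

    // K-mers present in 'a' but absent from 'b' (set difference), sorted.
    static Set<String> onlyIn(KmerStore a, KmerStore b) {
        Set<String> result = new TreeSet<>(a.kmers());
        result.removeAll(b.kmers());
        return result;
    }

    public static void main(String[] args) {
        KmerStore pathogen = new MapKmerStore();
        KmerStore host = new MapKmerStore();
        load("ATGATC", 3, pathogen);
        load("ATGAAA", 3, host);
        System.out.println(onlyIn(pathogen, host)); // prints [ATC, GAT]
    }
}
```

Keeping the difference operation outside the store means each backend only has to expose its key set; whether that is acceptable for a very large or remote store is a separate design question.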
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Mark
>>>>> 
>>>>> card.ly: <http://card.ly/phidias51>
>>>>> 
>>>>> 
>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>>> <vishalthapar at gmail.com> wrote:
>>>>> 
>>>>>> Hi Andy,
>>>>>> 
>>>>>> This is good to have. I feel that including it as part of core may not
>>>>>> be necessary, but having it as part of the Genomic module in biojava3
>>>>>> will be nice. There is a project, Bioinformatica
>>>>>> 
>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>>> 
>>>>>> which does something similar, although not exactly. It counts the k-mers
>>>>>> in a given fasta file, but it does not count k-mers for each sequence
>>>>>> within the file, just all within a file. This is a good feature to have,
>>>>>> especially if one is trying to find patterns within sequences, which is
>>>>>> what I am trying to do. It would most certainly be helpful to have a
>>>>>> k-mer counting algorithm that counts k-mer frequency for each sequence.
>>>>>> The way to go would be to use suffix trees. Again, I don't know if
>>>>>> biojava has a suffix tree API or not, since I haven't used Java in a
>>>>>> while and am just switching back to it. A paper on using suffix trees to
>>>>>> generate genome-wide k-mer frequencies is
>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the
>>>>>> software is Tallymer). It would be some work to implement this in Java
>>>>>> as a module for biojava3, but I can see that this will be helpful.
>>>>>> Again, for small fasta files it might not be efficient to create a
>>>>>> suffix tree, but for bigger files I think that might be the way to go.
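
For comparison with the suffix-tree route, the per-sequence counting described above can be sketched with nothing more than a map per FASTA record. This is a minimal plain-Java sketch under stated assumptions: FastaKmerCounter is a hypothetical class, not a BioJava API, the FASTA text is passed in as a string, and k-mers are counted as overlapping windows. It is meant for small inputs, not genome-scale files.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class FastaKmerCounter {
    // Count overlapping k-mers in one sequence; a tail shorter than k is ignored.
    static Map<String, Integer> count(CharSequence seq, int k) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.subSequence(i, i + k).toString(), 1, Integer::sum);
        }
        return counts;
    }

    // Parse FASTA text and return header -> (k-mer -> count), in file order.
    static Map<String, Map<String, Integer>> countPerSequence(String fastaText, int k) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        for (String raw : fastaText.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith(">")) {
                // A new record starts: flush the previous one first.
                if (header != null) {
                    result.put(header, count(seq, k));
                }
                header = line.substring(1);
                seq.setLength(0);
            } else if (header != null) {
                seq.append(line.toUpperCase());
            }
        }
        if (header != null) {
            result.put(header, count(seq, k)); // flush the last record
        }
        return result;
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nATGATG\n>seq2\nAAAA\n";
        // prints {seq1={ATG=2, GAT=1, TGA=1}, seq2={AAA=2}}
        System.out.println(countPerSequence(fasta, 3));
    }
}
```

Each record gets its own map, which is the difference from a whole-file count: the same k-mer is tallied separately for seq1 and seq2.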
>>>>>> 
>>>>>> That's just my two cents. What do you think?
>>>>>> 
>>>>>> -vishal
>>>>>> 
>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>> 
>>>>>>> Hi Vishal,
>>>>>>> 
>>>>>>> As far as I am aware, there is nothing which will generate them in
>>>>>>> BioJava at the moment. However, it is possible to do it with BioJava3:
>>>>>>> 
>>>>>>> // Imports assume the BioJava3 core package layout; the class name
>>>>>>> // KmerExample is only a wrapper so the snippet compiles.
>>>>>>> import java.util.ArrayList;
>>>>>>> import java.util.List;
>>>>>>> import org.biojava3.core.sequence.DNASequence;
>>>>>>> import org.biojava3.core.sequence.compound.NucleotideCompound;
>>>>>>> import org.biojava3.core.sequence.template.Sequence;
>>>>>>> import org.biojava3.core.sequence.template.SequenceView;
>>>>>>> import org.biojava3.core.sequence.views.WindowedSequence;
>>>>>>> 
>>>>>>> public class KmerExample {
>>>>>>> 
>>>>>>>     public static final int KMER = 3;
>>>>>>> 
>>>>>>>     public static void main(String[] args) {
>>>>>>>         DNASequence d = new DNASequence("ATGATC");
>>>>>>>         System.out.println("Non-Overlap");
>>>>>>>         nonOverlap(d);
>>>>>>>         System.out.println("Overlap");
>>>>>>>         overlap(d);
>>>>>>>     }
>>>>>>> 
>>>>>>>     //Generate triplets overlapping
>>>>>>>     public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>         List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>                 new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>         //One windowed view per reading frame (start positions 1..KMER)
>>>>>>>         for(int i=1; i<=KMER; i++) {
>>>>>>>             SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>                     i, d.getLength());
>>>>>>>             WindowedSequence<NucleotideCompound> w =
>>>>>>>                 new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>             l.add(w);
>>>>>>>         }
>>>>>>> 
>>>>>>>         //Will return ATG, ATC, TGA & GAT
>>>>>>>         for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>             for(List<NucleotideCompound> subList: w) {
>>>>>>>                 System.out.println(subList);
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }
>>>>>>> 
>>>>>>>     //Generate triplet Compound lists non-overlapping
>>>>>>>     public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>         WindowedSequence<NucleotideCompound> w =
>>>>>>>                 new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>         //Will return ATG & ATC
>>>>>>>         for(List<NucleotideCompound> subList: w) {
>>>>>>>             System.out.println(subList);
>>>>>>>         }
>>>>>>>     }
>>>>>>> }
>>>>>>> 
>>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>>> of Compounds, so k-mer generation can/will be a memory-intensive
>>>>>>> operation. This does not mean it has to be, since sub-sequences are thin
>>>>>>> wrappers around an underlying sequence. Also, the overlap solution is
>>>>>>> non-optimal since it iterates through each window rather than stepping
>>>>>>> through and delegating onto each base in turn (hence why we get ATG &
>>>>>>> ATC before TGA).
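
The single-pass alternative described above (step one base at a time, so k-mers come out in sequence order) can be sketched in plain Java. This is an illustration only: SlidingKmers is a hypothetical class, not BioJava code, and java.nio.CharBuffer stands in for the "thin wrapper" idea, since its subSequence returns a buffer that shares the original content rather than copying it.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class SlidingKmers implements Iterable<CharSequence> {
    private final CharBuffer bases; // read-only view over the input sequence
    private final int k;

    SlidingKmers(CharSequence sequence, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.bases = CharBuffer.wrap(sequence);
        this.k = k;
    }

    @Override
    public Iterator<CharSequence> iterator() {
        return new Iterator<CharSequence>() {
            private int start = 0; // 0-based offset of the next window

            @Override
            public boolean hasNext() {
                return start + k <= bases.length();
            }

            @Override
            public CharSequence next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // A view onto the shared content; no bases are copied here.
                CharSequence view = bases.subSequence(start, start + k);
                start++;
                return view;
            }
        };
    }

    // Convenience helper: materialise the k-mers as strings, in sequence order.
    static List<String> list(CharSequence sequence, int k) {
        List<String> out = new ArrayList<>();
        for (CharSequence kmer : new SlidingKmers(sequence, k)) {
            out.add(kmer.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(list("ATGATC", 3)); // prints [ATG, TGA, GAT, ATC]
    }
}
```

Copies are only made when a caller asks for a String, so a consumer that just inspects each window can stay allocation-light.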
>>>>>>> 
>>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>> (prefix tree).
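
To make the trie suggestion concrete, here is a minimal plain-Java sketch. KmerTrie is a hypothetical class written for this note, not something in BioJava, and it works on plain strings. Each k-mer is stored once as a path of nodes, with a counter on the node where it ends, so listing the paths gives the unique k-mers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KmerTrie {
    private static final class Node {
        // Sorted children so traversal yields k-mers in lexicographic order.
        final Map<Character, Node> children = new TreeMap<>();
        int count; // occurrences of the k-mer ending at this node
    }

    private final Node root = new Node();

    // Insert one k-mer, creating nodes along its path as needed.
    void add(CharSequence kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            node = node.children.computeIfAbsent(kmer.charAt(i), c -> new Node());
        }
        node.count++;
    }

    // Number of times the given k-mer was added (0 if never seen).
    int count(CharSequence kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length() && node != null; i++) {
            node = node.children.get(kmer.charAt(i));
        }
        return node == null ? 0 : node.count;
    }

    // Every distinct k-mer stored in the trie, each listed once.
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    // Build a trie from all overlapping k-mers of a plain string.
    static KmerTrie of(String seq, int k) {
        KmerTrie trie = new KmerTrie();
        for (int i = 0; i + k <= seq.length(); i++) {
            trie.add(seq.substring(i, i + k));
        }
        return trie;
    }

    public static void main(String[] args) {
        KmerTrie trie = KmerTrie.of("ATGATGA", 3);
        System.out.println(trie.uniqueKmers()); // prints [ATG, GAT, TGA]
        System.out.println(trie.count("ATG"));  // prints 2
    }
}
```

K-mers that share a prefix share nodes, which is where a trie can save space over a flat map of strings when k is large.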
>>>>>>> 
>>>>>>> Hope this helps,
>>>>>>> 
>>>>>>> Andy
>>>>>>> 
>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>> 
>>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> I had a quick question: Does BioJava have a method to generate k-mers
>>>>>>>> or do k-mer counting for a given FASTA sequence/file? Basically, I want
>>>>>>>> k-mer counts for every sequence in a fasta file. If something like this
>>>>>>>> exists, it would save me the time of writing the code.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Vishal
>>>>>>>> _______________________________________________
>>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>> 
>>>>>>> --
>>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>>> 
>>>>>> --
>>>>>> Vishal Thapar, Ph.D.
>>>>>> Scientific informatics Analyst
>>>>>> Cold Spring Harbor Lab
>>>>>> Quick Bldg, Lowe Lab
>>>>>> 1 Bungtown Road
>>>>>> Cold Spring Harbor, NY - 11724

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/