[Biojava-l] K-mers
Andy Yates
ayates at ebi.ac.uk
Sat Oct 30 09:20:30 UTC 2010
You should be aware that I just found a bug in the code. It has been fixed, but the bug will still be present in the alpha3 release. I would recommend either building a version yourself or, if Andreas can post the continuous integration server address, picking up the build that will be released there tonight.
Just goes to show you should always do more testing than you think :).
Andy
On 29 Oct 2010, at 20:43, jitesh dundas wrote:
> That is good news. Thanks for the directions, Andy.
>
> I have already started on this. Let me analyze and write the code now.
>
> Maybe a next month deadline is not unreachable in this case.
>
> Here we go!
> JD
>
> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> So we've got some basic kmer work now in SVN. If you look in the class
>> SequenceMixin there are two static methods there for generating the two
>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>> the door open there for anyone else to come in & develop it. The k-mers are
>> also not unique across the sequence but it's a start :)
>>
>> Share & enjoy!
>>
>> Andy
>>
>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>>
>>> I agree, Andy. These have become standard functions that scientists
>>> rely on these days. I am all for implementing them in BioJava3.
>>> Java isn't that efficient for such functionality, so we will surely
>>> need more effort compared to doing the same in Python/Perl.
>>>
>>> Regards,
>>> Jitesh Dundas
>>>
>>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> So if it's a suffix tree, that's quite a fixed data structure, so
>>>> developing a pluggable mechanism there would be hard. I think there
>>>> also has to be a limit as to what we can sensibly do. If people want
>>>> to contribute this kind of work, though, it will all be very well
>>>> received (with the corresponding test environment/cases, of course).
>>>>
>>>> Cheers,
>>>>
>>>> Andy
>>>>
>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>>
>>>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>>>> would allow a developer to use anything from a simple MultiMap to a
>>>>> NoSQL key-value database to store K-mers. You could plug in custom map
>>>>> implementations to keep a count of the number of instances of
>>>>> particular K-mers that were found. It might also be useful to be able
>>>>> to do set operations on those K-mer collections. You could use it to
>>>>> determine which K-mers were present in a pathogen and not in a host.
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
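
The pluggable storage idea above could be sketched roughly as follows. This is only an illustration on plain Strings: the `KmerStore` interface, `MapKmerStore`, `index` and `difference` are hypothetical names, not part of BioJava, and a NoSQL-backed store would simply be another implementation of the same interface.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class KmerStoreSketch {

    /** Hypothetical storage abstraction; NOT a BioJava interface. */
    public interface KmerStore {
        void add(String kmer);      // record one occurrence of a k-mer
        int count(String kmer);     // occurrences seen so far (0 if none)
        Set<String> kmers();        // the distinct k-mers seen so far
    }

    /** Simplest possible backing store: an in-memory HashMap of counts. */
    public static class MapKmerStore implements KmerStore {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void add(String kmer) {
            counts.merge(kmer, 1, Integer::sum);
        }

        @Override
        public int count(String kmer) {
            return counts.getOrDefault(kmer, 0);
        }

        @Override
        public Set<String> kmers() {
            // Return a copy so callers can modify the result freely.
            return new HashSet<>(counts.keySet());
        }
    }

    /** Adds every overlapping k-mer of seq to the given store. */
    static void index(String seq, int k, KmerStore store) {
        for (int i = 0; i + k <= seq.length(); i++) {
            store.add(seq.substring(i, i + k));
        }
    }

    /** Set difference: k-mers present in a but absent from b. */
    static Set<String> difference(KmerStore a, KmerStore b) {
        Set<String> result = a.kmers();
        result.removeAll(b.kmers());
        return result;
    }
}
```

Indexing a pathogen sequence and a host sequence into two stores and calling `difference(pathogen, host)` gives the pathogen-only k-mers; because the callers only see `KmerStore`, the map could be swapped for another backend without touching them.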
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mark
>>>>>
>>>>> card.ly: <http://card.ly/phidias51>
>>>>>
>>>>>
>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>>> <vishalthapar at gmail.com> wrote:
>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> This is good to have. I feel that including it as part of core may
>>>>>> not be necessary, but having it as part of a Genomic module in
>>>>>> biojava3 would be nice. There is a project, Bioinformatica
>>>>>>
>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>>>
>>>>>> which does something similar, although not exactly. It counts the
>>>>>> k-mers in a given fasta file, but it does not count k-mers for each
>>>>>> sequence within the file, just all within a file. This is a good
>>>>>> feature to have, especially if one is trying to find patterns within
>>>>>> sequences, which is what I am trying to do. It would most certainly
>>>>>> be helpful to have a k-mer counting algorithm that counts k-mer
>>>>>> frequency for each sequence. The way to go would be to use suffix
>>>>>> trees. Again, I don't know whether biojava has a suffix tree API,
>>>>>> since I haven't used java in a while and am just switching back to
>>>>>> it. A paper on using suffix trees to generate genome-wide k-mer
>>>>>> frequencies is:
>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.;
>>>>>> the software is Tallymer). It would be some work to implement this
>>>>>> in java as a module for biojava3, but I can see that it would be
>>>>>> helpful. Again, for small fasta files it might not be efficient to
>>>>>> create a suffix tree, but for bigger files I think that might be the
>>>>>> way to go.
>>>>>>
>>>>>> That's just my two cents. What do you think?
>>>>>>
>>>>>> -vishal
>>>>>>
>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>
>>>>>>> Hi Vishal,
>>>>>>>
>>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>>> BioJava
>>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>>
>>>>>>> public static void main(String[] args) {
>>>>>>>   DNASequence d = new DNASequence("ATGATC");
>>>>>>>   System.out.println("Non-Overlap");
>>>>>>>   nonOverlap(d);
>>>>>>>   System.out.println("Overlap");
>>>>>>>   overlap(d);
>>>>>>> }
>>>>>>>
>>>>>>> public static final int KMER = 3;
>>>>>>>
>>>>>>> //Generate triplets overlapping
>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>   //One windowed view per starting offset (positions are 1-based)
>>>>>>>   for(int i=1; i<=KMER; i++) {
>>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>       i, d.getLength());
>>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>     l.add(w);
>>>>>>>   }
>>>>>>>
>>>>>>>   //Will return ATG, ATC, TGA & GAT
>>>>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>>>       System.out.println(subList);
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>   WindowedSequence<NucleotideCompound> w =
>>>>>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>   //Will return ATG & ATC
>>>>>>>   for(List<NucleotideCompound> subList: w) {
>>>>>>>     System.out.println(subList);
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>>> of Compounds, so kmer generation can/will be a memory-intensive
>>>>>>> operation. This does not mean it has to be, since sub sequences are
>>>>>>> thin wrappers around an underlying sequence. Also, the overlap
>>>>>>> solution is non-optimal since it iterates through each window in turn
>>>>>>> rather than stepping through the sequence one base at a time (hence
>>>>>>> why we get ATG & ATC before TGA).
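
For comparison, stepping through the sequence one base at a time yields the overlapping k-mers in positional order. The sketch below works on plain Strings rather than BioJava sequence views, so it illustrates the iteration order only and not the memory behaviour discussed above; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingKmers {

    /** Overlapping k-mers in positional order: the window advances one base per step. */
    static List<String> overlapping(String seq, int k) {
        return windows(seq, k, 1);
    }

    /** Non-overlapping k-mers: the window advances k bases per step. */
    static List<String> nonOverlapping(String seq, int k) {
        return windows(seq, k, k);
    }

    // Shared helper: collects every full window of length k, moving by 'step'.
    // A trailing partial window (shorter than k) is dropped.
    private static List<String> windows(String seq, int k, int step) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + k <= seq.length(); i += step) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }
}
```

For "ATGATC" with k = 3, `overlapping` gives ATG, TGA, GAT, ATC, and `nonOverlapping` gives ATG, ATC.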
>>>>>>>
>>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>> (prefix tree).
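
A rough idea of what a Trie-based approach to unique k-mers might look like is given below. It is a minimal sketch on plain Strings, assuming overlapping k-mers; `KmerTrie` and its methods are hypothetical and not an existing BioJava class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KmerTrie {

    private static final class Node {
        // TreeMap keeps children sorted, so traversal is alphabetical.
        final Map<Character, Node> children = new TreeMap<>();
        int count; // number of inserted k-mers that end at this node
    }

    private final Node root = new Node();
    private int unique;

    /** Inserts one k-mer, creating nodes along its path as needed. */
    void add(String kmer) {
        Node n = root;
        for (int i = 0; i < kmer.length(); i++) {
            n = n.children.computeIfAbsent(kmer.charAt(i), c -> new Node());
        }
        if (n.count == 0) {
            unique++; // first time this k-mer has been seen
        }
        n.count++;
    }

    /** Occurrences of the given k-mer, or 0 if it was never inserted. */
    int count(String kmer) {
        Node n = root;
        for (int i = 0; i < kmer.length() && n != null; i++) {
            n = n.children.get(kmer.charAt(i));
        }
        return n == null ? 0 : n.count;
    }

    /** Number of distinct k-mers inserted. */
    int uniqueCount() {
        return unique;
    }

    /** Distinct k-mers in alphabetical order. */
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node n, StringBuilder prefix, List<String> out) {
        if (n.count > 0) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    /** Builds a trie from every overlapping k-mer of seq. */
    static KmerTrie fromSequence(String seq, int k) {
        KmerTrie trie = new KmerTrie();
        for (int i = 0; i + k <= seq.length(); i++) {
            trie.add(seq.substring(i, i + k));
        }
        return trie;
    }
}
```

K-mers sharing a prefix share nodes, which is where the structure can save space over a flat map when many k-mers overlap in their leading bases.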
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>>>>> or do k-mer counting for a given Fasta sequence / file? Basically, I
>>>>>>>> want k-mer counts for every sequence in a fasta file. If something
>>>>>>>> like this exists, it would save me the time of writing the code.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Vishal
>>>>>>>> _______________________________________________
>>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>>>>>> --
>>>>>>> Andrew Yates Ensembl Genomes Engineer
>>>>>>> EMBL-EBI Tel: +44-(0)1223-492538
>>>>>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
>>>>>>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
>>>>>>
>>>>>> --
>>>>>> *Vishal Thapar, Ph.D.*
>>>>>> *Scientific informatics Analyst
>>>>>> Cold Spring Harbor Lab
>>>>>> Quick Bldg, Lowe Lab
>>>>>> 1 Bungtown Road
>>>>>> Cold Spring Harbor, NY - 11724*
--
Andrew Yates Ensembl Genomes Engineer
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/