[Biojava-l] K-mers

Sat Oct 30 10:50:48 UTC 2010

I just kicked off a new build; alpha4 should be on the servers
shortly. You don't need CruiseControl for a release. Anybody with an
ssh account on portal.open-bio (and correctly configured ssh keys) can run

mvn release:clean release:prepare release:perform

A

On Sat, Oct 30, 2010 at 5:20 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> You should be aware that I just found a bug in the code. It has been fixed, but the bug will still be in the alpha3 release. I would recommend either building a version yourself or, if Andreas can post the continuous integration server address, picking up the build from there; there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
>> That is good news. Thanks for the directions, Andy.
>>
>> I have already started on this. Let me analyze and write the code now.
>>
>> Maybe a deadline of next month is not unreachable in this case.
>>
>> Here we go!
>> JD
>>
>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> So we've got some basic kmer work now in SVN. If you look in the class
>>> SequenceMixin there are two static methods there for generating the two
>>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>>> the door open there for anyone else to come in & develop it. The k-mers are
>>> also not unique across the sequence but it's a start :)
>>>
>>> Share & enjoy!
>>>
>>> Andy
>>>
>>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>>>
>>>> I agree, Andy. This has become standard functionality that
>>>> scientists use these days. I am all for implementing it in BioJava3.
>>>> Java isn't that efficient for this kind of work, so we will surely
>>>> need more effort than the same would take in Python/Perl.
>>>>
>>>> Regards,
>>>> Jitesh Dundas
>>>>
>>>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> So if it's a suffix tree, that's quite a fixed data structure, so
>>>>> developing a pluggable mechanism there would be hard. I think there
>>>>> also has to be a limit as to what we can sensibly do. If people want
>>>>> to contribute this kind of work, though, then it'll all be very well
>>>>> received (with the corresponding test environment/cases of course).
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Andy
>>>>>
>>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>>>
>>>>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>>>>> would allow a developer to use anything from a simple MultiMap to a
>>>>>> NoSQL key-value database to store K-mers. You could plug in custom map
>>>>>> implementations to keep a count of the number of instances of
>>>>>> particular K-mers that were found. It might also be useful to be able
>>>>>> to do set operations on those K-mer collections. You could use them to
>>>>>> determine which K-mers were present in a pathogen and not in a host.
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
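
A minimal sketch of the pluggable-storage idea above, in plain Java on
Strings and independent of BioJava. KmerStore, MapKmerStore and KmerSets
are hypothetical names made up for illustration; nothing here is an
existing BioJava API. A NoSQL-backed store would only need to implement
the same three methods.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical storage abstraction (an assumption, not part of BioJava). */
interface KmerStore {
    void add(String kmer);
    int count(String kmer);
    Set<String> kmers();
}

/** Simple in-memory store that keeps one counter per distinct k-mer. */
final class MapKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override public void add(String kmer) { counts.merge(kmer, 1, Integer::sum); }
    @Override public int count(String kmer) { return counts.getOrDefault(kmer, 0); }
    @Override public Set<String> kmers() { return counts.keySet(); }
}

final class KmerSets {
    private KmerSets() {}

    /** Feeds every overlapping k-mer of seq into the given store. */
    static KmerStore index(String seq, int k, KmerStore store) {
        for (int i = 0; i + k <= seq.length(); i++) {
            store.add(seq.substring(i, i + k));
        }
        return store;
    }

    /** K-mers present in store a but absent from store b (set difference). */
    static Set<String> onlyIn(KmerStore a, KmerStore b) {
        Set<String> result = new TreeSet<>(a.kmers()); // copy, so a is untouched
        result.removeAll(b.kmers());
        return result;
    }
}
```

Calling KmerSets.onlyIn(pathogenStore, hostStore) then gives the k-mers
seen in the pathogen but not in the host, whatever the backing store is.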
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> card.ly: <http://card.ly/phidias51>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>>>> <vishalthapar at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> This is good to have. I feel that including it as a part of core may
>>>>>>> not be necessary, but having it as part of the Genomic module in
>>>>>>> biojava3 will be nice. There is a project, Bioinformatica
>>>>>>>
>>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>>>>
>>>>>>> which does something similar, although not exactly. It counts the
>>>>>>> k-mers in a given fasta file, but it does not count k-mers for each
>>>>>>> sequence within the file, just all within a file. This is a good
>>>>>>> feature to have, especially if one is trying to find patterns within
>>>>>>> sequences, which is what I am trying to do. It would most certainly be
>>>>>>> helpful to have a k-mer counting algorithm that counts k-mer frequency
>>>>>>> for each sequence. The way to go would be to use suffix trees. Again,
>>>>>>> I don't know if biojava has a suffix tree api or not, since I haven't
>>>>>>> used java in a while and am just switching back to it. A paper on
>>>>>>> using suffix trees to generate genome-wide k-mer frequencies is
>>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.;
>>>>>>> the software is Tallymer). It would be some work to implement this in
>>>>>>> java as a module for biojava3, but I can see that this will be
>>>>>>> helpful. Again, for small fasta files it might not be efficient to
>>>>>>> create a suffix tree, but for bigger files I think that might be the
>>>>>>> way to go.
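
For the small-file case mentioned above, a rough sketch of per-sequence
k-mer counting in plain Java. Note that this swaps in a simple hash/tree
map per record rather than a suffix tree, and uses a deliberately naive
FASTA parser; FastaKmerCounter and its methods are hypothetical names,
not BioJava classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class FastaKmerCounter {
    private FastaKmerCounter() {}

    /** Counts overlapping k-mers in a single sequence. */
    static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** Parses FASTA text and returns k-mer counts keyed by record header. */
    static Map<String, Map<String, Integer>> countPerRecord(String fasta, int k) {
        // First pass: join the (possibly multi-line) residues of each record.
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : fasta.split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                current = new StringBuilder();
                records.put(line.substring(1), current);
            } else if (current != null) {
                current.append(line.toUpperCase());
            }
        }
        // Second pass: count k-mers separately for every record.
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : records.entrySet()) {
            result.put(e.getKey(), countKmers(e.getValue().toString(), k));
        }
        return result;
    }
}
```

Memory grows with the number of distinct k-mers per record, which is why
a suffix-tree or similar index becomes attractive for genome-sized input.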
>>>>>>>
>>>>>>> That's just my two cents. What do you think?
>>>>>>>
>>>>>>> -vishal
>>>>>>>
>>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>>>> BioJava
>>>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>>>
>>>>>>>> // Imports this snippet needs. The BioJava package names are those of
>>>>>>>> // the 3.0 release and are an assumption for the alpha builds.
>>>>>>>> import java.util.ArrayList;
>>>>>>>> import java.util.List;
>>>>>>>> import org.biojava3.core.sequence.DNASequence;
>>>>>>>> import org.biojava3.core.sequence.compound.NucleotideCompound;
>>>>>>>> import org.biojava3.core.sequence.template.Sequence;
>>>>>>>> import org.biojava3.core.sequence.template.SequenceView;
>>>>>>>> import org.biojava3.core.sequence.views.WindowedSequence;
>>>>>>>>
>>>>>>>> public static void main(String[] args) {
>>>>>>>>     DNASequence d = new DNASequence("ATGATC");
>>>>>>>>     System.out.println("Non-Overlap");
>>>>>>>>     nonOverlap(d);
>>>>>>>>     System.out.println("Overlap");
>>>>>>>>     overlap(d);
>>>>>>>> }
>>>>>>>>
>>>>>>>> // Window (k-mer) size
>>>>>>>> public static final int KMER = 3;
>>>>>>>>
>>>>>>>> // Generate triplets overlapping
>>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>>     List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>>             new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>>     // One windowed view per frame offset (positions are 1-based)
>>>>>>>>     for (int i = 1; i <= KMER; i++) {
>>>>>>>>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>>                 i, d.getLength());
>>>>>>>>         WindowedSequence<NucleotideCompound> w =
>>>>>>>>                 new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>>         l.add(w);
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     // Will return ATG, ATC, TGA & GAT (frame by frame, not by position)
>>>>>>>>     for (WindowedSequence<NucleotideCompound> w : l) {
>>>>>>>>         for (List<NucleotideCompound> subList : w) {
>>>>>>>>             System.out.println(subList);
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> // Generate triplet Compound lists non-overlapping
>>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>>>             new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>>     // Will return ATG & ATC
>>>>>>>>     for (List<NucleotideCompound> subList : w) {
>>>>>>>>         System.out.println(subList);
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>>>> of Compounds, so kmer generation can/will be a memory intensive
>>>>>>>> operation. This does not mean it has to be, since sub sequences are
>>>>>>>> thin wrappers around an underlying sequence. Also, the overlap
>>>>>>>> solution is non-optimal since it iterates through each window in turn
>>>>>>>> rather than stepping through the sequence one base at a time (hence
>>>>>>>> why we get ATG & ATC before TGA).
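
To illustrate the "one base at a time" alternative described above, a
small sketch in plain Java on Strings (not BioJava Sequence objects).
SlidingKmers is a hypothetical class name; it hands out k-mers lazily
and in positional order instead of building every window list up front.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Lazily yields overlapping k-mers in positional order. */
final class SlidingKmers implements Iterable<String> {
    private final String seq;
    private final int k;

    SlidingKmers(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.seq = seq;
        this.k = k;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int start = 0; // 0-based start of the next window

            @Override
            public boolean hasNext() {
                return start + k <= seq.length();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String kmer = seq.substring(start, start + k);
                start++; // advance by a single base
                return kmer;
            }
        };
    }

    /** Convenience helper that collects all k-mers into a list. */
    static List<String> toList(String seq, int k) {
        List<String> out = new ArrayList<>();
        for (String kmer : new SlidingKmers(seq, k)) {
            out.add(kmer);
        }
        return out;
    }
}
```

For "ATGATC" with k = 3 this yields ATG, TGA, GAT, ATC, that is, the
same four triplets as the windowed version but in sequence order.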
>>>>>>>>
>>>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>>> (prefix tree).
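
A bare-bones sketch of what a Trie-based collection of unique k-mers
could look like, again in plain Java on Strings. KmerTrie is a
hypothetical class written for this illustration, not something that
exists in BioJava, and it makes no attempt at memory tuning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Prefix tree holding each distinct k-mer once. */
final class KmerTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if a k-mer ends at this node
    }

    private final Node root = new Node();
    private int size;

    /** Inserts a k-mer; returns true if it had not been seen before. */
    boolean add(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            node = node.children.computeIfAbsent(kmer.charAt(i), c -> new Node());
        }
        boolean isNew = !node.terminal;
        node.terminal = true;
        if (isNew) {
            size++;
        }
        return isNew;
    }

    /** Number of distinct k-mers stored. */
    int size() {
        return size;
    }

    /** All stored k-mers in lexicographic order. */
    List<String> kmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.terminal) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    /** Builds a trie from every overlapping k-mer of seq. */
    static KmerTrie of(String seq, int k) {
        KmerTrie trie = new KmerTrie();
        for (int i = 0; i + k <= seq.length(); i++) {
            trie.add(seq.substring(i, i + k));
        }
        return trie;
    }
}
```

K-mers that share a prefix share nodes, so repeated triplets such as the
ATG in "ATGATGATC" are stored only once.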
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I had a quick question: Does Biojava have a method to generate
>>>>>>>>> k-mers or do K-mer counting in a given Fasta sequence / file?
>>>>>>>>> Basically, I want k-mer counts for every sequence in a fasta file.
>>>>>>>>> If something like this exists it would save me some time writing
>>>>>>>>> the code.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Vishal
>>>>>>>>> _______________________________________________
>>>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Vishal Thapar, Ph.D.*
>>>>>>> *Scientific informatics Analyst
>>>>>>> Cold Spring Harbor Lab
>>>>>>> Quick Bldg, Lowe Lab
>>>>>>> 1 Bungtown Road
>>>>>>> Cold Spring Harbor, NY - 11724*

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------