[Biojava-l] K-mers

jitesh dundas jbdundas at gmail.com
Fri Oct 29 10:00:44 UTC 2010


Dear Sir,

Is there any way to detect patterns in the recorded k-mers?

I have a large set of miRNAs (a study of mutations and patterns in
gastric cancer). I made a record of the k-mers for each sequence, but the
patterns that are generated are difficult to track.

Can BioJava do this? Regular expressions in Java may be useful here.

I would appreciate expert advice on this, and on any other software that
might be useful.

Thanks,
Jitesh Dundas
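
As a rough illustration of the regular-expression idea above, here is a
minimal sketch in plain Java (java.util.regex only, no BioJava). The class
name MotifScan, the method findMotif, and the example sequence and motif are
all hypothetical, chosen only for illustration. The pattern is wrapped in a
lookahead so that overlapping occurrences are reported as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of BioJava.
final class MotifScan {

    // Returns the 0-based start positions of every match of the motif,
    // including matches that overlap each other.
    static List<Integer> findMotif(String sequence, String motifRegex) {
        // A zero-width lookahead lets the matcher advance one base at a
        // time, so a match starting inside a previous match is still found.
        Pattern p = Pattern.compile("(?=(" + motifRegex + "))");
        Matcher m = p.matcher(sequence);
        List<Integer> starts = new ArrayList<Integer>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Example miRNA-like sequence and motif (both made up for the demo).
        System.out.println(findMotif("UGAGGUAGUAGGUUGU", "GUAG"));
    }
}
```

The same method could be applied to each recorded k-mer, or to the full
sequences, depending on which patterns are of interest.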

On 10/29/10, jitesh dundas <jbdundas at gmail.com> wrote:
> Dear Friends,
>
> Thanks to Vishal & Andy for this. I actually needed this code too.
> Vishal, I think Andy's suggestions may be a good option to include in
> BioJava 3. Would you like to add this to BioJava 3?
>
> Thanks again.
>
> Regards,
> Jitesh Dundas
>
> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Vishal,
>>
>> As far as I am aware there is nothing which will generate them in BioJava
>> at the moment. However, it is possible to do it with BioJava3:
>>
>> public static void main(String[] args) {
>>     DNASequence d = new DNASequence("ATGATC");
>>     System.out.println("Non-Overlap");
>>     nonOverlap(d);
>>     System.out.println("Overlap");
>>     overlap(d);
>> }
>>
>> public static final int KMER = 3;
>>
>> //Generate triplets overlapping
>> public static void overlap(Sequence<NucleotideCompound> d) {
>>     List<WindowedSequence<NucleotideCompound>> l =
>>             new ArrayList<WindowedSequence<NucleotideCompound>>();
>>     for(int i=1; i<=KMER; i++) {
>>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>                 i, d.getLength());
>>         WindowedSequence<NucleotideCompound> w =
>>             new WindowedSequence<NucleotideCompound>(sub, KMER);
>>         l.add(w);
>>     }
>>
>>     //Will return ATG, ATC, TGA & GAT
>>     for(WindowedSequence<NucleotideCompound> w: l) {
>>         for(List<NucleotideCompound> subList: w) {
>>             System.out.println(subList);
>>         }
>>     }
>> }
>>
>> //Generate triplet Compound lists non-overlapping
>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>     WindowedSequence<NucleotideCompound> w =
>>             new WindowedSequence<NucleotideCompound>(d, KMER);
>>     //Will return ATG & ATC
>>     for(List<NucleotideCompound> subList: w) {
>>         System.out.println(subList);
>>     }
>> }
>>
>> The disadvantage of all of these solutions is that they generate lists of
>> Compounds, so k-mer generation can/will be a memory-intensive operation.
>> This does not mean it has to be, since subsequences are thin wrappers
>> around an underlying sequence. Also, the overlap solution is non-optimal
>> since it iterates through each window in turn rather than stepping through
>> the sequence one base at a time (hence why we get ATG & ATC before TGA).
>>
>> As for unique k-mers that's something which would require a bit more
>> engineering & would be better suited to a solution built around a Trie
>> (prefix tree).
>>
>> Hope this helps,
>>
>> Andy
>>
>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>
>>> Hi All,
>>>
>>> I had a quick question: Does BioJava have a method to generate k-mers or
>>> do k-mer counting for a given FASTA sequence/file? Basically, I want k-mer
>>> counts for every sequence in a FASTA file. If something like this exists,
>>> it would save me the time of writing the code.
>>>
>>> Thanks,
>>>
>>> Vishal
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>
>
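
For the k-mer counts asked about at the start of the thread, and the point
above about stepping through the sequence one base at a time, a minimal
sketch in plain Java might look like the following. It works on plain
Strings rather than BioJava Sequence objects, and the class name KmerCounts
and the method count are hypothetical. A map keyed by k-mer is used here
instead of the Trie suggested above; the map's keys are the unique k-mers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of BioJava.
final class KmerCounts {

    // Counts every overlapping k-mer by advancing one base at a time,
    // so k-mers are seen in sequence order (ATG, TGA, GAT, ATC for ATGATC).
    static Map<String, Integer> count(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // LinkedHashMap keeps the k-mers in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            String kmer = sequence.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same example sequence and k as in the BioJava3 snippet above.
        System.out.println(count("ATGATC", 3));
    }
}
```

Calling count once per record of a FASTA file would give per-sequence counts.
Each k-mer is copied into its own String here, so for very large inputs a
more compact encoding or a Trie may be preferable.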
