[Biojava-l] K-mers

Andy Yates ayates at ebi.ac.uk
Fri Oct 29 10:09:11 UTC 2010


One of the disadvantages of the Sequence-based system is that we have no support for searching in sequences with patterns such as regular expressions. It is possible to convert a Sequence into a String & then apply the expression, but that is a sub-optimal solution since it copies the whole sequence into memory first.

Looking at the Pattern code in Java 6, it can take in a CharSequence, so one could write an adaptor that makes a Sequence act as a CharSequence for the matching procedure, but really it looks like a lot of work.
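
For what it's worth, here is a minimal sketch of what such an adaptor might look like. It is only an illustration under stated assumptions: SimpleSequence is a hypothetical stand-in with 1-based, single-character access and is not the BioJava Sequence interface; SequenceCharView, SequenceRegexDemo, of and findAll are likewise names made up for this example. Only java.util.regex is real API here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a BioJava-style sequence (NOT the real
// BioJava Sequence interface): 1-based access to one-letter compounds.
interface SimpleSequence {
    int getLength();
    char getBaseAt(int position); // position is 1-based
}

// Adaptor that lets java.util.regex read a SimpleSequence directly,
// without first copying the whole sequence into a String.
final class SequenceCharView implements CharSequence {
    private final SimpleSequence seq;
    private final int start; // 0-based, inclusive
    private final int end;   // 0-based, exclusive

    SequenceCharView(SimpleSequence seq) {
        this(seq, 0, seq.getLength());
    }

    private SequenceCharView(SimpleSequence seq, int start, int end) {
        this.seq = seq;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Translate the 0-based regex index to a 1-based sequence position
        return seq.getBaseAt(start + index + 1);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        // Another thin view over the same sequence; nothing is copied
        return new SequenceCharView(seq, start + from, start + to);
    }

    @Override
    public String toString() {
        // Only here are the characters actually materialised
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}

class SequenceRegexDemo {
    // Wraps a String as a SimpleSequence, purely for demonstration
    static SimpleSequence of(final String bases) {
        return new SimpleSequence() {
            public int getLength() { return bases.length(); }
            public char getBaseAt(int position) { return bases.charAt(position - 1); }
        };
    }

    // Returns the 1-based start position of every non-overlapping match
    static List<Integer> findAll(Pattern pattern, SimpleSequence s) {
        List<Integer> hits = new ArrayList<>();
        Matcher m = pattern.matcher(new SequenceCharView(s));
        while (m.find()) {
            hits.add(m.start() + 1);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prints [1, 4]: ATG starts at position 1 and ATC at position 4
        System.out.println(findAll(Pattern.compile("AT[GC]"), of("ATGATC")));
    }
}
```

The view only translates indexes, so the matcher reads bases on demand; the cost is one virtual call per character, which may or may not matter depending on how the underlying sequence stores its data.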

As for a way of matching against sequences, HMMER3 is awesome :)

Andy

On 29 Oct 2010, at 11:00, jitesh dundas wrote:

> Dear Sir,
> 
> Is there any way to detect patterns in the recorded k-mers?
> 
> I have a large set of miRNAs (a study of mutations and patterns for
> gastric cancer). I made a record of k-mers for each sequence, but the
> patterns that are generated are difficult to track.
> 
> Can BioJava do this? Regular expressions in Java may be useful here.
> 
> I request expert advice on this, and on any other software that might be useful.
> 
> Thanks,
> Jitesh Dundas
> 
> On 10/29/10, jitesh dundas <jbdundas at gmail.com> wrote:
>> Dear Friends,
>> 
>> Thanks to Vishal & Andy for this. I actually needed this code too.
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to BioJava 3?
>> 
>> Thanks again.
>> 
>> Regards,
>> Jitesh Dundas
>> 
>> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Hi Vishal,
>>> 
>>> As far as I am aware, there is nothing which will generate them in BioJava
>>> at the moment. However, it is possible to do it with BioJava3:
>>> 
>>> public static void main(String[] args) {
>>>    DNASequence d = new DNASequence("ATGATC");
>>>    System.out.println("Non-Overlap");
>>>    nonOverlap(d);
>>>    System.out.println("Overlap");
>>>    overlap(d);
>>> }
>>> 
>>> public static final int KMER = 3;
>>> 
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>    List<WindowedSequence<NucleotideCompound>> l =
>>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>    for(int i=1; i<=KMER; i++) {
>>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>                i, d.getLength());
>>>        WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>        l.add(w);
>>>    }
>>> 
>>>    //Will return ATG, ATC, TGA & GAT
>>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>>        for(List<NucleotideCompound> subList: w) {
>>>            System.out.println(subList);
>>>        }
>>>    }
>>> }
>>> 
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>    WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>>    //Will return ATG & ATC
>>>    for(List<NucleotideCompound> subList: w) {
>>>        System.out.println(subList);
>>>    }
>>> }
>>> 
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds, so k-mer generation can/will be a memory-intensive operation.
>>> This does not mean it has to be, since sub-sequences are thin wrappers
>>> around an underlying sequence. Also, the overlap solution is non-optimal
>>> since it iterates through each window in turn rather than stepping through
>>> the sequence one base at a time (which is why we get ATG & ATC before TGA).
>>> 
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>> 
>>> Hope this helps,
>>> 
>>> Andy
>>> 
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I had a quick question: does BioJava have a method to generate k-mers or
>>>> do k-mer counting for a given FASTA sequence / file? Basically, I want
>>>> k-mer counts for every sequence in a FASTA file. If something like this
>>>> exists, it would save me some time writing the code.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/