[Biojava-l] K-mers

Andy Yates ayates at ebi.ac.uk
Fri Oct 29 09:20:36 UTC 2010


Okay, a couple of points here:

1). Which biojava3 module? This sounds like something for the genomic module rather than core.

2). It'll need some more work. I'm not happy about using the WindowedSequenceView in its current state. I think altering it so that it no longer builds Lists would be a good idea (recent developments in the API around its main use make this a viable change). Also, it should return the overlapping windows in base order, i.e. 1->3, 2->4, not 1->3, 4->6.
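
To illustrate the behaviour point 2 asks for, here is a minimal sketch. It is only a sketch under some assumptions: it works on a plain CharSequence rather than a BioJava Sequence, and the names KmerWindows and overlapping are invented for this example and are not part of any BioJava module. Each window is a read-only CharBuffer backed by the original sequence, so no per-window List is built, and windows come out in base order (1->3, 2->4, ...).

```java
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper, not BioJava API: overlapping k-mers in base order.
final class KmerWindows {

    private KmerWindows() {}

    // Lazily yields the windows starting at base 1, 2, 3, ... of seq.
    static Iterable<CharSequence> overlapping(final CharSequence seq, final int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        return new Iterable<CharSequence>() {
            @Override
            public Iterator<CharSequence> iterator() {
                return new Iterator<CharSequence>() {
                    private int start = 0; // 0-based start of the next window

                    @Override
                    public boolean hasNext() {
                        return start + k <= seq.length();
                    }

                    @Override
                    public CharSequence next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        // Read-only buffer backed by seq, limited to this window.
                        CharSequence window = CharBuffer.wrap(seq, start, start + k);
                        start++;
                        return window;
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

    public static void main(String[] args) {
        // Prints ATG, TGA, GAT, ATC (one per line).
        for (CharSequence kmer : overlapping("ATGATC", 3)) {
            System.out.println(kmer);
        }
    }
}
```

The iterator holds only a start offset, so memory use does not grow with the number of windows; a caller that needs to keep a window beyond the loop can call toString() on it.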

Comments?

Andy

On 29 Oct 2010, at 10:12, jitesh dundas wrote:

> Dear Friends,
> 
> Thanks to Vishal & Andy for this. I actually needed this code too.
> Vishal, I think Andy's suggestions may be a good option to include in
> BioJava 3. Would you like to add this to BioJava 3?
> 
> Thanks again.
> 
> Regards,
> Jitesh Dundas
> 
> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Vishal,
>> 
>> As far as I am aware there is nothing which will generate them in BioJava at
>> the moment. However it is possible to do it with BioJava3:
>> 
>> public static void main(String[] args) {
>>    DNASequence d = new DNASequence("ATGATC");
>>    System.out.println("Non-Overlap");
>>    nonOverlap(d);
>>    System.out.println("Overlap");
>>    overlap(d);
>> }
>> 
>> public static final int KMER = 3;
>> 
>> //Generate triplets overlapping
>> public static void overlap(Sequence<NucleotideCompound> d) {
>>    List<WindowedSequence<NucleotideCompound>> l =
>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>    for(int i=1; i<=KMER; i++) {
>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>                i, d.getLength());
>>        WindowedSequence<NucleotideCompound> w =
>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>        l.add(w);
>>    }
>> 
>>    //Will return ATG, ATC, TGA & GAT
>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>        for(List<NucleotideCompound> subList: w) {
>>            System.out.println(subList);
>>        }
>>    }
>> }
>> 
>> //Generate triplet Compound lists non-overlapping
>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>    WindowedSequence<NucleotideCompound> w =
>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>    //Will return ATG & ATC
>>    for(List<NucleotideCompound> subList: w) {
>>        System.out.println(subList);
>>    }
>> }
>> 
>> The disadvantage of all of these solutions is that they generate lists of
>> Compounds, so k-mer generation can/will be a memory-intensive operation. This
>> doesn't mean it has to be, since sub sequences are thin wrappers around an
>> underlying sequence. Also, the overlap solution is non-optimal since it
>> iterates through each windowed sequence in full rather than stepping through
>> the sequence and delegating to each base in turn (which is why we get
>> ATG & ATC before TGA).
>> 
>> As for unique k-mers that's something which would require a bit more
>> engineering & would be better suited to a solution built around a Trie
>> (prefix tree).
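
As a rough illustration of the trie idea for unique k-mers, the sketch below counts how often each k-mer occurs. It is not BioJava code: KmerTrie, addAll, count and uniqueCount are names invented for this example, it assumes an uppercase A/C/G/T alphabet, and it simply skips any window containing another character (such as N).

```java
// Hypothetical sketch, not BioJava API: a fixed-depth trie over A, C, G, T
// that records how many times each k-mer has been seen.
final class KmerTrie {

    private static final class Node {
        final Node[] children = new Node[4]; // one slot per base
        int count;                           // occurrences; used at depth k only
    }

    private final Node root = new Node();
    private final int k;
    private int unique; // number of distinct k-mers seen so far

    KmerTrie(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    // Maps a base to a child slot; -1 for anything outside A/C/G/T (assumption).
    private static int index(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    // Adds every overlapping k-mer of seq, in base order.
    void addAll(CharSequence seq) {
        for (int start = 0; start + k <= seq.length(); start++) {
            add(seq, start);
        }
    }

    private void add(CharSequence seq, int start) {
        // Validate first so a rejected window leaves no partial path behind.
        for (int i = 0; i < k; i++) {
            if (index(seq.charAt(start + i)) < 0) {
                return;
            }
        }
        Node node = root;
        for (int i = 0; i < k; i++) {
            int idx = index(seq.charAt(start + i));
            if (node.children[idx] == null) {
                node.children[idx] = new Node();
            }
            node = node.children[idx];
        }
        if (node.count++ == 0) {
            unique++; // first time this k-mer has been seen
        }
    }

    // Returns how many times kmer was added, or 0 if it was never seen.
    int count(CharSequence kmer) {
        if (kmer.length() != k) {
            return 0;
        }
        Node node = root;
        for (int i = 0; i < k && node != null; i++) {
            int idx = index(kmer.charAt(i));
            if (idx < 0) {
                return 0;
            }
            node = node.children[idx];
        }
        return node == null ? 0 : node.count;
    }

    int uniqueCount() {
        return unique;
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie(3);
        trie.addAll("ATGATGC"); // k-mers: ATG, TGA, GAT, ATG, TGC
        System.out.println(trie.uniqueCount()); // 4
        System.out.println(trie.count("ATG"));  // 2
    }
}
```

K-mers that share a prefix share nodes, which is where the saving over a flat list of windows comes from; for per-sequence counts from a FASTA file, one trie per record would be one way to use it.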
>> 
>> Hope this helps,
>> 
>> Andy
>> 
>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>> 
>>> Hi All,
>>> 
>>> I had a quick question: does BioJava have a method to generate k-mers or
>>> count k-mers in a given FASTA sequence / file? Basically, I want k-mer
>>> counts for every sequence in a FASTA file. If something like this exists,
>>> it would save me the time of writing the code.
>>> 
>>> Thanks,
>>> 
>>> Vishal
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/








