[Biojava-dev] EnsemblApi use case for DNASequences

Andy Yates ayates at ebi.ac.uk
Thu May 13 13:31:15 UTC 2010


Not at the moment. The 2bit implementation has a worker and has been built with the idea that it _could_ be extended to, as you say, a 4bit implementation. If one were written, I wouldn't restrict it to just DNA or RNA; it could serve any CompoundSet with 16 or fewer compounds.
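
To make the 4bit idea concrete, here is a rough sketch of how up to 16 compounds could be packed four bits apiece. The class FourBitStore and all of its methods are hypothetical and are not part of BioJava. It also assumes each compound is a single char and that its 4-bit code is simply its position in the alphabet, rather than the bitmask scheme (N as 1111, gap as 0000) suggested in the quoted message.

```java
/** Hypothetical 4-bit packed store for illustration; not a BioJava class. */
final class FourBitStore {
    private static final int BITS = 4;
    private static final int PER_INT = 32 / BITS; // 8 compounds per int
    private static final int MASK = 0xF;

    private final char[] alphabet; // position in this array is the 4-bit code
    private final int[] packed;    // 8 compounds stored in each int
    private final int length;

    FourBitStore(String alphabet, String sequence) {
        if (alphabet.length() > 16) {
            throw new IllegalArgumentException("At most 16 compounds fit in 4 bits");
        }
        this.alphabet = alphabet.toCharArray();
        this.length = sequence.length();
        this.packed = new int[(length + PER_INT - 1) / PER_INT];
        for (int i = 0; i < length; i++) {
            int code = alphabet.indexOf(sequence.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("Unknown compound: " + sequence.charAt(i));
            }
            // Shift the code into its 4-bit slot within the int.
            packed[i / PER_INT] |= code << ((i % PER_INT) * BITS);
        }
    }

    /** Decodes the compound at the given 0-based position. */
    char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("Index: " + index);
        }
        int code = (packed[index / PER_INT] >>> ((index % PER_INT) * BITS)) & MASK;
        return alphabet[code];
    }

    int length() {
        return length;
    }

    /** Number of ints backing the sequence. */
    int storageInts() {
        return packed.length;
    }
}
```

For example, new FourBitStore("-ACGTN", "ACGTN-NNAC") would hold ten compounds, gap and N included, in two ints. Because the alphabet is just a parameter, the same packing would work for any set of 16 or fewer compounds.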

Andy

On 13 May 2010, at 14:20, Peter wrote:

> On Thu, May 13, 2010 at 1:38 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> 
>> 
>> As you said at the end of your email the best way to accomplish this
>> is by creating a SequenceProxyReader which can do all this logic
>> and lets you work with the "right" objects and not have to re-implement
>> that code. Now this leaves a few alternatives to how you can represent
>> this in memory. We already have a 2bit implementation (will be called
>> TwoBitSequenceReader) for storing very large pieces of Sequence
>> but that only has support for ACGT and no support for gaps or Ns.
>> This could be extended to bring in support for these as features or
>> you could materialise that sequence and then push it into another
>> Sequence object I have been working with (not yet checked in)
>> which lets you join Sequences together. This combined with a
>> Sequence which returns Compounds of a particular type e.g. Ns for
>> any given length would let you represent massive amounts of
>> Sequence in a very small amount of space. All of these updates
>> will be in place soon but I cannot say exactly when.
> 
> Does BioJava have a 4bit sequence implementation for ambiguous
> DNA (or RNA)? That would let you treat N as 1111 (all four bits set)
> and a gap as 0000 (none of the bits set).
> 
> Peter

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
