[Biojava-dev] EnsemblApi use case for DNASequences

Andy Yates ayates at ebi.ac.uk
Thu May 13 13:48:02 UTC 2010


I did like the UCSC .2bit format, but our implementation was originally written for a use case where someone has stored DNA in a DB using 2bit encoding but not as a .2bit file. The file format does handle gaps and Ns very well, but only because it stores where the runs of those features are.
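As a rough illustration of the "store the runs" idea, here is a minimal sketch of packing ACGT into two bits per base while keeping the runs of Ns in a side list. The class name TwoBitSketch and all of its members are hypothetical and are not part of BioJava or of any .2bit library. The base codes follow the UCSC ordering (T=0, C=1, A=2, G=3); packing N as code 0 is only a placeholder chosen for this sketch, and case is not preserved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: 2-bit packed DNA plus a side list of N runs.
final class TwoBitSketch {
    // UCSC .2bit base codes: T=0, C=1, A=2, G=3.
    private static final String BASES = "TCAG";

    final byte[] packed;      // four bases per byte, first base in the high bits
    final int length;         // number of bases in the sequence
    final List<int[]> nRuns;  // one {start, length} pair per run of Ns

    private TwoBitSketch(byte[] packed, int length, List<int[]> nRuns) {
        this.packed = packed;
        this.length = length;
        this.nRuns = nRuns;
    }

    static TwoBitSketch encode(String seq) {
        byte[] packed = new byte[(seq.length() + 3) / 4];
        List<int[]> runs = new ArrayList<>();
        int runStart = -1; // start of the N run currently open, or -1
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toUpperCase(seq.charAt(i));
            int code;
            if (c == 'N') {
                if (runStart < 0) {
                    runStart = i;
                }
                code = 0; // placeholder; the real position lives in nRuns
            } else {
                if (runStart >= 0) {
                    runs.add(new int[] {runStart, i - runStart});
                    runStart = -1;
                }
                code = BASES.indexOf(c);
                if (code < 0) {
                    throw new IllegalArgumentException(
                        "Unsupported base '" + c + "' at position " + i);
                }
            }
            packed[i / 4] |= code << (6 - 2 * (i % 4));
        }
        if (runStart >= 0) {
            runs.add(new int[] {runStart, seq.length() - runStart});
        }
        return new TwoBitSketch(packed, seq.length(), runs);
    }

    String decode() {
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> (6 - 2 * (i % 4))) & 0x3;
            out[i] = BASES.charAt(code);
        }
        // Overlay the recorded N runs on top of the placeholder bases.
        for (int[] run : nRuns) {
            Arrays.fill(out, run[0], run[0] + run[1], 'N');
        }
        return new String(out);
    }
}
```

The point of the design is that a long gap costs one {start, length} pair rather than widening every base beyond two bits.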

The 2bit sequence reader does not do that at the moment; however, it was developed with that kind of extension in mind. When it is unable to translate a String into a Compound, it triggers a method which normally throws an exception. It would be possible to override this method and provide functionality identical to .2bit, but the process of doing this scared me :). Plus there are other things to be getting on with, and I'm happy to leave it in a state which means it could be extended.
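The extension point described above might look roughly like the following sketch. The class names StrictTwoBitEncoder and NTolerantTwoBitEncoder and the method onUnknownBase are all hypothetical stand-ins; the real reader works with Compound objects and its actual method names are not shown here. The base class throws when it meets a base it cannot translate, and a subclass overrides that one method to record Ns and carry on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reader that only understands ACGT.
class StrictTwoBitEncoder {
    // Illustrative codes in UCSC order: T=0, C=1, A=2, G=3.
    private static final String BASES = "TCAG";

    int[] encode(String seq) {
        int[] codes = new int[seq.length()];
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            int code = BASES.indexOf(Character.toUpperCase(c));
            // Anything that cannot be translated is handed to the hook.
            codes[i] = (code >= 0) ? code : onUnknownBase(c, i);
        }
        return codes;
    }

    // Default behaviour: refuse anything outside ACGT.
    protected int onUnknownBase(char base, int position) {
        throw new IllegalArgumentException(
            "Cannot translate '" + base + "' at position " + position);
    }
}

// Hypothetical subclass: tolerate Ns by remembering where they were.
class NTolerantTwoBitEncoder extends StrictTwoBitEncoder {
    final List<Integer> nPositions = new ArrayList<>();

    @Override
    protected int onUnknownBase(char base, int position) {
        if (Character.toUpperCase(base) != 'N') {
            return super.onUnknownBase(base, position); // still an error
        }
        nPositions.add(position);
        return 0; // placeholder code; the position list marks it as N
    }
}
```

Calling new NTolerantTwoBitEncoder().encode("ACNT") would then succeed and leave position 2 in nPositions, whereas the strict encoder throws on the same input.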

Andy

On 13 May 2010, at 14:43, Thomas Down wrote:

> On Thu, May 13, 2010 at 1:38 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> We already have a 2bit implementation (will be called TwoBitSequenceReader) for storing very large pieces of Sequence but that only has support for ACGT and no support for gaps or Ns.
> 
> If you haven't already, I'd recommend taking a look at how the UCSC .2bit file format handles Ns.  Quite elegant, and seems to cover most genomic use cases very efficiently.  I've got a BioJava (1.x, I'm afraid) SequenceDB implementation that's backed by a .2bit file (in a MappedByteBuffer) if you're curious.
> 
>                Thomas.
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/







