[Biojava-l] What is the maximum length for a Sequence

Thomas Down td2@sanger.ac.uk
Thu, 10 Oct 2002 21:39:18 +0100


On Thu, Oct 10, 2002 at 03:09:07PM -0400, Sylvain Foisy wrote:
> 
> I wrote a program that would read and load huge genomic sequence files 
> (in my case, the H. sapiens genomic contigs from GenBank) into Sequences 
> (one per contig) to select sections of it for further analysis. It seems 
> to load the Sequence OK but when I am creating SubSequence objects from 
> this Sequence with SubSequence(Sequence, int start,int finish), I am 
> getting an error message.
> 
> If my start and/or my finish is above 230000, I get an 
> IndexArrayOutOfBoundException. Since I am quite new at programming, I 
> must either do something totally hideous or I do not have the right 
> approach of dealing with such a huge sequence.

The default in-memory sequence implementation in BioJava isn't
as memory-efficient as it could be (there's an alternative
packed implementation which uses 2 or 4 bits per DNA base,
but this isn't currently used by default).  However, it's
capable of handling sequences of many megabases, even on relatively
modest hardware.  I've loaded whole chromosomes on my home machine
without a glitch.
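
To make the packed idea concrete, here is a minimal sketch of 2-bit
packing in plain Java. It is not BioJava's packed implementation: the
class name PackedDnaSketch and its methods are made up for illustration,
and it only handles the four unambiguous bases (ambiguity symbols such
as n are what the 4-bit variant is for).

```java
/** Illustrative 2-bit DNA packing; hypothetical class, not part of BioJava. */
final class PackedDnaSketch {
    private static final String BASES = "acgt"; // index in this string = 2-bit code
    private final byte[] data;                  // four bases per byte
    private final int length;                   // number of bases stored

    PackedDnaSketch(String dna) {
        length = dna.length();
        data = new byte[(length + 3) / 4];      // round up to whole bytes
        for (int i = 0; i < length; i++) {
            int code = BASES.indexOf(Character.toLowerCase(dna.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException(
                        "Unsupported symbol at position " + (i + 1) + ": " + dna.charAt(i));
            }
            // Base i lives in byte i/4, at bit offset (i mod 4) * 2.
            data[i >> 2] |= (byte) (code << ((i & 3) * 2));
        }
    }

    int length() {
        return length;
    }

    int packedBytes() {
        return data.length;
    }

    /** Returns the base at a 1-based position. */
    char baseAt(int pos) {
        if (pos < 1 || pos > length) {
            throw new IndexOutOfBoundsException("Position " + pos + " not in 1.." + length);
        }
        int i = pos - 1;
        int code = (data[i >> 2] >> ((i & 3) * 2)) & 3;
        return BASES.charAt(code);
    }

    public static void main(String[] args) {
        PackedDnaSketch packed = new PackedDnaSketch("acgtacgtt");
        System.out.println(packed.length() + " bases in " + packed.packedBytes() + " bytes");
    }
}
```

At four bases per byte, a 100-megabase sequence needs roughly 25 MB in
this layout.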

In any case, the fact that you loaded the sequence without an
error suggests that the size of the sequence isn't the problem
in itself.  It sounds more like a bug in SubSequence, although
I don't know exactly where.  Another possibility is a bug in the
GenBank parser which is truncating the sequence.
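
One quick way to tell these two cases apart is to compare the length
of the loaded sequence with the coordinates you are asking for. The
sketch below is plain Java with a hypothetical helper (RangeCheckSketch
and describeProblem are names I've made up), and it assumes 1-based,
inclusive coordinates:

```java
/** Hypothetical helper for checking subsequence coordinates; not part of BioJava. */
final class RangeCheckSketch {

    /**
     * Returns null if start..end (1-based, inclusive) fits inside a sequence
     * of seqLength symbols, otherwise a short description of what is wrong.
     */
    static String describeProblem(int seqLength, int start, int end) {
        if (start < 1) {
            return "start " + start + " is below 1";
        }
        if (end < start) {
            return "end " + end + " is before start " + start;
        }
        if (end > seqLength) {
            return "end " + end + " is past the sequence length " + seqLength
                    + " (was the sequence truncated on load?)";
        }
        return null;
    }

    public static void main(String[] args) {
        // The lengths here are made-up values, only to exercise both outcomes.
        System.out.println(describeProblem(200000, 230000, 230100));
        System.out.println(describeProblem(300000, 230000, 230100));
    }
}
```

If the length reported for the loaded sequence is shorter than the
contig's real length, that points at the parser; if the length is right
and the coordinates are in range, that points at SubSequence.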

Could you send me:

   - The full exception stacktrace (very helpful for debugging)
   - The version of BioJava you're using (1.22? something older?
     or a version from CVS?)
   - The ID of the sequence you've been testing.
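
If your program catches the exception and prints only its message, the
stacktrace is lost. Here is a minimal sketch of capturing the whole
trace as text using only standard Java classes. The class name
StackTraceSketch and the deliberately failing array access are just
placeholders for your own loading and subsequence code.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical example of capturing a complete stacktrace as a String. */
final class StackTraceSketch {

    /** Renders the exception, its message, and every stack frame as text. */
    static String fullTrace(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        t.printStackTrace(out);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        try {
            // Stand-in for the real work: an out-of-range array access.
            int[] symbols = new int[10];
            System.out.println(symbols[230000]);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Print (or paste into an email) the full trace, not just e.getMessage().
            System.err.println(fullTrace(e));
        }
    }
}
```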

Thanks,

     Thomas.