[BioRuby] Is there a limit to string / naseq length?

Zhou, Lixin LZhou at illumina.com
Mon Mar 1 21:39:51 EST 2004


Hi all,

I was parsing NCBI's human RefSeq 34 version 2 and noticed that the DNA
sequence from SOURCE is truncated.  This appears to be reproducible when
I require "bio/db/genbank/refseq".

The NT_005612 sequence from CHR_03 is 100,530,261 bp long (the longest
in human RefSeq 34 v2 and the only one whose sequence is longer than 1M
bp).  I was parsing the entire RefSeq and cutting out exon sequences
when I noticed that a few NM / XM entries returned empty sequences from
NT_005612.  A closer look showed that their coordinates are all greater
than 100,000,000.  I printed gb.naseq and, indeed, the sequence is
truncated at about 100,000,020 bases.  It also appears that BioRuby
reads only the first 2,575,408 lines of the entire RefSeq record,
because the 100,000,021st base starts at line 2,575,409 of the NT
record.
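
As a rough check of the line arithmetic, here is a small sketch.  It
assumes the usual GenBank flat-file layout of 60 bases per sequence
line; the helper names are made up for illustration and are not part of
BioRuby.  Under that assumption, 100,000,020 bases fill a whole number
of sequence lines, which fits a cut at a line boundary.

```ruby
# Assumption: each sequence line in the ORIGIN section holds 60 bases.
BASES_PER_LINE = 60

# 1-based index of the ORIGIN sequence line that contains a given base.
def origin_line_for(base_position, bases_per_line = BASES_PER_LINE)
  (base_position - 1) / bases_per_line + 1
end

# True when the given base is the first base on its sequence line.
def starts_line?(base_position, bases_per_line = BASES_PER_LINE)
  (base_position - 1) % bases_per_line == 0
end

last_kept = 100_000_020                # approximate truncation point
puts origin_line_for(last_kept)        # 1666667
puts starts_line?(last_kept + 1)       # true
```

Note that these are line numbers within the sequence block only; the
line numbers quoted above count from the top of the record, so they
also include the header and feature table.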

I briefly checked the BioRuby source and did not find any limit on the
sequence length.  Could this be a bug in Ruby 1.8.1, which I am using?
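
One way to narrow this down might be to count the bases in the flat
file without going through BioRuby at all, and compare the result with
gb.naseq.length.  The sketch below is plain Ruby and only assumes that
the record holds its sequence between an ORIGIN line and a terminating
"//" line; count_origin_bases is a hypothetical helper, not a BioRuby
method, and the sample record is invented.

```ruby
require "stringio"

# Count sequence letters between ORIGIN and the terminating "//" of a
# single GenBank-style record.  Position numbers and spaces are ignored.
def count_origin_bases(io)
  in_origin = false
  total = 0
  io.each_line do |line|
    if line =~ /\AORIGIN/
      in_origin = true
    elsif line =~ /\A\/\//
      in_origin = false
    elsif in_origin
      total += line.count("a-zA-Z")   # letters only
    end
  end
  total
end

# Tiny made-up record, just to show the call.
sample = <<EOS
LOCUS       TEST
ORIGIN
        1 gatcctccat atacaacggt
       21 atctcc
//
EOS

puts count_origin_bases(StringIO.new(sample))   # 26
```

For a real record, the same method can be given an open File instead
of a StringIO.  If the count matches the expected 100,530,261 while
gb.naseq is shorter, the bases are being lost somewhere in parsing
rather than in the file itself.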

Thanks.

Lixin Zhou
lzhou at illumina.com


