[BioRuby] Is there a limit to string / naseq length?

Zhou, Lixin LZhou at illumina.com
Tue Mar 2 00:14:05 EST 2004


I've just deleted some lines of annotation in the feature table in NT_005612 and found that the sequence is still truncated to 100,000,020 bp.  Therefore, the bug may have nothing to do with the number of lines in the RefSeq record.

Here are corrections to the mistakes / typos in my previous message:

1. The sequence is from ORIGIN not SOURCE.
2. The sequence length is greater than 100 M bp.

-----Original Message-----
From:	Zhou, Lixin
Sent:	Mon 3/1/2004 6:39 PM
To:	bioruby at open-bio.org
Cc:	
Subject:	[BioRuby] Is there a limit to string / naseq length?
Hi all,

I was parsing NCBI's human RefSeq 34 version 2 and noticed that the DNA
sequence from SOURCE is truncated.  This appears to be reproducible when
I require "bio/db/genbank/refseq".

The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
(longest in human RefSeq 34 v2 and the only one whose sequence is
greater than 1M bp).  I was parsing the entire RefSeq and then cutting
out exon sequences, and noticed that a few NM / XM entries returned empty
sequences from NT_005612.  A careful examination indicates that their
coordinates are greater than 100,000,000.  I tried printing gb.naseq and,
indeed, the sequence is truncated to about 100,000,020 bp.  By the way, it
appears BioRuby takes only the first 2,575,408 lines of the entire RefSeq
record, because the 100,000,021st base starts at line 2,575,409 of the NT
record.
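
A minimal sketch of how the figures above could be cross-checked without
going through BioRuby, assuming the usual GenBank flat-file layout (ORIGIN
lines carrying 60 bases each, record terminated by "//").  The helper name
count_origin_bases and the tiny sample record are hypothetical and only for
illustration; they are not part of BioRuby or of the RefSeq data.

```ruby
require "stringio"

# Assumption: a standard GenBank ORIGIN line holds 60 bases.
BASES_PER_LINE = 60

# 100,000,020 is an exact multiple of 60, so a cut at that length
# would fall on a line boundary rather than in the middle of a line.
p 100_000_020.divmod(BASES_PER_LINE)   # prints [1666667, 0]

# Hypothetical helper (not a BioRuby method): count the sequence letters
# between the ORIGIN line and the terminating "//" of one record.
def count_origin_bases(io)
  in_origin = false
  total = 0
  io.each_line do |line|
    if line.start_with?("ORIGIN")
      in_origin = true
    elsif line.start_with?("//")
      break if in_origin
    elsif in_origin
      # Coordinates and spaces are ignored; only letters are counted.
      total += line.count("a-zA-Z")
    end
  end
  total
end

# Made-up record, for illustration only.
sample = [
  "LOCUS       TEST",
  "ORIGIN",
  "        1 acgtacgtac gtacgtacgt",
  "       21 acgt",
  "//"
].join("\n")

puts count_origin_bases(StringIO.new(sample))   # prints 24
```

Running the same helper over the real NT_005612 file (opened with
File.open) and comparing the result with gb.naseq.length would show whether
the bases are lost while reading the file or later, when the sequence
string is built.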

I briefly checked the BioRuby source and have not found a limit on the
sequence length.  Is this a bug in Ruby 1.8.1, which I use?

Thanks.

Lixin Zhou
lzhou at illumina.com

_______________________________________________
BioRuby mailing list
BioRuby at open-bio.org
http://portal.open-bio.org/mailman/listinfo/bioruby




