[BioRuby] Is there a limit to string / naseq length?

Tue Mar 2 13:11:07 EST 2004

Hi,

The following change affects all subclasses of Bio::NCBIDB: I have
changed the regexp in bio/db.rb that matches a top-level tag from
/\n(\S)/ to /\n([A-Za-z\])/ so that it no longer matches digits.
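
A minimal sketch of why the old pattern truncates the sequence. The
helper split_top_level is hypothetical and is not BioRuby's actual code.
The second pattern is an assumed stand-in (letters plus "/"), since the
exact character class used in bio/db.rb may differ.

```ruby
# Old pattern from the post: any non-space character at the line head
# is treated as the start of a new top-level field.
OLD_PATTERN = /\n(\S)/
# Assumed stand-in for the new pattern: only letters (and "/") start a field.
NEW_PATTERN_ASSUMED = %r{\n([A-Za-z/])}

# Hypothetical helper: mark each top-level tag with a separator, then split.
def split_top_level(entry, pattern)
  sep = "\001"
  entry.gsub(pattern) { "\n#{sep}#{$1}" }.split(sep)
end

# Shortened excerpt of the record quoted in this thread.
ENTRY = [
  "LOCUS       NT_005612          100530261 bp",
  "ORIGIN",
  " 99999961 aagcaatgaa ctgtctgtgg",
  "100000021 cggggcgggg agactaggag",
  "//",
].join("\n") + "\n"

old_fields = split_top_level(ENTRY, OLD_PATTERN)
new_fields = split_top_level(ENTRY, NEW_PATTERN_ASSUMED)

# With the old pattern the 9-digit coordinate starts a bogus field,
# so the ORIGIN field ends just before it.
puts old_fields.size
puts new_fields.size
```

With the old pattern the line starting with 100000021 is split off from
ORIGIN, which matches the truncation reported in this thread.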

In addition, sequence extraction is now faster because I replaced
gsub with tr in genbank.rb.
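
As a rough illustration (not the actual code in genbank.rb), both calls
below strip the coordinates and whitespace from ORIGIN lines. The
character set '^a-z' is an assumption made for this sketch. No timing is
measured here.

```ruby
# Two ORIGIN lines in the layout quoted in this thread.
origin = "        1 gaattcacac atcacaaaga\n" \
         "       61 atttcctttt ccgctggcag\n"

# Regexp-based removal of everything that is not a lowercase letter.
via_gsub = origin.gsub(/[^a-z]/, '')

# String#tr with a negated set and an empty replacement deletes the
# same characters without going through the regexp engine.
via_tr = origin.tr('^a-z', '')

puts via_tr
puts via_gsub == via_tr
```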

Please try these changes in CVS and report if they break anything.

Lixin, thank you for your report.

Regards,
Toshiaki Katayama

On 2004/03/03, at 1:58, Zhou, Lixin wrote:

> Hi,
>
> Thanks for pinpointing the bug.  I was just checking
> bio/db/genbank/genbank.rb and realized that the fields from the ^LOCUS
> line were tokenized using the GenBank "definition".  Apparently, GenBank
> will have to break their rules sooner or later.  Perhaps we can simply
> split the line as long as the total number of fields remains the same?
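
A minimal sketch of that idea, assuming the eight whitespace-separated
fields seen in the NT_005612 LOCUS line quoted in this thread. The names
split_locus and LOCUS_KEYS are hypothetical and not part of BioRuby.
Real LOCUS lines may omit fields, in which case this sketch returns nil
rather than guessing.

```ruby
# Assumed field names for an eight-field LOCUS line (illustrative only).
LOCUS_KEYS = %i[tag entry_id length unit natype topology division date].freeze

# Hypothetical helper: tokenize on whitespace instead of fixed columns,
# and accept the line only if the expected number of fields is present.
def split_locus(line)
  fields = line.split
  return nil unless fields.size == LOCUS_KEYS.size
  LOCUS_KEYS.zip(fields).to_h
end

locus = split_locus(
  "LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004"
)
puts locus[:entry_id]
puts locus[:length].to_i
```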
>
> Thanks!
>
> Lixin Zhou
>
>> -----Original Message-----
>> From: Toshiaki Katayama [mailto:ktym at hgc.jp]
>> Sent: Tuesday, March 02, 2004 12:55 AM
>> To: bioruby at open-bio.org
>> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
>>
>>
>> Hi,
>>
>> I have confirmed that this also occurs on my OS X and Linux boxes
>> with Ruby 1.6.8 and 1.8.1 when parsing the following file.
>>
>>
>> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz
>>
>> My implementation of the GenBank parser and the Bio::Sequence classes
>> does not limit sequence length.
>>
>> ...however...
>>
>> The problem is that when I wrote bio/db.rb, I did not imagine that
>> the sequence coordinate number in the NCBI GenBank format could
>> reach the head of the line, so the parser misses the lines from
>> 100000021 onward.
>>
>> ------------------------------------------------------------------------------
>> LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
>> DEFINITION  Homo sapiens chromosome 3 genomic contig.
>> ACCESSION   NT_005612
>> (snip)
>> ORIGIN
>>          1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
>>         61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
>> (snip)
>>   99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
>> 100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
>> (snip)
>> 100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
>> 100530241 tgttcatgat tattctgaat t
>> //
>> ------------------------------------------------------------------------------
>>
>> I will fix this in CVS, although it may take some time.
>>
>> Sorry for the inconvenience,
>> Toshiaki Katayama
>>
>>
>> On 2004/03/02, at 14:14, Zhou, Lixin wrote:
>>
>>> I've just deleted some lines of annotation in the feature table in
>>> NT_005612 and found that the sequence is still truncated to
>>> 100,000,020 bp.  Therefore, the bug may have nothing to do with the
>>> number of lines in the RefSeq record.
>>>
>>> Here are corrections to the mistakes / typos in my previous message:
>>>
>>> 1. The sequence is from ORIGIN not SOURCE.
>>> 2. The sequence length is greater than 100 M bp.
>>>
>>> -----Original Message-----
>>> From:	Zhou, Lixin
>>> Sent:	Mon 3/1/2004 6:39 PM
>>> To:	bioruby at open-bio.org
>>> Cc:	
>>> Subject:	[BioRuby] Is there a limit to string / naseq length?
>>> Hi all,
>>>
>>> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the
>>> DNA sequence from SOURCE is truncated.  This appears to be
>>> reproducible when I require "bio/db/genbank/refseq".
>>>
>>> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
>>> (longest in human RefSeq 34 v2 and the only one whose sequence is
>>> greater than 1M bp).  I was parsing the entire RefSeq and then
>>> cutting exon sequences, and noticed that a few NM / XM entries
>>> returned an empty sequence from NT_005612.  A careful examination
>>> indicates that their coordinates are greater than 100,000,000.  I
>>> tried to print out gb.naseq and indeed, the sequence is truncated to
>>> about 100,000,020.  By the way, it appears that bioruby takes only
>>> the first 2575408 lines of the entire RefSeq record, because the
>>> 100,000,021st base starts at line 2,575,409 of the NT record.
>>>
>>> I briefly checked bioruby source and have not found a limit to the
>>> sequence length.  Is this a bug from Ruby 1.8.1, which I use?
>>>
>>> Thanks.
>>>
>>> Lixin Zhou
>>> lzhou at illumina.com
>>>
>>> _______________________________________________
>>> BioRuby mailing list
>>> BioRuby at open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioruby