[BioRuby] Is there a limit to string / naseq length?

Zhou, Lixin LZhou at illumina.com
Tue Mar 2 13:15:19 EST 2004


Thank you very much for your quick and great work!  I'll try it out.

> -----Original Message-----
> From: Toshiaki Katayama [mailto:ktym at hgc.jp] 
> Sent: Tuesday, March 02, 2004 10:11 AM
> To: BioRuby Discussion List Project
> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
> 
> 
> Hi,
> 
> The following change affects all subclasses of Bio::NCBIDB:
> I have changed the regexp in bio/db.rb that matches a top-level tag
> from /\n(\S)/ to /\n([A-Za-z\])/ so that it no longer matches digits.
> 
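A minimal sketch of why the old pattern truncated long records, assuming an entry is cut into top-level fields by placing a separator before every line whose first character the pattern captures. The helper name `split_top_level` is hypothetical, and the character class `[A-Za-z\/]` is an assumption (the class quoted above looks like it lost a character in transit); neither is copied from bio/db.rb.

```ruby
# Tiny stand-in for a GenBank entry; the last sequence line starts with a
# digit in the first column, like the 100000021 line discussed in this thread.
ENTRY = [
  "LOCUS       NT_005612",
  "ORIGIN",
  " 99999961 aagcaatgaa",
  "100000021 cggggcgggg",
  "//",
].join("\n") + "\n"

OLD_PATTERN = /\n(\S)/           # any non-space character starts a new field
NEW_PATTERN = /\n([A-Za-z\/])/   # assumed: only letters and "/" start a new field

# Hypothetical helper: mark each top-level line start, then split on the marks.
def split_top_level(entry, pattern)
  sep = "\001"
  entry.gsub(pattern) { "\n#{sep}#{$1}" }.split(sep)
end

# Return the field that begins with the ORIGIN tag.
def origin_field(entry, pattern)
  split_top_level(entry, pattern).find { |field| field.start_with?("ORIGIN") }
end

puts "old pattern keeps 100000021 line: #{origin_field(ENTRY, OLD_PATTERN).include?('100000021')}"
puts "new pattern keeps 100000021 line: #{origin_field(ENTRY, NEW_PATTERN).include?('100000021')}"
```

With the old pattern the line that starts with a digit is treated as a new top-level field, so it is cut off from the ORIGIN block; with a letters-only pattern it stays attached.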
> Plus, sequence extraction became faster by replacing gsub 
> with tr in genbank.rb.
> 
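To illustrate the gsub-to-tr change: a small sketch, assuming the sequence is extracted by deleting every character that is not a lowercase letter from the ORIGIN block. The exact expressions used in genbank.rb are not shown in this thread, so the method names and both calls below are illustrative only.

```ruby
require "benchmark"

# Two sequence lines in ORIGIN layout, repeated to get a measurable input.
ORIGIN_BODY = ("        1 gaattcacac atcacaaaga\n" \
               "       61 atttcctttt ccgctggcag\n") * 10_000

# Regexp-based removal of digits, spaces and newlines.
def extract_with_gsub(text)
  text.gsub(/[^a-z]/, "")
end

# Same result with String#tr: a negated set and an empty replacement
# deletes every character outside a-z without using the regexp engine.
def extract_with_tr(text)
  text.tr("^a-z", "")
end

Benchmark.bm(5) do |bm|
  bm.report("gsub") { extract_with_gsub(ORIGIN_BODY) }
  bm.report("tr")   { extract_with_tr(ORIGIN_BODY) }
end
```

Timings depend on the Ruby version and the machine, so no figures are given here; the point is only that both calls return the same string.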
> Please try these changes from CVS and report if they break anything.
> 
> 
> Lixin, thank you for your report.
> 
> Regards,
> Toshiaki Katayama
> 
> On 2004/03/03, at 1:58, Zhou, Lixin wrote:
> 
> > Hi,
> >
> > Thanks for pinpointing the bug.  I was just checking
> > bio/db/genbank/genbank.rb and realized that the fields from the ^LOCUS
> > line were tokenized using the GenBank "definition".  Apparently,
> > GenBank will have to break their rules sooner or later.  Perhaps we can
> > simply split the line as long as the total number of fields remains
> > the same?
> >
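The suggestion above could look roughly like this: split the LOCUS line on whitespace and accept it only if the field count is the expected one. The helper name `parse_locus` is hypothetical, and the expected count of 8 is an assumption taken from the single LOCUS line quoted in this thread; other records may lay the line out differently.

```ruby
LOCUS_LINE = "LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004"

EXPECTED_LOCUS_FIELDS = 8  # assumption based on the line above

# Hypothetical helper: tokenize a LOCUS line by whitespace instead of by
# fixed column positions, and refuse lines with an unexpected field count.
def parse_locus(line)
  fields = line.split
  unless fields.size == EXPECTED_LOCUS_FIELDS && fields.first == "LOCUS"
    raise ArgumentError, "unexpected LOCUS layout: #{line.inspect}"
  end

  _tag, name, length, _unit, molecule, topology, division, date = fields
  {
    name: name,
    length: Integer(length),
    molecule: molecule,
    topology: topology,
    division: division,
    date: date,
  }
end

p parse_locus(LOCUS_LINE)
```

Raising on an unexpected count keeps the parser from silently shifting fields when a record does not match the assumed layout.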
> > Thanks!
> >
> > Lixin Zhou
> >
> >> -----Original Message-----
> >> From: Toshiaki Katayama [mailto:ktym at hgc.jp]
> >> Sent: Tuesday, March 02, 2004 12:55 AM
> >> To: bioruby at open-bio.org
> >> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
> >>
> >>
> >> Hi,
> >>
> >> I have confirmed this also occurs on my OS X and Linux box with Ruby
> >> 1.6.8 and 1.8.1 by parsing the following file.
> >>
> >>
> >> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz
> >>
> >> My implementation of the GenBank parser and the Bio::Sequence classes
> >> doesn't limit sequence length.
> >>
> >> ...however...
> >>
> >> The problem was that, when I wrote bio/db.rb, I couldn't imagine that
> >> the sequence coordinate number in the NCBI GenBank format could reach
> >> the head of the line, so the parser misses the lines from 100000021 onward.
> >>
> >> ------------------------------------------------------------------------
> >> LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
> >> DEFINITION  Homo sapiens chromosome 3 genomic contig.
> >> ACCESSION   NT_005612
> >> (snip)
> >> ORIGIN
> >>         1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
> >>        61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
> >> (snip)
> >>  99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
> >> 100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
> >> (snip)
> >> 100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
> >> 100530241 tgttcatgat tattctgaat t
> >> //
> >> ------------------------------------------------------------------------
> >>
> >> I will fix this in CVS, although it may take some time.
> >>
> >> Sorry for the inconvenience,
> >> Toshiaki Katayama
> >>
> >>
> >> On 2004/03/02, at 14:14, Zhou, Lixin wrote:
> >>
> >>> I've just deleted some lines of annotation in the feature table in
> >>> NT_005612 and found that the sequence is still truncated to
> >>> 100,000,020 bp.  Therefore, the bug may have nothing to do with the
> >>> number of lines in the RefSeq record.
> >>>
> >>> Here are corrections to the mistakes / typos in my previous message:
> >>>
> >>> 1. The sequence is from ORIGIN not SOURCE.
> >>> 2. The sequence length is greater than 100 M bp.
> >>>
> >>> -----Original Message-----
> >>> From:	Zhou, Lixin
> >>> Sent:	Mon 3/1/2004 6:39 PM
> >>> To:	bioruby at open-bio.org
> >>> Cc:	
> >>> Subject:	[BioRuby] Is there a limit to string / naseq length?
> >>>
> >>> Hi all,
> >>>
> >>> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the
> >>> DNA sequence from SOURCE is truncated.  This appears to be reproducible
> >>> when I "require \"bio/db/genbank/refseq\"".
> >>>
> >>> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
> >>> (longest in human RefSeq 34 v2 and the only one whose sequence is
> >>> greater than 1M bp).  I was parsing the entire RefSeq and then cutting
> >>> exon sequences, and noticed that a few NM / XM entries returned empty
> >>> sequences from NT_005612.  A careful examination indicated that their
> >>> coordinates are greater than 100,000,000.  I tried to print out gb.naseq
> >>> and indeed, the sequence is truncated to about 100,000,020.  By the way,
> >>> it appears bioruby takes only the first 2,575,408 lines of the entire
> >>> RefSeq record, because the 100,000,021st base starts at line 2,575,409
> >>> of the NT record.
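The 100,000,020 figure is consistent with the cause identified elsewhere in this thread (a coordinate label reaching the first column of the line). A short arithmetic sketch, assuming 60 bases per sequence line as in the quoted ORIGIN excerpt and a coordinate label right-justified in 9 columns; the constant and method names are made up for this example.

```ruby
BASES_PER_LINE = 60  # bases per sequence line, as in the ORIGIN excerpt in this thread
LABEL_WIDTH    = 9   # assumption: the coordinate label is right-justified in 9 columns

# Coordinate printed at the start of the n-th sequence line (1-based).
def line_label(n)
  BASES_PER_LINE * (n - 1) + 1
end

# Index of the first sequence line whose label fills all LABEL_WIDTH columns,
# i.e. the first line that begins with a digit instead of a space.
def first_full_width_line
  smallest_full_label = 10**(LABEL_WIDTH - 1)
  # Smallest n with line_label(n) >= smallest_full_label (integer ceiling division).
  (smallest_full_label - 1 + BASES_PER_LINE - 1) / BASES_PER_LINE + 1
end

n = first_full_width_line
puts format("%#{LABEL_WIDTH}d|", line_label(n - 1))  # still has a leading space
puts format("%#{LABEL_WIDTH}d|", line_label(n))      # digit in the first column
puts "bases before that line: #{line_label(n) - 1}"
```

Under these assumptions the first line with a digit in column one is the one labeled 100000021, which leaves exactly 100,000,020 bases before it.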
> >>>
> >>> I briefly checked the bioruby source and have not found a limit to the
> >>> sequence length.  Is this a bug in Ruby 1.8.1, which I use?
> >>>
> >>> Thanks.
> >>>
> >>> Lixin Zhou
> >>> lzhou at illumina.com
> >>>
> >>> _______________________________________________
> >>> BioRuby mailing list
> >>> BioRuby at open-bio.org 
> >>> http://portal.open-bio.org/mailman/listinfo/bioruby
> >>>
> >>
> >>
> >
> 
> 