[Bioperl-l] SeqIO out of memory
Brian Osborne
brian_osborne at cognia.com
Mon Mar 3 12:24:57 EST 2003
Ewan,
Here's the relevant section:
misc_feature 1553..1564
/gene="aap-1"
/note="Region: [PSORT] ER membrane domain: KSIP"
misc_feature bond(174,175)
/gene="aap-1"
/note="Intron length 47 bp, type gt_ag"
exon 175..434
/gene="aap-1"
/note="Exon 2 length 260 bp"
misc_feature bond(434,435)
/gene="aap-1"
/note="Intron length 181 bp, type gt_ag"
exon 435..700
/gene="aap-1"
/note="Exon 3 length 266 bp"
misc_feature bond(700,701)
/gene="aap-1"
/note="Intron length 227 bp, type gt_ag"
exon 701..1132
/gene="aap-1"
/note="Exon 4 length 432 bp"
misc_feature bond(1132,1133)
/gene="aap-1"
/note="Intron length 80 bp, type gt_ag"
exon 1133..1802
/gene="aap-1"
/note="Exon 5 length 670 bp"
3'UTR 1571..1802
/gene="aap-1"
/evidence=experimental
polyA_signal 1789..1794
/gene="aap-1"
/note="variant attaaa"
BASE COUNT 560 a 368 c 355 g 519 t
ORIGIN
1 aatgagcaca acacctggaa ctcctcatgg agtcactcat agtcttatgg aacaaggatg
Yes, it should not be there but there it is. I'm guessing that the thought
is that "bond" is useful for mRNAs, indicating the positions of the exons on
the sequence. In the RefSeq file I have there are >100,000 entries and some
7,000 misc_feature's with "bond", all of them <num>,<num>, not surprisingly.
The documentation you've provided shows that the general format
(<num>,<num>) is a valid one, whether to handle "bond" is another
question....
Brian O.
-----Original Message-----
From: bioperl-l-bounces at bioperl.org [mailto:bioperl-l-bounces at bioperl.org]On
Behalf Of Ewan Birney
Sent: Monday, March 03, 2003 11:20 AM
To: Brian Osborne
Cc: Hilmar Lapp; Bioperl
Subject: RE: [Bioperl-l] SeqIO out of memory
On Mon, 3 Mar 2003, Brian Osborne wrote:
> Ewan and Hilmar,
> >> Aha. This a better explanation. ;)
>
> Not so fast there my friends, I now have a _real_ bug! ;-)
> Check this one out:
> ~/data/refseq>perl -e 'use Bio::SeqIO; $in =
> Bio::SeqIO->new(-file=>"aap-1.gb",-
> format => "genbank" ); open MYOUT,">aap-1.fa"; while ( $seq =
> $in->next_seq ){ p
> rint MYOUT ">",$seq->accession_number,"\n",$seq->seq,"\n"; }'
>
> -------------------- WARNING ---------------------
> MSG: exception while parsing location line [bond(174,175)] in reading
> EMBL/GenBa
> nk/SwissProt, ignoring feature misc_feature (seqid=aap-1):
> ------------- EXCEPTION -------------
> MSG: operator "bond" unrecognized by parser
Well... is this a bug?
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#op
erators
bond is not in there. Can you send the entire line?
> STACK Bio::Factory::FTLocationFactory::from_string
> /usr/lib/perl5/site_perl/5.8.
> 0/Bio/Factory/FTLocationFactory.pm:160
> STACK (eval) /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/FTHelper.pm:124
> STACK Bio::SeqIO::FTHelper::_generic_seqfeature
> /usr/lib/perl5/site_perl/5.8.0/B
> io/SeqIO/FTHelper.pm:123
> STACK Bio::SeqIO::genbank::next_seq
> /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/gen
> bank.pm:394
> STACK toplevel -e:1
>
> So Factory/FTLocationFactory only wants to see the operators "complement",
> "join", or "order" in Genbank files, but I have a Genbank file with "bond"
> in that same position (NM_059121, C. elegans aap-1):
>
> ~/data/refseq>grep bond aap-1.gb
> misc_feature bond(174,175)
> misc_feature bond(434,435)
> misc_feature bond(700,701)
> misc_feature bond(1132,1133)
>
> It's easy to change line 160 in FTLocationFactory from
> "} elsif(($op eq "join") || ($op eq "order") ) {"
> to
> "} elsif(($op eq "join") || ($op eq "order") ) || ($op eq "bond")) {
> and now the file is parsed without complaint, but does SplitLocation
> correctly handle "(<num>,<num>)" in addition to
> "(<num>..<num>,<num>..<num>)"?
>
> Brian O.
>
>
>
> > -hilmar
> >
> > On Friday, February 28, 2003, at 01:06 PM, Brian Osborne wrote:
> >
> > > Bioperl-l,
> > > Check out this one-liner, where the input file is rscu.gbff, a
> > > Genbank-formatted file with 111,220 entries. The fasta file that's
made
> > > contains only 42,451 entries. Is "Out of memory" the expected result
> > > for an
> > > input file this size?
> > >
> > > ~/data/refseq>perl -e 'use Bio::SeqIO; $in =
> > > Bio::SeqIO->new(-file=>"rscu.gbff",
> > > -format=>"genbank"); open MYOUT,">rscu.fa"; while ( $seq =
> > > in->next_seq ){ print
> > > MYOUT ">" . $seq->accession_number . "\n" . $seq->seq . "\n"; }'
> > >
> > > Out of memory during "large" request for 33558528 bytes, total sbrk()
> > > is
> > > 3822837
> > > 76 bytes at /usr/lib/perl5/site_perl/5.8.0/Bio/Seq/RichSeq.pm line
114,
> > > <GEN0> l
> > > ine 6433958.
> > >
> > > Brian O.
> > >
> > >
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at bioperl.org
> > > http://bioperl.org/mailman/listinfo/bioperl-l
> > >
> > --
> > -------------------------------------------------------------
> > Hilmar Lapp email: lapp at gnf.org
> > GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> > -------------------------------------------------------------
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
>
>
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney at ebi.ac.uk>.
-----------------------------------------------------------------
_______________________________________________
Bioperl-l mailing list
Bioperl-l at bioperl.org
http://bioperl.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list