[Bioperl-l] Memory requirements for conversion from embl to genbank
Chris Fields
cjfields at uiuc.edu
Thu Aug 31 14:13:31 UTC 2006
Martin,
That's the issue; I believe the tags are supposed to be unique (part of the
EMBL standard, I think). I'll look at it but this may be, again, one of
those issues which we may not fix as it's a problem with the input sequence
(not in the correct format).
At the very least it should break out of an infinite loop with a thrown
message. Have you tried adding a debugging statement to the specific line
in genbank.pm to verify the infinite loop?
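
A minimal sketch of what such a guard could look like, in plain Perl. This is not the actual code in genbank.pm: `quote_with_guard` and its pass limit are hypothetical names, the substitution is deliberately naive so that it never converges, and plain `die` stands in for the module's own exception call.

```perl
use strict;
use warnings;

# Hypothetical helper: double any embedded quotes, but give up with an
# error instead of looping forever when the rewrite never converges.
sub quote_with_guard {
    my ($value, $max_passes) = @_;
    my $passes = 0;

    # Deliberately naive: each pass inserts another quote, so the pattern
    # matches again on the next pass and the string keeps growing.
    while ($value =~ s/"/""/) {
        die sprintf("Quoting did not converge after %d passes (length now %d)\n",
                    $max_passes, length $value)
            if ++$passes >= $max_passes;
    }
    return $value;
}

# A value without quotes passes straight through.
my $ok = quote_with_guard('no quotes here', 100);

# A value with quotes trips the guard; eval turns the die into a message.
my $err = eval { quote_with_guard('a "quoted" note', 100); 1 } ? '' : $@;

print "unchanged: $ok\n";
print "caught: $err";
```

Without the counter the loop above would grow the string until the process runs out of memory, which is at least consistent with the symptom reported in this thread.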
Wow, you've run into a hornet's nest of bad sequences. Missing quotes, too
many quotes, now this!
Chris
> -----Original Message-----
> From: Martin MOKREJŠ [mailto:mmokrejs at ribosome.natur.cuni.cz]
> Sent: Thursday, August 31, 2006 8:50 AM
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Memory requirements for conversion from embl to
> genbank
>
> It slowed down after printing the lines below; actually, it stopped
> printing altogether (but that could be because the output is buffered.
> Hmm, there used to be a way to turn buffering off; I knew the contents
> of 'man perlopentut' some years ago, but it's gone from my head now):
>
> Acc:BB133146
> Acc:BB199913
> Acc:BB199915
> Acc:BB199667
> Acc:BB199670
> Acc:BB199673
> Acc:BB199676
> Acc:BB199679
> Acc:BB199682
> Acc:BB228934
> Acc:BB229388
> Acc:BB229266
> Acc:BB229267
> Acc:BB199709
> Acc:BB199710
> Acc:BB199711
> Acc:BB199712
> Acc:BB200048
> Acc:BB199986
> Acc:BB199993
>
>
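
On the buffering question above: a short sketch of the two usual ways to turn off output buffering in Perl, so that each accession shows up as soon as it is printed. The accession is the one from the record quoted in this message; nothing here is specific to BioPerl.

```perl
use strict;
use warnings;
use IO::Handle;    # provides the autoflush method on filehandles

# Flush STDOUT after every print instead of waiting for the buffer to fill.
STDOUT->autoflush(1);

# The older idiom does the same for the currently selected handle:
# $| = 1;

print "Acc:", "BB199698", "\n";    # now appears immediately
```

The `$|` variable is described in `perldoc perlvar`.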
> It hasn't died yet, but I guess it will in a while. The next record
> which it did not spit out is:
>
> ID   5HGB000664 standard; mRNA; VRL; 1892 BP.
> XX
> AC   BB199698;
> XX
> DT   20-NOV-2002 (Rel. 16, Created)
> DT   20-NOV-2002 (Rel. 16, Last updated, Version 1)
> XX
> DE   5'UTR in Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> DR   EMBL; AJ428955;
> DR   UTR; CC221018;
> XX
> OS   Hepatitis GB virus B
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC   Cardiovirus.
> XX
> UT   5'UTR;
> XX
> FH   Key             Location/Qualifiers
> FH
> FT   5'UTR           1..1892
> FT                   /source="EMBL::AJ428955:1..1892"
> FT                   /product="non-structural polyprotein"
> FT   VECTOR          477..1274
> FT                   /source="EMBL::AJ428955:477..1274"
> FT                   /evidence="Similarity"
> FT                   /db_xref="EMBL:"
> FT                   /note="Possible vector contamination"
> FT                   /note="Length=798 BP. Identities=99.6%"
> XX
>
>
> Note the two /note feature lines. I guess the quoting code loops over
> them and keeps adding one quote after another. ;-)
>
> M.
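
To find out which other records carry a repeated qualifier before running the conversion, something like the following could be used. It is only a sketch in plain Perl, not BioPerl code: `repeated_qualifiers` is a hypothetical helper, and it assumes every qualifier fits on a single FT line (wrapped qualifier values would be mistaken for a new feature key).

```perl
use strict;
use warnings;

# Hypothetical helper: given the lines of one EMBL-style record, report
# every qualifier that occurs more than once within a single feature.
sub repeated_qualifiers {
    my (@lines) = @_;
    my (@report, $key, %seen);

    # Record the repeats for the feature just finished, then reset the counts.
    my $flush = sub {
        return unless defined $key;
        for my $q (sort keys %seen) {
            push @report, "$key: /$q x$seen{$q}" if $seen{$q} > 1;
        }
        %seen = ();
    };

    for my $line (@lines) {
        next unless $line =~ /^FT\s+(\S.*)$/;    # feature table lines only
        my $rest = $1;
        if ($rest =~ m{^/(\w+)}) {               # qualifier, e.g. /note="..."
            $seen{$1}++;
        }
        else {                                   # new feature, e.g. VECTOR 477..1274
            $flush->();
            ($key) = split ' ', $rest;
        }
    }
    $flush->();
    return @report;
}

# The feature table of the record quoted above.
my @record = (
    q{FT 5'UTR 1..1892},
    q{FT /source="EMBL::AJ428955:1..1892"},
    q{FT /product="non-structural polyprotein"},
    q{FT VECTOR 477..1274},
    q{FT /source="EMBL::AJ428955:477..1274"},
    q{FT /evidence="Similarity"},
    q{FT /db_xref="EMBL:"},
    q{FT /note="Possible vector contamination"},
    q{FT /note="Length=798 BP. Identities=99.6%"},
);

print "$_\n" for repeated_qualifiers(@record);
```

For this record it reports the doubled /note on the VECTOR feature and nothing for the 5'UTR feature.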
>
>
>
>
> Chris Fields wrote:
> > Martin,
> >
> > Do you get the same issue using SeqIO?
> >
> > #!/usr/bin/perl -w
> >
> > use strict;
> > use warnings;
> > use Bio::SeqIO;
> >
> > # Input (EMBL) and output (GenBank) file names
> > my $file_in  = '5UTR.Vrl_nr.dat';
> > my $file_out = '5UTR.Vrl_nr.gb';
> >
> > my $seqin = Bio::SeqIO->new(-format => 'embl',
> >                             -file   => "<$file_in");
> >
> > my $seqout = Bio::SeqIO->new(-format => 'genbank',
> >                              -file   => ">$file_out");
> >
> > # Print each accession as it is read, then write the record out,
> > # so the last accession printed shows where the conversion stalls.
> > while (my $seq = $seqin->next_seq) {
> >     print "Acc:", $seq->accession, "\n";
> >     $seqout->write_seq($seq);
> > }
> >
> >
> > Chris
> >
> >
> > On Aug 31, 2006, at 7:44 AM, Martin MOKREJŠ wrote:
> >
> >> Hi,
> >> I use bp_sreformat.pl to convert a file from EMBL format
> >> to GenBank. I use the current CVS HEAD version and cannot parse
> >> two files. Each record is small and I don't understand why
> >> there is such a huge memory requirement. The machine has 1GB
> >> RAM and is running a recent Linux kernel. Moreover, I could
> >> parse the same file with bioperl-1.5.1 after I had manually
> >> fixed some missing quotes in the file.
> >>
> >> With the current changes to the EMBL & GenBank parsing (bug #2077)
> >> I can no longer parse the file.
> >>
> >> Here is the memory status (vmstat output) at the moment when the
> >> machine ran out of memory and the Linux kernel killed the application:
> >>
> >>  1  0 803212  20936      8   2184    0    0     0     0 1062   38 99  1  0  0
> >>  1  0 803208  19944      8   2184    0    0     0     0 1062   38 100  0  0  0
> >>  1  0 803208  18828      8   2184    0    0     0     0 1061   37 100  0  0  0
> >>  1  0 803204  17836      8   2184    0    0     0     0 1062   40 100  0  0  0
> >>  1  0 803204  16844      8   2184    0    0     0     0 1062   48 100  0  0  0
> >>  1  0 803200  15728      8   2184   32    0    32     0 1063   41 100  0  0  0
> >>  1  0 803200  14736      8   2184    0    0     0     0 1062   41 99  1  0  0
> >>  1  0 803196  13744      8   2184    0    0     0     0 1061   38 100  0  0  0
> >>  1  0 803240  13640      8   2184    0   48     0    48 1063   68 99  1  0  0
> >>  1  1 803240  12920      8   1984    0   40     0    40 1065  136 100  0  0  0
> >>  1  1 803240  13192      8   1872    0 1056     0  1056 1114  326 96  4  0  0
> >>  1  1 803240  14448      8   1336    0   20     0    20 1081  192 90 10  0  0
> >>  1  1 803240  13656      8   1232    0   28     0    28 1070  104 87 13  0  0
> >>  1  1 803240  12892      8   1260   32    4   176     4 1069  113 86 14  0  0
> >>  0  4 803240  12144      8   1344  192   24   612    24 1088  185 44 16  0 40
> >>  0  7 803240  11952      8   1180   32   32   508    32 1113  591 46 23  0 32
> >>  0  3 803240  11948      8   1336 1120  500 10816   500 4390 1397  2 31  0 66
> >>  2  6 803240  12056      8   1788  752  136  9412   136 6101 1795  0 27  0 73
> >>  0  7 803240  12176      8   1748   12    0  2180     0 1132  326  0 20  0 80
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >>  0  5 803240  12492      8   1508  136   32  7508    32 2610  865  4 45  0 51
> >>  0  6 803240  12056      8   2004   64    8  1456     8 1138  312  9 18  0 73
> >>  1  6 803240  12668      8   1452   96   28 14856    28 2434  658  0 31  0 69
> >>  0  7 803240  13240      8    564    0    0  3112     0 4602 1492  4 38  0 58
> >>  0 10 803240  12768      8    688   36 15272  6000 15272 2026  431 26 39  0 35
> >>  0  2  81780 966512      8   5692  108    0  2904     0 2204  372  0 11  0 89
> >>  0  3  81780 966204      8   6056  128    0   488     3 1155   82  1  0  0 99
> >>  0  1  81780 965460      8   6260  492    0   696     0 1150  161  0  1 13 86
> >>  0  1  81732 963652      8   7860    8    0  1608     0 1147  199  1  2 42 55
> >>  0  1  81732 962052      8   8560    4    0   704     0 1129  177  6  1 43 50
> >>  0  1  81732 960120      8   9128    0    0   568     0 1124  161 12  2 57 29
> >>  0  1  81732 957512      8   9840    4    0   716     0 1137  191 13  2 27 58
> >>  1  0  81732 954992      8  10640   32    0   832     0 1135  191 14  1 47 38
> >>  1  0  81732 952824      8  11016    0    0   340     0 1096  128 64  1 18 16
> >>  1  0  81732 952152      8  11092    0    0     0     0 1062   80 99  1  0  0
> >>  1  0  81732 951424      8  11196    0    0     0     0 1062  105 99  1  0  0
> >>  1  0  81732 950808      8  11264    0    0     0     0 1062   74 99  1  0  0
> >>
> >>
> >> $ bp_sreformat.pl -if embl -of genbank -i 5UTR.Vrl_nr.dat -o 5UTR.Vrl_nr.gb
> >> Killed
> >> $
> >>
> >> The file can be obtained from
> >> ftp://bighost.ba.itb.cnr.it-fixed/pub/Embnet/Database/UTR/data/
> >>
> >> I am not a Perl guru, nor am I familiar with the BioPerl code. Does
> >> someone know whether the parsed records are held in memory or not?
> >> It seems so. I guess the objects could be deleted from memory by
> >> dropping the references to them immediately after they are written
> >> out in the new format. Or the garbage collector does not work well
> >> in Perl 5.8.8.
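
On the memory question: Perl frees objects by reference counting, so an object is released as soon as the last reference to it goes away, unless something else (a cache, a circular reference) still points at it. The sketch below shows this for a loop shaped like `while (my $seq = ...)`. `Demo::Record` is a hypothetical stand-in class, not a BioPerl one, and the sketch says nothing about what the EMBL parser itself holds on to.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence object; not a BioPerl class.
package Demo::Record;

my $live = 0;    # number of Demo::Record objects currently alive

sub new     { my ($class, $id) = @_; $live++; return bless { id => $id }, $class }
sub DESTROY { $live-- }          # called when the last reference disappears
sub live    { return $live }

package main;

my $peak = 0;
for my $id (1 .. 5) {
    my $rec = Demo::Record->new($id);    # the only reference to this object
    $peak = Demo::Record::live() if Demo::Record::live() > $peak;
}   # $rec goes out of scope at the end of each pass, freeing the object

print "peak live objects: $peak, live after loop: ",
      Demo::Record::live(), "\n";
```

At most one object is alive at any time here, so a per-record loop on its own does not accumulate records.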
> >>
> >> Thanks for any help.
> >> Martin
> >>
> >> --
> >> Dr. Martin Mokrejs
> >> Faculty of Science, Charles University
> >> Vinicna 5, 128 43 Prague, Czech Republic
> >> http://www.iresite.org
> >> http://www.iresite.org/~mmokrejs
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr. Robert Switzer
> > Dept of Biochemistry
> > University of Illinois Urbana-Champaign
> >
> >
> >
> >
>
> --
> Dr. Martin Mokrejs
> Faculty of Science, Charles University
> Vinicna 5, 128 43 Prague, Czech Republic
> http://www.iresite.org
> http://www.iresite.org/~mmokrejs