[Bioperl-l] Memory requirements for conversion from embl to genbank
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Thu Aug 31 13:50:01 UTC 2006
It slowed down after printing for a while, and has now stopped printing
text altogether (but that could be because the output is buffered; there
is a way to turn buffering off, and I used to know the contents of
'man perlopentut' some years ago, but it's gone from my head now).
The last accessions it printed were:
Acc:BB133146
Acc:BB199913
Acc:BB199915
Acc:BB199667
Acc:BB199670
Acc:BB199673
Acc:BB199676
Acc:BB199679
Acc:BB199682
Acc:BB228934
Acc:BB229388
Acc:BB229266
Acc:BB229267
Acc:BB199709
Acc:BB199710
Acc:BB199711
Acc:BB199712
Acc:BB200048
Acc:BB199986
Acc:BB199993
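On the buffering aside above: Perl block-buffers STDOUT when it is redirected to a file or a pipe, and autoflush can be switched on per filehandle. A minimal sketch, assuming the progress lines go to STDOUT (the accession printed here is only the first one from the listing above):

```perl
use strict;
use warnings;
use IO::Handle;

# Enable autoflush on STDOUT so every print is written out immediately.
# This has the same effect as setting $| = 1 while STDOUT is the
# currently selected output handle.
STDOUT->autoflush(1);

print "Acc:BB133146\n";
```

With autoflush on, the last accession shown is the last record that was actually processed, which makes it easier to tell a stalled parser from output that is merely sitting in a buffer.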
It hasn't died yet, but I guess it will in a while. The next record
which it did not spit out is:
ID 5HGB000664 standard; mRNA; VRL; 1892 BP.
XX
AC BB199698;
XX
DT 20-NOV-2002 (Rel. 16, Created)
DT 20-NOV-2002 (Rel. 16, Last updated, Version 1)
XX
DE 5'UTR in Hepatitis GB virus B subgenomic replicon neoRepB
XX
DR EMBL; AJ428955;
DR UTR; CC221018;
XX
OS Hepatitis GB virus B
OS Encephalomyocarditis virus
OC Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
OC Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC Cardiovirus.
XX
UT 5'UTR;
XX
FH Key Location/Qualifiers
FH
FT 5'UTR 1..1892
FT /source="EMBL::AJ428955:1..1892"
FT /product="non-structural polyprotein"
FT VECTOR 477..1274
FT /source="EMBL::AJ428955:477..1274"
FT /evidence="Similarity"
FT /db_xref="EMBL:"
FT /note="Possible vector contamination"
FT /note="Length=798 BP. Identities=99.6%"
XX
Note the two /note feature lines. I guess the quoting code loops over
them and keeps adding quote after quote. ;-)
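To rule out bad input before blaming the parser, the feature table can be checked for qualifiers with an odd number of double quotes. This is only a rough sketch: `unbalanced_qualifiers` and `odd_quotes` are hypothetical helper names, the check is a line-based heuristic, and it is not how BioPerl itself parses or quotes qualifiers.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string holds an odd number of '"'.
sub odd_quotes {
    my ($s) = @_;
    my $n = ($s =~ tr/"//);    # tr with an empty replacement only counts
    return $n % 2;
}

# Hypothetical helper: given EMBL lines, return the FT qualifiers whose
# value has unbalanced double quotes.
sub unbalanced_qualifiers {
    my @lines = @_;
    my @quals;
    for my $line (@lines) {
        next unless $line =~ /^FT\s+(\S.*)$/;    # feature-table lines only
        my $text = $1;
        if ($text =~ m{^/\w+}) {
            push @quals, $text;                  # a new /qualifier starts
        }
        elsif (@quals && odd_quotes($quals[-1])) {
            $quals[-1] .= " $text";              # continuation of an open value
        }
        # anything else is a feature key/location line and is ignored
    }
    return grep { odd_quotes($_) } @quals;
}

# The VECTOR feature from the record above, as it appears in this mail.
my @record = (
    'FT VECTOR 477..1274',
    'FT /source="EMBL::AJ428955:477..1274"',
    'FT /evidence="Similarity"',
    'FT /db_xref="EMBL:"',
    'FT /note="Possible vector contamination"',
    'FT /note="Length=798 BP. Identities=99.6%"',
);
my @bad = unbalanced_qualifiers(@record);
print scalar(@bad), " unbalanced qualifier(s)\n";
print "  $_\n" for @bad;
```

If this reports nothing for the record, the quotes in the input are balanced and the problem is more likely in how the two /note values are handled on output.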
M.
Chris Fields wrote:
> Martin,
>
> Do you get the same issue using SeqIO?
>
> #!/usr/bin/perl -w
>
> use strict;
> use warnings;
> use Bio::SeqIO;
>
> my $file_in = '5UTR.Vrl_nr.dat';
>
> my $file_out = '5UTR.Vrl_nr.gb';
>
> my $seqin = Bio::SeqIO->new(-format => 'embl',
>                             -file   => "<$file_in");
>
> my $seqout = Bio::SeqIO->new(-format => 'genbank',
>                              -file   => ">$file_out");
>
> while (my $seq = $seqin->next_seq) {
>     print "Acc:",$seq->accession,"\n";
>     $seqout->write_seq($seq);
> }
>
>
> Chris
>
>
> On Aug 31, 2006, at 7:44 AM, Martin MOKREJŠ wrote:
>
>> Hi,
>> I use bp_sreformat.pl to convert a file from embl format
>> to genbank. I use the current cvs HEAD version and cannot parse
>> two files. Each record is small and I don't understand why
>> there is such a huge memory requirement. The machine has 1GB
>> RAM and is running a recent linux kernel. Moreover, I could
>> parse the same file with bioperl-1.5.1 once I had manually
>> fixed some missing quotes in the file.
>>
>> With the current changes to the embl & genbank parsing (bug #2077)
>> I can no longer parse the file.
>>
>> Here is the memory status at the moment when the machine ran
>> out of memory and the linux kernel killed the application:
>>
>> 1 0 803212 20936 8 2184 0 0 0 0 1062 38 99 1 0 0
>> 1 0 803208 19944 8 2184 0 0 0 0 1062 38 100 0 0 0
>> 1 0 803208 18828 8 2184 0 0 0 0 1061 37 100 0 0 0
>> 1 0 803204 17836 8 2184 0 0 0 0 1062 40 100 0 0 0
>> 1 0 803204 16844 8 2184 0 0 0 0 1062 48 100 0 0 0
>> 1 0 803200 15728 8 2184 32 0 32 0 1063 41 100 0 0 0
>> 1 0 803200 14736 8 2184 0 0 0 0 1062 41 99 1 0 0
>> 1 0 803196 13744 8 2184 0 0 0 0 1061 38 100 0 0 0
>> 1 0 803240 13640 8 2184 0 48 0 48 1063 68 99 1 0 0
>> 1 1 803240 12920 8 1984 0 40 0 40 1065 136 100 0 0 0
>> 1 1 803240 13192 8 1872 0 1056 0 1056 1114 326 96 4 0 0
>> 1 1 803240 14448 8 1336 0 20 0 20 1081 192 90 10 0 0
>> 1 1 803240 13656 8 1232 0 28 0 28 1070 104 87 13 0 0
>> 1 1 803240 12892 8 1260 32 4 176 4 1069 113 86 14 0 0
>> 0 4 803240 12144 8 1344 192 24 612 24 1088 185 44 16 0 40
>> 0 7 803240 11952 8 1180 32 32 508 32 1113 591 46 23 0 32
>> 0 3 803240 11948 8 1336 1120 500 10816 500 4390 1397 2 31 0 66
>> 2 6 803240 12056 8 1788 752 136 9412 136 6101 1795 0 27 0 73
>> 0 7 803240 12176 8 1748 12 0 2180 0 1132 326 0 20 0 80
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> r b swpd free buff cache si so bi bo in cs us sy id wa
>> 0 5 803240 12492 8 1508 136 32 7508 32 2610 865 4 45 0 51
>> 0 6 803240 12056 8 2004 64 8 1456 8 1138 312 9 18 0 73
>> 1 6 803240 12668 8 1452 96 28 14856 28 2434 658 0 31 0 69
>> 0 7 803240 13240 8 564 0 0 3112 0 4602 1492 4 38 0 58
>> 0 10 803240 12768 8 688 36 15272 6000 15272 2026 431 26 39 0 35
>> 0 2 81780 966512 8 5692 108 0 2904 0 2204 372 0 11 0 89
>> 0 3 81780 966204 8 6056 128 0 488 3 1155 82 1 0 0 99
>> 0 1 81780 965460 8 6260 492 0 696 0 1150 161 0 1 13 86
>> 0 1 81732 963652 8 7860 8 0 1608 0 1147 199 1 2 42 55
>> 0 1 81732 962052 8 8560 4 0 704 0 1129 177 6 1 43 50
>> 0 1 81732 960120 8 9128 0 0 568 0 1124 161 12 2 57 29
>> 0 1 81732 957512 8 9840 4 0 716 0 1137 191 13 2 27 58
>> 1 0 81732 954992 8 10640 32 0 832 0 1135 191 14 1 47 38
>> 1 0 81732 952824 8 11016 0 0 340 0 1096 128 64 1 18 16
>> 1 0 81732 952152 8 11092 0 0 0 0 1062 80 99 1 0 0
>> 1 0 81732 951424 8 11196 0 0 0 0 1062 105 99 1 0 0
>> 1 0 81732 950808 8 11264 0 0 0 0 1062 74 99 1 0 0
>>
>>
>> $ bp_sreformat.pl -if embl -of genbank -i 5UTR.Vrl_nr.dat -o 5UTR.Vrl_nr.gb
>> Killed
>> $
>>
>> The file can be obtained from
>> ftp://bighost.ba.itb.cnr.it-fixed/pub/Embnet/Database/UTR/data/
>>
>> I am not a perl guru, nor am I familiar with the bioperl code. Does
>> someone know whether the parsed records are held in memory or not?
>> It seems so. I guess the objects could be deleted from memory by
>> dereferencing them immediately after they have been written out in
>> the new format. Or perhaps the garbage collector does not work well
>> in perl 5.8.8.
>>
>> Thanks for any help.
>> Martin
>>
>> --
>> Dr. Martin Mokrejs
>> Faculty of Science, Charles University
>> Vinicna 5, 128 43 Prague, Czech Republic
>> http://www.iresite.org
>> http://www.iresite.org/~mmokrejs
>
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
>
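On the question in the original message about whether parsed records stay in memory: Perl 5 frees an object as soon as its reference count drops to zero, so a `my` variable inside a loop is normally released on every iteration. The main way to defeat that is a circular reference. The sketch below shows the mechanism in plain Perl; `Record` is a hypothetical stand-in class, not a BioPerl object, and nothing here shows that BioPerl actually creates such cycles.

```perl
use strict;
use warnings;

package Record;    # hypothetical stand-in for a parsed sequence object

my $live = 0;      # number of Record objects currently alive

sub new     { my ($class, %args) = @_; $live++; return bless {%args}, $class }
sub DESTROY { $live-- }
sub live    { return $live }

package main;

# Without cycles, each object is destroyed at the end of its iteration.
my $peak = 0;
for my $i (1 .. 1000) {
    my $rec = Record->new(id => $i);
    $peak = Record::live() if Record::live() > $peak;
}
print "peak live objects without cycles: $peak\n";

# With a self-reference, the count never reaches zero and nothing is freed.
for my $i (1 .. 1000) {
    my $rec = Record->new(id => $i);
    $rec->{self} = $rec;    # cycle keeps the object alive after the loop body
}
print "live objects after the cyclic loop: ", Record::live(), "\n";
```

If memory grows steadily while small records are converted one at a time, either something is holding references to earlier records or a single record is being expanded without bound; a counter like this in a DESTROY method is one cheap way to tell the two apart.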