[Bioperl-l] Memory requirements for conversion from embl to genbank

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Thu Aug 31 13:50:01 UTC 2006


It slowed down after printing the lines below, or rather it stopped
printing altogether (but that could be because the output is buffered;
there used to be a way to turn buffering off, and I knew the contents
of 'man perlopentut' some years ago, but it's gone from my head now):

Acc:BB133146
Acc:BB199913
Acc:BB199915
Acc:BB199667
Acc:BB199670
Acc:BB199673
Acc:BB199676
Acc:BB199679
Acc:BB199682
Acc:BB228934
Acc:BB229388
Acc:BB229266
Acc:BB229267
Acc:BB199709
Acc:BB199710
Acc:BB199711
Acc:BB199712
Acc:BB200048
Acc:BB199986
Acc:BB199993
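
On the buffering question: a minimal sketch of how output buffering on
STDOUT can be turned off in Perl, so that the last "Acc:" line printed
really is the last record processed. The accession printed here is only a
stand-in for the real per-record progress line; nothing below is taken
from bp_sreformat.pl itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;    # core module; provides autoflush() on filehandles

# Flush STDOUT after every print instead of waiting for the buffer to fill.
STDOUT->autoflush(1);

# Older equivalent idiom, acting on the currently selected handle:
# $| = 1;

# Stand-in for the progress line printed once per record.
print "Acc:", "BB199698", "\n";
```

With this at the top of the conversion script, the progress lines are
written as soon as they are printed, which makes it easier to tell a
stalled parser from merely delayed output.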


It hasn't died yet, but I guess it will in a while. The next record
which it did not spit out is:

ID   5HGB000664 standard; mRNA; VRL; 1892 BP.
XX
AC   BB199698;
XX
DT   20-NOV-2002 (Rel. 16, Created)
DT   20-NOV-2002 (Rel. 16, Last updated, Version 1)
XX
DE   5'UTR in Hepatitis GB virus B subgenomic replicon neoRepB
XX
DR   EMBL; AJ428955;
DR   UTR; CC221018;
XX
OS   Hepatitis GB virus B
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
UT   5'UTR;
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..1892
FT                   /source="EMBL::AJ428955:1..1892"
FT                   /product="non-structural polyprotein"
FT   VECTOR          477..1274
FT                   /source="EMBL::AJ428955:477..1274"
FT                   /evidence="Similarity"
FT                   /db_xref="EMBL:"
FT                   /note="Possible vector contamination"
FT                   /note="Length=798 BP. Identities=99.6%"
XX


Note the two /note feature lines. I guess the quoting code loops over
them and keeps adding one quote after another. ;-)
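
To test that guess on the suspect record alone, it could be cut out of the
big file and fed to the converter by itself. The following is only a sketch:
it assumes that records in the file end with the usual EMBL "//" terminator
line (not shown in the excerpt above), and the helper name extract_record
is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first record whose AC line mentions $acc, or undef if
# there is none. Assumes each record ends with a "//" line.
sub extract_record {
    my ($fh, $acc) = @_;
    local $/ = "//\n";    # read one whole record per <$fh>
    while (my $record = <$fh>) {
        return $record if $record =~ /^AC\s.*\b\Q$acc\E;/m;
    }
    return;
}

# Usage: perl extract.pl 5UTR.Vrl_nr.dat BB199698 > one_record.dat
if (@ARGV == 2) {
    my ($file, $acc) = @ARGV;
    open my $in, '<', $file or die "Cannot open $file: $!";
    my $record = extract_record($in, $acc);
    close $in;
    die "No record with accession $acc\n" unless defined $record;
    print $record;
}
```

If the single-record file alone makes the conversion hang or grow in
memory, the problem is in the handling of that record rather than in the
size of the input.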

M.




Chris Fields wrote:
> Martin,
> 
> Do you get the same issue using SeqIO?
> 
> #!/usr/bin/perl -w
> 
> use strict;
> use warnings;
> use Bio::SeqIO;
> 
> my $file_in = '5UTR.Vrl_nr.dat';
> 
> my $file_out = '5UTR.Vrl_nr.gb';
> 
> my $seqin = Bio::SeqIO->new(-format => 'embl',
>                             -file   => "<$file_in");
> 
> my $seqout = Bio::SeqIO->new(-format => 'genbank',
>                             -file   => ">$file_out");
> 
> while (my $seq = $seqin->next_seq) {
>     print "Acc:",$seq->accession,"\n";
>     $seqout->write_seq($seq);
> }
> 
> 
> Chris
> 
> 
> On Aug 31, 2006, at 7:44 AM, Martin MOKREJŠ wrote:
> 
>> Hi,
>>   I use bp_sreformat.pl to convert a file from embl format
>> to genbank. I use the current cvs HEAD version and cannot parse
>> two files. Each record is small and I don't understand why
>> there is such a huge memory requirement. The machine has 1GB
>> RAM and is running a recent linux kernel. Moreover, I could
>> parse the same file with bioperl-1.5.1 once I had manually
>> fixed some missing quotes in the file.
>>
>>   With the current changes to the embl & genbank parsing (bug #2077)
>> I can no longer parse the file.
>>
>>   Here is the memory status at the moment when the machine ran
>> out of memory and linux kernel killed the application:
>>
>>  1  0 803212  20936      8   2184    0    0     0     0 1062   38  99  1  0  0
>>  1  0 803208  19944      8   2184    0    0     0     0 1062   38 100  0  0  0
>>  1  0 803208  18828      8   2184    0    0     0     0 1061   37 100  0  0  0
>>  1  0 803204  17836      8   2184    0    0     0     0 1062   40 100  0  0  0
>>  1  0 803204  16844      8   2184    0    0     0     0 1062   48 100  0  0  0
>>  1  0 803200  15728      8   2184   32    0    32     0 1063   41 100  0  0  0
>>  1  0 803200  14736      8   2184    0    0     0     0 1062   41  99  1  0  0
>>  1  0 803196  13744      8   2184    0    0     0     0 1061   38 100  0  0  0
>>  1  0 803240  13640      8   2184    0   48     0    48 1063   68  99  1  0  0
>>  1  1 803240  12920      8   1984    0   40     0    40 1065  136 100  0  0  0
>>  1  1 803240  13192      8   1872    0 1056     0  1056 1114  326  96  4  0  0
>>  1  1 803240  14448      8   1336    0   20     0    20 1081  192  90 10  0  0
>>  1  1 803240  13656      8   1232    0   28     0    28 1070  104  87 13  0  0
>>  1  1 803240  12892      8   1260   32    4   176     4 1069  113  86 14  0  0
>>  0  4 803240  12144      8   1344  192   24   612    24 1088  185  44 16  0 40
>>  0  7 803240  11952      8   1180   32   32   508    32 1113  591  46 23  0 32
>>  0  3 803240  11948      8   1336 1120  500 10816   500 4390 1397   2 31  0 66
>>  2  6 803240  12056      8   1788  752  136  9412   136 6101 1795   0 27  0 73
>>  0  7 803240  12176      8   1748   12    0  2180     0 1132  326   0 20  0 80
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs  us sy id wa
>>  0  5 803240  12492      8   1508  136   32  7508    32 2610  865   4 45  0 51
>>  0  6 803240  12056      8   2004   64    8  1456     8 1138  312   9 18  0 73
>>  1  6 803240  12668      8   1452   96   28 14856    28 2434  658   0 31  0 69
>>  0  7 803240  13240      8    564    0    0  3112     0 4602 1492   4 38  0 58
>>  0 10 803240  12768      8    688   36 15272  6000 15272 2026  431  26 39  0 35
>>  0  2  81780 966512      8   5692  108    0  2904     0 2204  372   0 11  0 89
>>  0  3  81780 966204      8   6056  128    0   488     3 1155   82   1  0  0 99
>>  0  1  81780 965460      8   6260  492    0   696     0 1150  161   0  1 13 86
>>  0  1  81732 963652      8   7860    8    0  1608     0 1147  199   1  2 42 55
>>  0  1  81732 962052      8   8560    4    0   704     0 1129  177   6  1 43 50
>>  0  1  81732 960120      8   9128    0    0   568     0 1124  161  12  2 57 29
>>  0  1  81732 957512      8   9840    4    0   716     0 1137  191  13  2 27 58
>>  1  0  81732 954992      8  10640   32    0   832     0 1135  191  14  1 47 38
>>  1  0  81732 952824      8  11016    0    0   340     0 1096  128  64  1 18 16
>>  1  0  81732 952152      8  11092    0    0     0     0 1062   80  99  1  0  0
>>  1  0  81732 951424      8  11196    0    0     0     0 1062  105  99  1  0  0
>>  1  0  81732 950808      8  11264    0    0     0     0 1062   74  99  1  0  0
>>
>>
>> $ bp_sreformat.pl -if embl -of genbank -i 5UTR.Vrl_nr.dat -o 5UTR.Vrl_nr.gb
>> Killed
>> $
>>
>> The file can be obtained from
>> ftp://bighost.ba.itb.cnr.it-fixed/pub/Embnet/Database/UTR/data/
>>
>> I am not a perl guru, nor am I familiar with the bioperl code. Does
>> someone know whether the parsed records are held in memory or not?
>> It seems so. I guess the objects could be freed by dereferencing
>> them immediately after they have been written out in the new format.
>> Or the garbage collector does not work well in perl 5.8.8.
>>
>> Thanks for any help.
>> Martin
>>
>> -- 
>> Dr. Martin Mokrejs
>> Faculty of Science, Charles University
>> Vinicna 5, 128 43 Prague, Czech Republic
>> http://www.iresite.org
>> http://www.iresite.org/~mmokrejs
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
> 
> 
> 
> 

-- 
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs


