[Bioperl-l] Memory requirements for conversion from embl to genbank

Thu Aug 31 14:44:42 UTC 2006

Hi Chris,

so it has been killed after a while.

Chris Fields wrote:
> Martin,
> 
> That's the issue; I believe the tags are supposed to be unique (part of the
> EMBL standard, I think).  I'll look at it but this may be, again, one of
> those issues which we may not fix as it's a problem with the input sequence
> (not in the correct format).  

Why? You can at least merge the text of two successive /note feature
lines, right? Can you fix your code, Chris? It would take me a while to get
familiar with it. What I still have in mind is that, mostly, either
there is no closing quote or there are two closing quotes. Also, a single
quote often appears in the middle of the string, e.g. 5'UTR, 5'-UTR. As I
already mentioned, the loop should be used as a last resort. And now you see
why. The loop in genbank.pm definitely must have a built-in limit so that it
never adds more than, say, 4 or 6 quotes. ;)
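
To illustrate the merging idea: a minimal sketch, assuming the qualifiers of
one feature are available as [tag, value] pairs. merge_repeated_qualifier is
a hypothetical helper, not part of BioPerl, and the "; " separator is an
arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl code): fold every repeated occurrence
# of $tag into the first one, keeping all other qualifiers in order.
sub merge_repeated_qualifier {
    my ($tag, @qualifiers) = @_;        # each qualifier is [ $tag, $value ]
    my (@out, $merged_at);
    for my $q (@qualifiers) {
        my ($t, $v) = @$q;
        if ($t eq $tag && defined $merged_at) {
            $out[$merged_at][1] .= "; $v";      # append to the first occurrence
        } else {
            push @out, [ $t, $v ];              # copy, so the input is untouched
            $merged_at = $#out if $t eq $tag;
        }
    }
    return @out;
}

# The VECTOR feature from record BB199698 in this thread.
my @merged = merge_repeated_qualifier('note',
    [ evidence => 'Similarity' ],
    [ note     => 'Possible vector contamination' ],
    [ note     => 'Length=798 BP. Identities=99.6%' ],
);
print qq{/$_->[0]="$_->[1]"\n} for @merged;
```

With this, the writer would only ever see one /note value per feature.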

> 
> At the very least it should break out of an infinite loop with a thrown
> message.  Have you tried adding a debugging statement to the specific line
> in genbank.pm to verify the infinite loop?

Definitely. I would even opt for ignoring such /note lines; they are not critical.
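
A minimal sketch of such a bounded loop, assuming the value is completed by
appending continuation lines until the double quotes balance. This is not the
code in genbank.pm; collect_quoted_value and its $max_lines limit are
hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl code): append continuation lines to a
# qualifier value until its double quotes balance, and give up with an
# error after $max_lines extra lines instead of looping forever.
sub collect_quoted_value {
    my ($value, $more_lines, $max_lines) = @_;
    my $used = 0;
    while (($value =~ tr/"//) % 2) {            # odd count: quote still open
        die "Unbalanced quotes after $used continuation lines: $value\n"
            if $used >= $max_lines || !@$more_lines;
        $value .= ' ' . shift @$more_lines;
        $used++;
    }
    return $value;
}

my $value = collect_quoted_value('/note="Possible vector', ['contamination"'], 4);
print "$value\n";
```

A caller that prefers to ignore the broken qualifier could wrap the call in
eval { ... } and skip the line when it dies.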

>  
> Wow, you've run into a hornet's nest of bad sequences.  Missing quotes, too
> many quotes, now this!

Reality. :(
M.

> 
> Chris
> 
> 
>>-----Original Message-----
>>From: Martin MOKREJŠ [mailto:mmokrejs at ribosome.natur.cuni.cz]
>>Sent: Thursday, August 31, 2006 8:50 AM
>>To: Chris Fields
>>Cc: bioperl-l at lists.open-bio.org
>>Subject: Re: [Bioperl-l] Memory requirements for conversion from embl to
>>genbank
>>
>>It slowed down after printing the following; actually, it stopped
>>printing the text altogether (but that could be because the output is
>>buffered. There used to be a way to turn buffering off; I knew the
>>contents of 'man perlopentut' some years ago, but it's gone from my
>>head now):
>>
>>Acc:BB133146
>>Acc:BB199913
>>Acc:BB199915
>>Acc:BB199667
>>Acc:BB199670
>>Acc:BB199673
>>Acc:BB199676
>>Acc:BB199679
>>Acc:BB199682
>>Acc:BB228934
>>Acc:BB229388
>>Acc:BB229266
>>Acc:BB229267
>>Acc:BB199709
>>Acc:BB199710
>>Acc:BB199711
>>Acc:BB199712
>>Acc:BB200048
>>Acc:BB199986
>>Acc:BB199993
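
On the buffering question: a short sketch of the two usual ways to make
STDOUT unbuffered in Perl, so that each "Acc:" line appears as soon as it is
printed. The accession printed here is just the last one from the list above.

```perl
use strict;
use warnings;
use IO::Handle;

# Flush STDOUT after every print. With STDOUT as the selected handle
# (the default), this is equivalent to setting $| = 1.
STDOUT->autoflush(1);

print "Acc:BB199993\n";    # appears immediately, even when piped to a file
```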
>>
>>
>>It hasn't died yet, but I guess it will in a while. The next record
>>which it did not spit out is:
>>
>>ID   5HGB000664 standard; mRNA; VRL; 1892 BP.
>>XX
>>AC   BB199698;
>>XX
>>DT   20-NOV-2002 (Rel. 16, Created)
>>DT   20-NOV-2002 (Rel. 16, Last updated, Version 1)
>>XX
>>DE   5'UTR in Hepatitis GB virus B subgenomic replicon neoRepB
>>XX
>>DR   EMBL; AJ428955;
>>DR   UTR; CC221018;
>>XX
>>OS   Hepatitis GB virus B
>>OS   Encephalomyocarditis virus
>>OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
>>OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
>>OC   Cardiovirus.
>>XX
>>UT   5'UTR;
>>XX
>>FH   Key             Location/Qualifiers
>>FH
>>FT   5'UTR           1..1892
>>FT                   /source="EMBL::AJ428955:1..1892"
>>FT                   /product="non-structural polyprotein"
>>FT   VECTOR          477..1274
>>FT                   /source="EMBL::AJ428955:477..1274"
>>FT                   /evidence="Similarity"
>>FT                   /db_xref="EMBL:"
>>FT                   /note="Possible vector contamination"
>>FT                   /note="Length=798 BP. Identities=99.6%"
>>XX
>>
>>
>>Note the two /note feature lines. I guess the quoting code loops over
>>them and keeps adding quote after quote. ;-)
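
To find such records before running the conversion, one could pre-scan the
raw FT lines. A minimal sketch without BioPerl; repeated_qualifier_features
is a hypothetical helper, and it assumes feature keys start in column 6 as in
the record above.

```perl
use strict;
use warnings;

# Hypothetical pre-scan over raw EMBL lines (no BioPerl involved): return
# "KEY LOCATION" for every feature that carries /$tag more than once.
sub repeated_qualifier_features {
    my ($tag, @lines) = @_;
    my (@hits, $feature, $count);
    my $flush = sub { push @hits, $feature if defined $feature && $count > 1 };
    for my $line (@lines) {
        if ($line =~ /^FT   (\S+)\s+(\S+)/) {           # feature key line
            $flush->();
            ($feature, $count) = ("$1 $2", 0);
        } elsif ($line =~ m{^FT\s+/\Q$tag\E\b}) {       # matching qualifier
            $count++;
        } elsif ($line !~ /^FT/) {                      # end of feature table
            $flush->();
            undef $feature;
        }
    }
    $flush->();
    return @hits;
}

my @record = (
    "FT   5'UTR           1..1892",
    'FT                   /product="non-structural polyprotein"',
    "FT   VECTOR          477..1274",
    'FT                   /note="Possible vector contamination"',
    'FT                   /note="Length=798 BP. Identities=99.6%"',
    "XX",
);
print "$_\n" for repeated_qualifier_features('note', @record);
```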
>>
>>M.
>>
>>
>>
>>
>>Chris Fields wrote:
>>
>>>Martin,
>>>
>>>Do you get the same issue using SeqIO?
>>>
>>>#!/usr/bin/perl -w
>>>
>>>use strict;
>>>use warnings;
>>>use Bio::SeqIO;
>>>
>>># EMBL input and GenBank output file names
>>>my $file_in = '5UTR.Vrl_nr.dat';
>>>
>>>my $file_out = '5UTR.Vrl_nr.gb';
>>>
>>>my $seqin = Bio::SeqIO->new(-format => 'embl',
>>>                            -file   => "<$file_in");
>>>
>>>my $seqout = Bio::SeqIO->new(-format => 'genbank',
>>>                            -file   => ">$file_out");
>>>
>>>while (my $seq = $seqin->next_seq) {
>>>    print "Acc:",$seq->accession,"\n";
>>>    $seqout->write_seq($seq);
>>>}
>>>
>>>
>>>Chris
>>>
>>>
>>>On Aug 31, 2006, at 7:44 AM, Martin MOKREJŠ wrote:
>>>
>>>
>>>>Hi,
>>>>  I use bp_sreformat.pl to convert a file from EMBL format
>>>>to GenBank. I use the current CVS HEAD version and cannot parse
>>>>two files. Each record is small, and I don't understand why
>>>>there is such a huge memory requirement. The machine has 1 GB of
>>>>RAM and is running a recent Linux kernel. Moreover, I could
>>>>parse the same file with bioperl-1.5.1 once I had manually
>>>>fixed some missing quotes in the file.
>>>>
>>>>  With the current changes to the EMBL and GenBank parsing (bug #2077)
>>>>I can no longer parse the file.
>>>>
>>>>  Here is the memory status at the moment when the machine ran
>>>>out of memory and the Linux kernel killed the application:
>>>>
>>>> 1  0 803212  20936      8   2184    0    0     0     0 1062   38  99  1  0  0
>>>> 1  0 803208  19944      8   2184    0    0     0     0 1062   38 100  0  0  0
>>>> 1  0 803208  18828      8   2184    0    0     0     0 1061   37 100  0  0  0
>>>> 1  0 803204  17836      8   2184    0    0     0     0 1062   40 100  0  0  0
>>>> 1  0 803204  16844      8   2184    0    0     0     0 1062   48 100  0  0  0
>>>> 1  0 803200  15728      8   2184   32    0    32     0 1063   41 100  0  0  0
>>>> 1  0 803200  14736      8   2184    0    0     0     0 1062   41  99  1  0  0
>>>> 1  0 803196  13744      8   2184    0    0     0     0 1061   38 100  0  0  0
>>>> 1  0 803240  13640      8   2184    0   48     0    48 1063   68  99  1  0  0
>>>> 1  1 803240  12920      8   1984    0   40     0    40 1065  136 100  0  0  0
>>>> 1  1 803240  13192      8   1872    0 1056     0  1056 1114  326  96  4  0  0
>>>> 1  1 803240  14448      8   1336    0   20     0    20 1081  192  90 10  0  0
>>>> 1  1 803240  13656      8   1232    0   28     0    28 1070  104  87 13  0  0
>>>> 1  1 803240  12892      8   1260   32    4   176     4 1069  113  86 14  0  0
>>>> 0  4 803240  12144      8   1344  192   24   612    24 1088  185  44 16  0 40
>>>> 0  7 803240  11952      8   1180   32   32   508    32 1113  591  46 23  0 32
>>>> 0  3 803240  11948      8   1336 1120  500 10816   500 4390 1397   2 31  0 66
>>>> 2  6 803240  12056      8   1788  752  136  9412   136 6101 1795   0 27  0 73
>>>> 0  7 803240  12176      8   1748   12    0  2180     0 1132  326   0 20  0 80
>>>>procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>> r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs  us sy id wa
>>>> 0  5 803240  12492      8   1508  136   32  7508    32 2610  865   4 45  0 51
>>>> 0  6 803240  12056      8   2004   64    8  1456     8 1138  312   9 18  0 73
>>>> 1  6 803240  12668      8   1452   96   28 14856    28 2434  658   0 31  0 69
>>>> 0  7 803240  13240      8    564    0    0  3112     0 4602 1492   4 38  0 58
>>>> 0 10 803240  12768      8    688   36 15272  6000 15272 2026  431  26 39  0 35
>>>> 0  2  81780 966512      8   5692  108    0  2904     0 2204  372   0 11  0 89
>>>> 0  3  81780 966204      8   6056  128    0   488     3 1155   82   1  0  0 99
>>>> 0  1  81780 965460      8   6260  492    0   696     0 1150  161   0  1 13 86
>>>> 0  1  81732 963652      8   7860    8    0  1608     0 1147  199   1  2 42 55
>>>> 0  1  81732 962052      8   8560    4    0   704     0 1129  177   6  1 43 50
>>>> 0  1  81732 960120      8   9128    0    0   568     0 1124  161  12  2 57 29
>>>> 0  1  81732 957512      8   9840    4    0   716     0 1137  191  13  2 27 58
>>>> 1  0  81732 954992      8  10640   32    0   832     0 1135  191  14  1 47 38
>>>> 1  0  81732 952824      8  11016    0    0   340     0 1096  128  64  1 18 16
>>>> 1  0  81732 952152      8  11092    0    0     0     0 1062   80  99  1  0  0
>>>> 1  0  81732 951424      8  11196    0    0     0     0 1062  105  99  1  0  0
>>>> 1  0  81732 950808      8  11264    0    0     0     0 1062   74  99  1  0  0
>>>>
>>>>
>>>>$ bp_sreformat.pl -if embl -of genbank -i 5UTR.Vrl_nr.dat -o 5UTR.Vrl_nr.gb
>>>>Killed
>>>>$
>>>>
>>>>The file can be obtained from
>>>>ftp://bighost.ba.itb.cnr.it-fixed/pub/Embnet/Database/UTR/data/
>>>>
>>>>I am not a Perl guru, nor am I familiar with the BioPerl code. Does
>>>>someone know whether the parsed records are held in memory or not?
>>>>It seems so. I guess the objects could be freed by dropping the
>>>>references to them immediately after they have been written out in
>>>>the new format. Or else the garbage collector does not work well in
>>>>Perl 5.8.8.
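
For context on the memory question: Perl frees an object as soon as its
reference count drops to zero, so a lexical declared inside a loop body is
released at the end of every iteration. A small sketch with a stand-in Record
class (hypothetical, unrelated to Bio::Seq) that counts live objects. It says
nothing about whether the parser itself keeps extra references.

```perl
use strict;
use warnings;

package Record;
my $live = 0;                          # Record objects currently alive
sub new     { my ($class, $id) = @_; $live++; return bless { id => $id }, $class }
sub DESTROY { $live-- }                # runs when the last reference goes away
sub live    { $live }

package main;

my $peak = 0;
for my $id (1 .. 1000) {
    my $rec = Record->new($id);        # lexical: freed at the end of each pass
    $peak = Record::live() if Record::live() > $peak;
}
print "peak live records: $peak, alive now: ", Record::live(), "\n";
```

If memory still grows in a loop of this shape, something other than the loop
variable is holding on to data, for example a string that keeps getting longer.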
>>>>
>>>>Thanks for any help.
>>>>Martin
>>>>
>>>>--
>>>>Dr. Martin Mokrejs
>>>>Faculty of Science, Charles University
>>>>Vinicna 5, 128 43 Prague, Czech Republic
>>>>http://www.iresite.org
>>>>http://www.iresite.org/~mmokrejs
>>>>_______________________________________________
>>>>Bioperl-l mailing list
>>>>Bioperl-l at lists.open-bio.org
>>>>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>Christopher Fields
>>>Postdoctoral Researcher
>>>Lab of Dr. Robert Switzer
>>>Dept of Biochemistry
>>>University of Illinois Urbana-Champaign
>>>
>>>
>>>
>>>

-- 
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs