[Bioperl-l] Memory requirements for conversion from embl to genbank
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Thu Aug 31 14:44:42 UTC 2006
Hi Chris,
so it has been killed after a while.
Chris Fields wrote:
> Martin,
>
> That's the issue; I believe the tags are supposed to be unique (part of the
> EMBL standard, I think). I'll look at it but this may be, again, one of
> those issues which we may not fix as it's a problem with the input sequence
> (not in the correct format).
Why? You can at least merge the text of two successive /note feature
lines, right? Can you fix your code, Chris? It would take me a while to get
familiar with it. What I still have in mind is that mostly either
there is no closing quote or there are two closing quotes. And a single quote
often appears in the middle of the string, e.g. 5'UTR, 5'-UTR. As I already
mentioned, the loop should be used as a last resort. And now you see why.
Definitely, the loop in genbank.pm must have a built-in limit so it never adds
more than, say, 4 or 6 quotes. ;)
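
A minimal sketch of such a limit, written as a standalone function rather than as a patch to genbank.pm. The name repair_quotes and the $max_added parameter are hypothetical, and the balancing rule (value starts and ends with a double quote, with an even number of double quotes overall) is an assumption about what a well-formed qualifier value looks like, not a description of the existing BioPerl loop:

```perl
use strict;
use warnings;

# Hypothetical helper, NOT BioPerl's actual code: try to balance the double
# quotes around a qualifier value, but give up after $max_added additions
# instead of looping forever.
sub repair_quotes {
    my ($value, $max_added) = @_;
    $max_added = 4 unless defined $max_added;

    my $added = 0;
    while (1) {
        # tr with an empty replacement list only counts the characters.
        my $count = ($value =~ tr/"//);

        # Assumed definition of "balanced": opening and closing quote
        # present, and an even number of double quotes in total.
        return $value if $value =~ /\A".*"\z/s && $count % 2 == 0;

        # Hard limit reached: let the caller decide whether to warn or skip.
        return if $added >= $max_added;

        if ($value !~ /\A"/) {
            $value = '"' . $value;    # missing opening quote
        }
        else {
            $value .= '"';            # missing or unbalanced closing quote
        }
        $added++;
    }
}

# Single quotes, as in 5'UTR, are left alone; only double quotes are counted.
for my $raw ('"Possible vector contamination', q{5'UTR}) {
    my $fixed = repair_quotes($raw);
    print defined $fixed ? "$fixed\n" : "skipped: $raw\n";
}
```

Returning an undefined value when the limit is hit leaves it to the caller to throw a warning or drop the qualifier, so a malformed record cannot keep the parser busy indefinitely.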
>
> At the very least it should break out of an infinite loop with a thrown
> message. Have you tried adding a debugging statement to the specific line
> in genbank.pm to verify the infinite loop?
Definitely, I would even opt for ignoring such /note lines, they are not critical.
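
As an alternative to ignoring the repeated /note lines, the merge suggested above could be sketched as follows. This is only an illustration on a plain hash of qualifier values; merge_qualifier and the "; " separator are hypothetical choices, and nothing here uses or reflects the real BioPerl feature API:

```perl
use strict;
use warnings;

# Hypothetical helper: collapse repeated values of one qualifier into a
# single value. $quals maps a qualifier name to an array ref of its values.
sub merge_qualifier {
    my ($quals, $name, $sep) = @_;
    $sep = '; ' unless defined $sep;

    # Nothing to do when the qualifier is absent.
    my $values = $quals->{$name} or return $quals;

    # Join only when the qualifier really occurs more than once.
    $quals->{$name} = [ join $sep, @$values ] if @$values > 1;
    return $quals;
}

# Qualifiers of the VECTOR feature from the record quoted in this thread.
my %vector = (
    evidence => ['Similarity'],
    note     => [
        'Possible vector contamination',
        'Length=798 BP. Identities=99.6%',
    ],
);

merge_qualifier(\%vector, 'note');
print qq{/note="$vector{note}[0]"\n};
```

After the merge each qualifier name carries exactly one value, which would satisfy a writer that expects unique tags.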
>
> Wow, you've run into a hornet's nest of bad sequences. Missing quotes, too
> many quotes, now this!
Reality. :(
M.
>
> Chris
>
>
>>-----Original Message-----
>>From: Martin MOKREJŠ [mailto:mmokrejs at ribosome.natur.cuni.cz]
>>Sent: Thursday, August 31, 2006 8:50 AM
>>To: Chris Fields
>>Cc: bioperl-l at lists.open-bio.org
>>Subject: Re: [Bioperl-l] Memory requirements for conversion from embl to
>>genbank
>>
>>It slowed down after printing the lines below; actually, it stopped
>>printing the text (but that could be because the output is buffered. Hmm,
>>there used to be a way to turn buffering off; I used to know the contents
>>of 'man perlopentut' some years ago, but it's gone from my head now):
>>
>>Acc:BB133146
>>Acc:BB199913
>>Acc:BB199915
>>Acc:BB199667
>>Acc:BB199670
>>Acc:BB199673
>>Acc:BB199676
>>Acc:BB199679
>>Acc:BB199682
>>Acc:BB228934
>>Acc:BB229388
>>Acc:BB229266
>>Acc:BB229267
>>Acc:BB199709
>>Acc:BB199710
>>Acc:BB199711
>>Acc:BB199712
>>Acc:BB200048
>>Acc:BB199986
>>Acc:BB199993
>>
>>
>>It hasn't died yet, but I guess it will in a while. The next record
>>which it did not spit out is:
>>
>>ID 5HGB000664 standard; mRNA; VRL; 1892 BP.
>>XX
>>AC BB199698;
>>XX
>>DT 20-NOV-2002 (Rel. 16, Created)
>>DT 20-NOV-2002 (Rel. 16, Last updated, Version 1)
>>XX
>>DE 5'UTR in Hepatitis GB virus B subgenomic replicon neoRepB
>>XX
>>DR EMBL; AJ428955;
>>DR UTR; CC221018;
>>XX
>>OS Hepatitis GB virus B
>>OS Encephalomyocarditis virus
>>OC Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
>>OC Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
>>OC Cardiovirus.
>>XX
>>UT 5'UTR;
>>XX
>>FH Key Location/Qualifiers
>>FH
>>FT 5'UTR 1..1892
>>FT /source="EMBL::AJ428955:1..1892"
>>FT /product="non-structural polyprotein"
>>FT VECTOR 477..1274
>>FT /source="EMBL::AJ428955:477..1274"
>>FT /evidence="Similarity"
>>FT /db_xref="EMBL:"
>>FT /note="Possible vector contamination"
>>FT /note="Length=798 BP. Identities=99.6%"
>>XX
>>
>>
>>Note the two /note feature lines. I guess the quoting code loops over
>>them and keeps adding quote after quote. ;-)
>>
>>M.
>>
>>
>>
>>
>>Chris Fields wrote:
>>
>>>Martin,
>>>
>>>Do you get the same issue using SeqIO?
>>>
>>>#!/usr/bin/perl -w
>>>
>>>use strict;
>>>use warnings;
>>>use Bio::SeqIO;
>>>
>>># Input (EMBL) and output (GenBank) file names
>>>my $file_in = '5UTR.Vrl_nr.dat';
>>>
>>>my $file_out = '5UTR.Vrl_nr.gb';
>>>
>>>my $seqin = Bio::SeqIO->new(-format => 'embl',
>>>                            -file => "<$file_in");
>>>
>>>my $seqout = Bio::SeqIO->new(-format => 'genbank',
>>>                             -file => ">$file_out");
>>>
>>>while (my $seq = $seqin->next_seq) {
>>>    # Print each accession so the last record read before a hang is visible
>>>    print "Acc:",$seq->accession,"\n";
>>>    $seqout->write_seq($seq);
>>>}
>>>
>>>
>>>Chris
>>>
>>>
>>>On Aug 31, 2006, at 7:44 AM, Martin MOKREJŠ wrote:
>>>
>>>
>>>>Hi,
>>>> I use bp_sreformat.pl to convert a file from embl format
>>>>to genbank. I use the current cvs HEAD version and cannot parse
>>>>two files. Each record is small and I don't understand why
>>>>there is such a huge memory requirement. The machine has 1GB
>>>>RAM and is running a recent linux kernel. Moreover, I could
>>>>parse the same file with bioperl-1.5.1 once I had manually
>>>>fixed some missing quotes in the file.
>>>>
>>>> With current changes to the embl & genbank parsing (bug #2077)
>>>>I no longer can parse the file.
>>>>
>>>> Here is the memory status at the moment when the machine ran
>>>>out of memory and linux kernel killed the application:
>>>>
>>>>  1  0 803212  20936    8  2184    0     0     0     0 1062   38  99  1  0  0
>>>>  1  0 803208  19944    8  2184    0     0     0     0 1062   38 100  0  0  0
>>>>  1  0 803208  18828    8  2184    0     0     0     0 1061   37 100  0  0  0
>>>>  1  0 803204  17836    8  2184    0     0     0     0 1062   40 100  0  0  0
>>>>  1  0 803204  16844    8  2184    0     0     0     0 1062   48 100  0  0  0
>>>>  1  0 803200  15728    8  2184   32     0    32     0 1063   41 100  0  0  0
>>>>  1  0 803200  14736    8  2184    0     0     0     0 1062   41  99  1  0  0
>>>>  1  0 803196  13744    8  2184    0     0     0     0 1061   38 100  0  0  0
>>>>  1  0 803240  13640    8  2184    0    48     0    48 1063   68  99  1  0  0
>>>>  1  1 803240  12920    8  1984    0    40     0    40 1065  136 100  0  0  0
>>>>  1  1 803240  13192    8  1872    0  1056     0  1056 1114  326  96  4  0  0
>>>>  1  1 803240  14448    8  1336    0    20     0    20 1081  192  90 10  0  0
>>>>  1  1 803240  13656    8  1232    0    28     0    28 1070  104  87 13  0  0
>>>>  1  1 803240  12892    8  1260   32     4   176     4 1069  113  86 14  0  0
>>>>  0  4 803240  12144    8  1344  192    24   612    24 1088  185  44 16  0 40
>>>>  0  7 803240  11952    8  1180   32    32   508    32 1113  591  46 23  0 32
>>>>  0  3 803240  11948    8  1336 1120   500 10816   500 4390 1397   2 31  0 66
>>>>  2  6 803240  12056    8  1788  752   136  9412   136 6101 1795   0 27  0 73
>>>>  0  7 803240  12176    8  1748   12     0  2180     0 1132  326   0 20  0 80
>>>>procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd   free buff cache   si    so    bi    bo   in   cs  us sy id wa
>>>>  0  5 803240  12492    8  1508  136    32  7508    32 2610  865   4 45  0 51
>>>>  0  6 803240  12056    8  2004   64     8  1456     8 1138  312   9 18  0 73
>>>>  1  6 803240  12668    8  1452   96    28 14856    28 2434  658   0 31  0 69
>>>>  0  7 803240  13240    8   564    0     0  3112     0 4602 1492   4 38  0 58
>>>>  0 10 803240  12768    8   688   36 15272  6000 15272 2026  431  26 39  0 35
>>>>  0  2  81780 966512    8  5692  108     0  2904     0 2204  372   0 11  0 89
>>>>  0  3  81780 966204    8  6056  128     0   488     3 1155   82   1  0  0 99
>>>>  0  1  81780 965460    8  6260  492     0   696     0 1150  161   0  1 13 86
>>>>  0  1  81732 963652    8  7860    8     0  1608     0 1147  199   1  2 42 55
>>>>  0  1  81732 962052    8  8560    4     0   704     0 1129  177   6  1 43 50
>>>>  0  1  81732 960120    8  9128    0     0   568     0 1124  161  12  2 57 29
>>>>  0  1  81732 957512    8  9840    4     0   716     0 1137  191  13  2 27 58
>>>>  1  0  81732 954992    8 10640   32     0   832     0 1135  191  14  1 47 38
>>>>  1  0  81732 952824    8 11016    0     0   340     0 1096  128  64  1 18 16
>>>>  1  0  81732 952152    8 11092    0     0     0     0 1062   80  99  1  0  0
>>>>  1  0  81732 951424    8 11196    0     0     0     0 1062  105  99  1  0  0
>>>>  1  0  81732 950808    8 11264    0     0     0     0 1062   74  99  1  0  0
>>>>
>>>>
>>>>$ bp_sreformat.pl -if embl -of genbank -i 5UTR.Vrl_nr.dat -o 5UTR.Vrl_nr.gb
>>>>Killed
>>>>$
>>>>
>>>>The file can be obtained from
>>>>ftp://bighost.ba.itb.cnr.it-fixed/pub/Embnet/Database/UTR/data/
>>>>
>>>>I am not a perl guru, nor am I familiar with the bioperl code. Does
>>>>someone know whether the parsed records are held in memory or not?
>>>>It seems so. I guess deleting the objects from memory can be done by
>>>>dereferencing them immediately after they get written out in the new
>>>>format. Or the garbage collector does not work well in perl 5.8.8.
>>>>
>>>>Thanks for any help.
>>>>Martin
>>>>
>>>>--
>>>>Dr. Martin Mokrejs
>>>>Faculty of Science, Charles University
>>>>Vinicna 5, 128 43 Prague, Czech Republic
>>>>http://www.iresite.org
>>>>http://www.iresite.org/~mmokrejs
>>>>_______________________________________________
>>>>Bioperl-l mailing list
>>>>Bioperl-l at lists.open-bio.org
>>>>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>Christopher Fields
>>>Postdoctoral Researcher
>>>Lab of Dr. Robert Switzer
>>>Dept of Biochemistry
>>>University of Illinois Urbana-Champaign
>>>
>>>
>>>
>>>
>>
>>--
>>Dr. Martin Mokrejs
>>Faculty of Science, Charles University
>>Vinicna 5, 128 43 Prague, Czech Republic
>>http://www.iresite.org
>>http://www.iresite.org/~mmokrejs
>
>
>
>
--
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs