[Bioperl-l] Asking for advice on full EMBL extraction

Thu May 7 21:54:39 UTC 2009

Russell -

I am not sure how that will help as only 1 sequence is parsed at a  
time by SeqIO parsers and they use the "//" delimiter.

If the equivalent data exists in genbank format at NCBI I think  
_that_  module (Bio::SeqIO::genbank) has the ability to ignore  
annotations/features.  Really we have to re-work the whole thing to be  
more lightweight and lazy-parse.

-jason
On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:

> I'm not sure if this will help with your problem or how it deals  
> with memory management but using "ordinary" Perl to split the large  
> EMBL file might work.
> Give this a go:
>
> ============================
> #!perl -w
>
> use Bio::SeqIO;
> use IO::String;
>
> use constant SEP => "//\n";
>
> open($fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die;
>
> my $index = 1;
>
> while(my $stringfh = new IO::String(get_next_record($fh))){
>
> 	my $seqio = Bio::SeqIO->new( -fh     => $stringfh,-format =>  
> "EMBL" ) or die $!;
>
> 	while ( my $seq_object = $seqio->next_seq ) {
> 	  print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";
> 	
> 	  # show the features
> 	  for my $feat_object ($seq_object->get_SeqFeatures) {
> 		   print "primary tag: ", $feat_object->primary_tag, "\n";
> 		   for my $tag ($feat_object->get_all_tags) {
> 		      print "  tag: ", $tag, "\n";
> 		      for my $value ($feat_object->get_tag_values($tag)) {
> 		         print "    value: ", $value, "\n";
> 		      }
> 		   }
> 		}
> 	}
>
> }
>
>
> sub get_next_record{
> 	my($fh) = @_;
> 	(my $old_sep,$/) = ($/,SEP);
> 	my $record = <$fh>;
> 	$/ = $old_sep;
> 	return $record;
> }
> ========================================
>
>
> --Russell
>
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of brian li
>> Sent: Friday, 8 May 2009 1:00 a.m.
>> To: Chris Fields
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>>
>> My server has 32 GB RAM.
>>
>> The os of my server is 64-bit version of Ubuntu Server Edition 8.04
>> LTS. And I have run my example code on another server with 32-bit
>> version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.
>>
>> -Brian
>>
>> On Thu, May 7, 2009 at 8:07 PM, Chris Fields  
>> <cjfields at illinois.edu> wrote:
>>> I noticed that Russell has 16GB RAM on his setup.  Was yours  
>>> equivalent?
>>>
>>> chris
>>>
>>> On May 7, 2009, at 12:32 AM, brian li wrote:
>>>
>>>> Thank you very much for your offer.
>>>>
>>>> The director of our lab wants me to do the extraction every time  
>>>> a new
>>>> release of EMBL is published. I can't push the task to you every  
>>>> time.
>>>>
>>>> I can offer more information of the server I run my script on if  
>>>> needed.
>>>>
>>>> -Brian
>>>>
>>>> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
>>>> <Russell.Smithies at agresearch.co.nz> wrote:
>>>>>
>>>>> Sadly, that's the same code as I ran but I had a Data::Dump in the
>>>>> middle.
>>>>> Versions of Perl and BioPerl are the same.
>>>>> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM
>>>>>
>>>>> If you get a full script running on a smaller dataset, I could  
>>>>> probably
>>>>> run it on the bigger stuff and give you back tab-separated (or  
>>>>> is that
>>>>> tab\tseparated ?) data for loading into your db.
>>>>>
>>>>> --Russell
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: brian li [mailto:brianli.cas at gmail.com]
>>>>>> Sent: Thursday, 7 May 2009 4:50 p.m.
>>>>>> To: Smithies, Russell
>>>>>> Cc: bioperl-l at lists.open-bio.org
>>>>>> Subject: Re: [Bioperl-l] Asking for advice on full EMBL  
>>>>>> extraction
>>>>>>
>>>>>> Dear Russell,
>>>>>>
>>>>>> My example code is as following. I omit the parse process and  
>>>>>> these
>>>>>> lines give me "Segmentation Fault" too.
>>>>>>
>>>>>> # Start of code
>>>>>> my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat',
>>>>>>                                             -format => 'EMBL');
>>>>>> my $index = 1;
>>>>>> while (my $seq = $seqio->next_seq)
>>>>>> {
>>>>>>    print "Dealing with entry: $index\n";
>>>>>>    $index++;
>>>>>> }
>>>>>> # End
>>>>>>
>>>>>> The platform I run this code on:
>>>>>> BioPerl 1.6.0
>>>>>> Perl 5.8.8
>>>>>> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
>>>>>>
>>>>>> I have monitored the memory usage when I run the code above.  
>>>>>> There is
>>>>>> always around 20GB free memory (buffer size counted in) left.  
>>>>>> So I
>>>>>> suppose the segfault can't be explained just by memory shortage.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>>
>>>>>> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
>>>>>> <Russell.Smithies at agresearch.co.nz> wrote:
>>>>>>>
>>>>>>> Hi Brian,
>>>>>>> I hate to say it but it worked OK for me using
>>>>>>> rel_ann_mus_01_r99.dat.gz and
>>>>>>
>>>>>> simple example Bio::SeqIO code from bugzilla
>>>>>>>
>>>>>>> It's not using more than 1GB memory on our server and doesn't  
>>>>>>> segfault.
>>>>>>>
>>>>>>> Send me your example code and I'll give it a go if you like.
>>>>>>>
>>>>>>>
>>>>>>> Russell Smithies
>>>>>>>
>>>>>>> Bioinformatics Applications Developer
>>>>>>> T +64 3 489 9085
>>>>>>> E  russell.smithies at agresearch.co.nz
>>>>>>>
>>>>>>> Invermay  Research Centre
>>>>>>> Puddle Alley,
>>>>>>> Mosgiel,
>>>>>>> New Zealand
>>>>>>> T  +64 3 489 3809
>>>>>>> F  +64 3 489 9174
>>>>>>> www.agresearch.co.nz
>>>>>>>
>>>>>>>
>>>>> = 
>>>>> = 
>>>>> = 
>>>>> = 
>>>>> = 
>>>>> ==================================================================
>>>>> Attention: The information contained in this message and/or  
>>>>> attachments
>>>>> from AgResearch Limited is intended only for the persons or  
>>>>> entities
>>>>> to which it is addressed and may contain confidential and/or  
>>>>> privileged
>>>>> material. Any review, retransmission, dissemination or other use  
>>>>> of, or
>>>>> taking of any action in reliance upon, this information by  
>>>>> persons or
>>>>> entities other than the intended recipients is prohibited by  
>>>>> AgResearch
>>>>> Limited. If you have received this message in error, please  
>>>>> notify the
>>>>> sender immediately.
>>>>> = 
>>>>> = 
>>>>> = 
>>>>> = 
>>>>> = 
>>>>> ==================================================================
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason at bioperl.org