[Bioperl-l] Asking for advice on full EMBL extraction

Smithies, Russell Russell.Smithies at agresearch.co.nz
Sun May 10 20:49:56 UTC 2009


How about splitting the big file into smaller chunks and processing one sequence at a time?
It could be one specific feature line that's causing the segfault and nothing to do with file size.
You should be able to split the file with awk as well (I like awk :-)

zcat rel_ann_mus_01_r99.dat.gz | awk 'BEGIN{RS="//";OFS="\n"}{$1=$1; print > "chunk"NR}' 

--Russell

> -----Original Message-----
> From: brian li [mailto:brianli.cas at gmail.com]
> Sent: Saturday, 9 May 2009 2:49 a.m.
> To: Smithies, Russell
> Cc: bioperl-l at lists.open-bio.org; Jason Stajich; Chris Fields
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
> 
> open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk
> '!/^FT|^CO/{print}' |" works.
> open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^
> /{print}' |" segfaults.
> 
> So it seems the features are causing problems. Although I still don't
> know how that hurts my os to pop a segfault, my extraction can move on
> again. Maybe I can find a clue when I know more about my os's memory
> management strategy.
> 
> Really appreciate all your help.
> 
> -Brian
> 
> On Fri, May 8, 2009 at 8:03 AM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz> wrote:
> > I think the problem here though is the size of the sequences rather than too
> > many features.
> >
> > If one was inclined to bodge/hack and didn't care about sequence, I guess
> > you could filter them out with awk so Bio::SeqIO doesn't have to create the
> > Bio::PrimarySeq J
> >
> > Probably breaks the EMBL file spec .
> >
> > Eg.
> >
> > open( $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ /{print}' |"
> > ) or die;
> >
> >
> >
> >
> >
> > --Russell
> >
> >
> >
> >
> >
> >
> >
> > From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason
> > Stajich
> > Sent: Friday, 8 May 2009 11:25 a.m.
> > To: Smithies, Russell
> > Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
> > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
> >
> >
> >
> > It parses from a stream or file, one sequence at a time so it only reads a
> > single sequence out at a time, but it does have to parse that whole sequence
> > record which is where feature rich sequences might be causing problems.
> >
> >
> >
> > I think per your other mention of Tie::File - the whole file is not going
> > into memory so that is not the problem, it is the creation of many objects
> > that it does as it parses the sequence that is likely the problem.  It will
> > read up to the first "//" from that Tie::File anyways, that becomes an
> > entire string which is then parsed to pull out the relevant features so you
> > don't gain anything with Tie::File -- what would be the way to solve it is
> > if the objects could be created and reside in a DB on disk rather than
> > in-memory.  I'd really enjoy seeing more indexed and hashed data to objects
> > stored on disk when mem requirements are such so that very large datasets
> > can be handled more nimbly.
> >
> >
> >
> > I think there have been several attempts to simplify, but it basically means
> > a dedicated developer to really overhaul or map to a new system.  What we've
> > tried to build is a decent API so a new implementation can be done without
> > affecting the 'next_seq' and 'write_seq' API.
> >
> >
> >
> > Non-withstanding the seemed API confusion caused by _ancient_ decisions on
> > giving function names of Bio::SeqFeatureI 'seq' and Bio::PrimarySeq 'seq'
> > which return different types -- don't forget that Lincoln's Bio::DB::Fasta
> > uses the 'seq' method to return a sequence as a string as well so major API
> > changes in general here will create in all likelihood a big split between
> > the branches that will make any new Bioperl not match up well with existing
> > scripts or libraries that use it - hence the reason for no "great
> > realigning" to a completely well-planned out API rather than the organically
> > grown whims of several generations of devs.  I say this in jest a bit - I do
> > want to see changes, but I think it really will have to be called something
> > else besides BioPerl to avoid confusion and the fact that a lot of things
> > will break that depend on the current APIs.  BioPerl2 or something
> > indicating a Perl6 association.
> >
> >
> >
> > -jason
> >
> > On May 7, 2009, at 3:05 PM, Smithies, Russell wrote:
> >
> > OK, I misunderstood, I thought the entire file loaded was loaded into memory
> > first then each sequence was extracted from there.
> > I hoped splitting into 588 individual sequences might help.
> >
> > --Russell
> >
> > From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason
> > Stajich
> > Sent: Friday, 8 May 2009 9:55 a.m.
> > To: Smithies, Russell
> > Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
> > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
> >
> > Russell -
> >
> > I am not sure how that will help as only 1 sequence is parsed at a time by
> > SeqIO parsers and they use the "//" delimiter.
> >
> > If the equivalent data exists in genbank format at NCBI I think _that_
> >  module (Bio::SeqIO::genbank) has the ability to ignore
> > annotations/features.  Really we have to re-work the whole thing to be more
> > lightweight and lazy-parse.
> >
> > -jason
> > On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:
> >
> >
> > I'm not sure if this will help with your problem or how it deals with memory
> > management but using "ordinary" Perl to split the large EMBL file might
> > work.
> > Give this a go:
> >
> > ============================
> > #!perl -w
> >
> > use Bio::SeqIO;
> > use IO::String;
> >
> > use constant SEP => "//\n";
> >
> > open($fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die;
> >
> > my $index = 1;
> >
> > while(my $stringfh = new IO::String(get_next_record($fh))){
> >
> >          my $seqio = Bio::SeqIO->new( -fh     => $stringfh,-format => "EMBL"
> > ) or die $!;
> >
> >          while ( my $seq_object = $seqio->next_seq ) {
> >           print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";
> >
> >           # show the features
> >           for my $feat_object ($seq_object->get_SeqFeatures) {
> >                        print "primary tag: ", $feat_object->primary_tag,
> > "\n";
> >                        for my $tag ($feat_object->get_all_tags) {
> >                           print "  tag: ", $tag, "\n";
> >                           for my $value ($feat_object->get_tag_values($tag))
> > {
> >                              print "    value: ", $value, "\n";
> >                           }
> >                        }
> >                      }
> >          }
> >
> > }
> >
> >
> > sub get_next_record{
> >          my($fh) = @_;
> >          (my $old_sep,$/) = ($/,SEP);
> >          my $record = <$fh>;
> >          $/ = $old_sep;
> >          return $record;
> > }
> > ========================================
> >
> >
> > --Russell
> >
> >
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org<mailto:bounces at lists.open-bio.org>] On Behalf Of
> > brian li
> > Sent: Friday, 8 May 2009 1:00 a.m.
> > To: Chris Fields
> > Cc: bioperl-l at lists.open-bio.org<mailto:bioperl-l at lists.open-bio.org>
> > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
> >
> > My server has 32 GB RAM.
> >
> > The os of my server is 64-bit version of Ubuntu Server Edition 8.04
> > LTS. And I have run my example code on another server with 32-bit
> > version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.
> >
> > -Brian
> >
> > On Thu, May 7, 2009 at 8:07 PM, Chris Fields
> > <cjfields at illinois.edu<mailto:cjfields at illinois.edu>> wrote:
> > I noticed that Russell has 16GB RAM on his setup.  Was yours equivalent?
> >
> > chris
> >
> > On May 7, 2009, at 12:32 AM, brian li wrote:
> >
> > Thank you very much for your offer.
> >
> > The director of our lab wants me to do the extraction every time a new
> > release of EMBL is published. I can't push the task to you every time.
> >
> > I can offer more information of the server I run my script on if needed.
> >
> > -Brian
> >
> > On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
> >
> <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>>
> > wrote:
> >
> > Sadly, that's the same code as I ran but I had a Data::Dump in the
> > middle.
> > Versions of Perl and BioPerl are the same.
> > We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM
> >
> > If you get a full script running on a smaller dataset, I could probably
> > run it on the bigger stuff and give you back tab-separated (or is that
> > tab\tseparated ?) data for loading into your db.
> >
> > --Russell
> >
> > -----Original Message-----
> > From: brian li [mailto:brianli.cas at gmail.com]
> > Sent: Thursday, 7 May 2009 4:50 p.m.
> > To: Smithies, Russell
> > Cc: bioperl-l at lists.open-bio.org<mailto:bioperl-l at lists.open-bio.org>
> > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
> >
> > Dear Russell,
> >
> > My example code is as following. I omit the parse process and these
> > lines give me "Segmentation Fault" too.
> >
> > # Start of code
> > my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat',
> >                                            -format => 'EMBL');
> > my $index = 1;
> > while (my $seq = $seqio->next_seq)
> > {
> >   print "Dealing with entry: $index\n";
> >   $index++;
> > }
> > # End
> >
> > The platform I run this code on:
> > BioPerl 1.6.0
> > Perl 5.8.8
> > Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
> >
> > I have monitored the memory usage when I run the code above. There is
> > always around 20GB free memory (buffer size counted in) left. So I
> > suppose the segfault can't be explained just by memory shortage.
> >
> > Brian
> >
> >
> > On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
> >
> <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>>
> > wrote:
> >
> > Hi Brian,
> > I hate to say it but it worked OK for me using
> > rel_ann_mus_01_r99.dat.gz and
> >
> > simple example Bio::SeqIO code from bugzilla
> >
> > It's not using more than 1GB memory on our server and doesn't segfault.
> >
> > Send me your example code and I'll give it a go if you like.
> >
> >
> > Russell Smithies
> >
> > Bioinformatics Applications Developer
> > T +64 3 489 9085
> > E
> >  russell.smithies at agresearch.co.nz<mailto:russell.smithies at agresearch.co.nz>
> >
> > Invermay  Research Centre
> > Puddle Alley,
> > Mosgiel,
> > New Zealand
> > T  +64 3 489 3809
> > F  +64 3 489 9174
> > www.agresearch.co.nz<http://www.agresearch.co.nz>
> >
> >
> > =======================================================================
> > Attention: The information contained in this message and/or attachments
> > from AgResearch Limited is intended only for the persons or entities
> > to which it is addressed and may contain confidential and/or privileged
> > material. Any review, retransmission, dissemination or other use of, or
> > taking of any action in reliance upon, this information by persons or
> > entities other than the intended recipients is prohibited by AgResearch
> > Limited. If you have received this message in error, please notify the
> > sender immediately.
> > =======================================================================
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Jason Stajich
> > jason at bioperl.org<mailto:jason at bioperl.org>
> >
> >
> >
> >
> >
> > Jason Stajich
> >
> > jason at bioperl.org
> >
> >
> >
> >
> >
> >
> >
> >




More information about the Bioperl-l mailing list