[Biojava-l] memory leak while reading nr.fasta

mark.schreiber at novartis.com mark.schreiber at novartis.com
Mon Jul 4 01:46:55 EDT 2005


It is supposed to only read on demand. Are you sure it isn't??

As long as you don't keep references to the individual sequences, they 
should be destroyed by the garbage collector. If there is a real memory 
leak, something must be keeping references to them, but this is not the 
intended behaviour and would be a serious bug. A while back there was a 
problem with change listeners not getting disposed of. I thought this was 
resolved, but possibly it was not.
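
To illustrate the kind of retention described above, here is a minimal
sketch in plain Java. None of these classes are BioJava classes: Hub,
Listener and Item are hypothetical stand-ins, and the sketch only assumes
that a long-lived object holds strong references to listeners that
short-lived objects register on it. It is not a diagnosis of the actual
problem.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; these are not BioJava classes. */
class ListenerLeakDemo {

    interface Listener {
        void changed();
    }

    /** Long-lived object that keeps strong references to its listeners. */
    static final class Hub {
        private final List<Listener> listeners = new ArrayList<>();

        void add(Listener l) { listeners.add(l); }

        void remove(Listener l) { listeners.remove(l); }

        int size() { return listeners.size(); }
    }

    /** Stand-in for a per-sequence object that registers itself on creation. */
    static final class Item implements Listener {
        final byte[] payload = new byte[1024];

        Item(Hub hub) { hub.add(this); }

        @Override
        public void changed() { }
    }

    /** Creates n items and reports how many the hub still references. */
    static int retainedAfter(int n, boolean unregister) {
        Hub hub = new Hub();
        for (int i = 0; i < n; i++) {
            Item item = new Item(hub);
            if (unregister) {
                hub.remove(item); // drop the hub's reference explicitly
            }
            // 'item' goes out of scope here, but the hub may still hold it
        }
        return hub.size();
    }

    public static void main(String[] args) {
        System.out.println(retainedAfter(1000, false)); // prints 1000
        System.out.println(retainedAfter(1000, true));  // prints 0
    }
}
```

In the first call every Item stays reachable through the hub, so the
garbage collector cannot reclaim it even though the loop has dropped its
own reference.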

Would need an example to track this down.

- Mark





"Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
Sent by: biojava-l-bounces at portal.open-bio.org
07/04/2005 01:33 PM

 
        To:     <biojava-l at biojava.org>
        cc:     Gem Yang <Gem.Yang at jhu.edu>, (bcc: Mark Schreiber/GP/Novartis)
        Subject:        RE: [Biojava-l] memory leak while reading nr.fasta


This is one big problem, and I've come across it before.
SeqIOTools.fileToBiojava reads the whole file in at once and stores
everything in memory as Sequence objects in a virtual sequence database.
For a file the size of nr, this is simply impossible on most machines,
and causes out-of-memory exceptions.

What is required for files this size is a SeqIOTools parser that reads
sequence objects _on demand_ as requested by the iterator, rather than
reading the whole lot at once. This way it can drop sequence objects
once they have been passed over by the iterator, freeing up memory for
subsequent ones (assuming the client app keeps no references to them
either). Whether this fits in with BioJava's "everything is a sequence
database" philosophy I don't know, as it essentially breaks it by
defining a file to be a sequential-access sequence database, rather than
a random-access one.
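
As a rough sketch of the on-demand idea above, the following uses only
java.io and no BioJava classes. LazyFastaReader and its Record type are
hypothetical names made up for illustration. It holds at most one record
at a time, and it does no alphabet checking or annotation parsing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical helper, not part of BioJava: streams FASTA records one at a time. */
class LazyFastaReader implements Iterator<LazyFastaReader.Record> {

    /** One FASTA entry: the header line without '>' and the joined residues. */
    static final class Record {
        final String header;
        final String residues;

        Record(String header, String residues) {
            this.header = header;
            this.residues = residues;
        }
    }

    private final BufferedReader in;
    private String pendingHeader; // header of the next record, null at end of input

    LazyFastaReader(BufferedReader in) {
        this.in = in;
        try {
            // Skip anything before the first header line.
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line.substring(1).trim();
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    @Override
    public Record next() {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String header = pendingHeader;
        pendingHeader = null;
        StringBuilder residues = new StringBuilder();
        try {
            // Read only up to the next header; nothing beyond it is buffered here.
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line.substring(1).trim();
                    break;
                }
                residues.append(line.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Record(header, residues.toString());
    }

    public static void main(String[] args) {
        String fasta = ">seq1 first\nMKV\nLAA\n>seq2\nGG\n";
        LazyFastaReader it =
            new LazyFastaReader(new BufferedReader(new StringReader(fasta)));
        int count = 0;
        while (it.hasNext()) {
            Record r = it.next(); // previous record is unreachable after this
            count++;
            System.out.println(r.header + " " + r.residues.length());
        }
        System.out.println("number of sequences " + count);
    }
}
```

Because the caller's loop variable is the only reference to each record,
memory use should stay roughly bounded by the largest single entry, at
the cost of giving up random access to the file.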

Can someone clarify if a lazy-loading parser/database implementation
already exists for situations like this, or does one need to be written?

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199


> -----Original Message-----
> From: biojava-l-bounces at portal.open-bio.org 
> [mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of Gem Yang
> Sent: Friday, July 01, 2005 2:30 AM
> To: biojava-l at biojava.org
> Subject: [Biojava-l] memory leak while reading nr.fasta
> 
> 
> Hi,
> 
> I am new to Biojava. 
> I have the following program, which is copied from ReadFaster2 in the
> cookbook.
> 
> public static void main(String[] args) {
>     try {
>         // args[0] is nr.fasta
>         BufferedReader br = new BufferedReader(new FileReader(args[0]));
>
>         String format = "FASTA";
>         String alphabet = "PROTEIN";
>
>         SequenceIterator iter =
>             (SequenceIterator) SeqIOTools.fileToBiojava(format, alphabet, br);
>
>         int count = 0;
>         long start = System.currentTimeMillis();
>         while (iter.hasNext()) {
>             Sequence s = iter.nextSequence();
>             String name = s.getName();
>             //System.out.println(name);
>             s.getAnnotation();
>             //System.out.println(s.seqString());
>             count++;
>             System.out.println(count);
>         }
>         long end = System.currentTimeMillis();
>         System.out.println("number of sequence " + count);
>         System.out.println("time used " + (end - start) / 1000 + " seconds");
>         System.out.println((end - start) / 1000 / 60 + " minutes");
>     } catch (FileNotFoundException ex) {
>         // can't find file specified by args[0]
>         ex.printStackTrace();
>     } catch (BioException ex) {
>         // error parsing requested format
>         ex.printStackTrace();
>     }
> }
> 
> When running this code, I got an out-of-memory error after about half an
> hour, with 1.5 GB of memory allocated. My workstation runs Windows XP with
> 2 GB of memory. My BioJava version is 1.3. My JRE is the one that came
> with Websphere application developer.
> 
> Thanks.
> Gem
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
> 
