[Biojava-l] memory leak while reading nr.fasta

Richard HOLLAND hollandr at gis.a-star.edu.sg
Mon Jul 4 01:33:41 EDT 2005


This is one big problem, and I've come across it before.
SeqIOTools.fileToBiojava reads the whole file in at once and stores
everything in memory as Sequence objects in a virtual sequence database.
For a file the size of nr, this is simply impossible on most machines,
and causes out-of-memory exceptions.

What is required for files this size is a SeqIOTools parser that reads
sequence objects _on demand_ as requested by the iterator, rather than
reading the whole lot at once. This way it can drop sequence objects
once they have been passed over by the iterator, freeing up memory for
subsequent ones (assuming the client app keeps no references to them
either). Whether this fits in with BioJava's "everything is a sequence
database" philosophy I don't know, as it essentially breaks it by
defining a file to be a sequential-access sequence database rather than
a random-access one.
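
To make the on-demand idea concrete, here is a minimal sketch in plain
Java. It does not use any BioJava classes: LazyFastaIterator and its
Record type are hypothetical names invented for this illustration, and
it assumes a plain FASTA file where every entry starts with a '>' line.
It holds only one record at a time, so each record becomes garbage once
the caller drops its reference.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sequential-access FASTA reader; not part of BioJava. */
class LazyFastaIterator implements Iterator<LazyFastaIterator.Record> {

    /** One FASTA entry: the header line (without '>') and its residues. */
    static final class Record {
        final String header;
        final String residues;

        Record(String header, String residues) {
            this.header = header;
            this.residues = residues;
        }
    }

    private final BufferedReader in;
    // Header of the record that next() will return, or null at end of input.
    private String pendingHeader;

    LazyFastaIterator(BufferedReader in) {
        this.in = in;
        this.pendingHeader = readUntilHeader(null);
    }

    /**
     * Reads lines up to the next '>' header or end of input. Non-header
     * lines are appended to sink when it is non-null, otherwise skipped.
     */
    private String readUntilHeader(StringBuilder sink) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    return line.substring(1).trim();
                }
                if (sink != null) {
                    sink.append(line.trim());
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    @Override
    public Record next() {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String header = pendingHeader;
        StringBuilder residues = new StringBuilder();
        // Parse only this record; the rest of the file stays unread.
        pendingHeader = readUntilHeader(residues);
        return new Record(header, residues.toString());
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the FASTA file, e.g. nr.fasta
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            LazyFastaIterator it = new LazyFastaIterator(br);
            long count = 0;
            while (it.hasNext()) {
                it.next(); // record is unreferenced after this statement
                count++;
            }
            System.out.println("number of sequences: " + count);
        }
    }
}
```

Memory use here is bounded by the longest single record rather than by
the file size. The trade-off is the one described above: access is
strictly sequential, with no lookup by name or index.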

Can someone clarify if a lazy-loading parser/database implementation
already exists for situations like this, or does one need to be written?

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199


> -----Original Message-----
> From: biojava-l-bounces at portal.open-bio.org 
> [mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of Gem Yang
> Sent: Friday, July 01, 2005 2:30 AM
> To: biojava-l at biojava.org
> Subject: [Biojava-l] memory leak while reading nr.fasta
> 
> 
> Hi,
> 
> I am new to Biojava.  
> I have the following program, which is copied from ReadFaster2 in the
> cookbook.
> 
> public static void main(String[] args) {
> 	try {
> 		// args[0] is nr.fasta
> 	  BufferedReader br = new BufferedReader(new FileReader(args[0]));
> 
> 	  String format = "FASTA";
> 	  String alphabet = "PROTEIN";
> 
> 	  SequenceIterator iter =
> 	      (SequenceIterator)SeqIOTools.fileToBiojava(format, alphabet, br);
> 
> 	  int count =0; 
> 	  long start = System.currentTimeMillis();
> 	  while(iter.hasNext())
> 	  {
> 	  		Sequence s = iter.nextSequence();
> 	  		String name = s.getName();
> 	  		
> 	  		//System.out.println(name);
> 	  		s.getAnnotation();
> 	  		//System.out.println(s.seqString());
> 	  		count ++;
> 	  		System.out.println(count);
> 	  		
> 	  }
> 	  long end = System.currentTimeMillis();
> 	  System.out.println("number of sequences " + count);
> 	  System.out.println("time used " + (end-start)/1000 + " seconds");
> 	  System.out.println((end-start)/1000/60 + " minutes");
> 	}
> 	catch (FileNotFoundException ex) {
> 	  //can't find file specified by args[0]
> 	  ex.printStackTrace();
> 	}catch (BioException ex) {
> 	  //error parsing requested format
> 	  ex.printStackTrace();
> 	}
>   }
> 
> When running this code, I got an out-of-memory error after about half
> an hour, with 1.5 GB of memory allocated. My workstation runs Windows
> XP with 2 GB of memory. My BioJava version is 1.3. My JRE is the one
> that came with Websphere application developer.
> 
> Thanks.
> Gem
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
> 


