[Biojava-dev] Bug when reading FASTA file with many DNA Sequences

Mapleson Daniel Dr (CMP) D.Mapleson at uea.ac.uk
Wed Feb 9 15:07:33 UTC 2011


I'm running Windows 7 Enterprise 64-bit and Java 1.6 64-bit.  I double-checked with JConsole that the 64-bit version of Java is the one in use at runtime.  I'm also using biojava3, in case that makes a difference.
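
For anyone who wants to confirm the same details from inside the application rather than with JConsole, here is a minimal sketch using only standard JDK calls. JvmInfo is a made-up name for illustration; os.arch typically reports the architecture of the running JVM (e.g. amd64 for a 64-bit JVM), and maxMemory() reflects the effective -Xmx setting.

```java
/** Hypothetical helper (not part of biojava): reports JVM architecture and heap limit. */
final class JvmInfo {

    /** Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx). */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** One-line summary suitable for logging at application start-up. */
    static String describe() {
        return "os.arch=" + System.getProperty("os.arch")
                + ", java.version=" + System.getProperty("java.version")
                + ", maxHeapMB=" + maxHeapMb();
    }
}
```

Printing JvmInfo.describe() at start-up would show whether the heap limit passed on the command line actually reached the JVM that runs the GUI.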

I tried the FastaReaderHelper.readFastaDNASequence(File f) version of the method and hit the same problem.  I still haven't found the lazySequenceLoad version.
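
In the meantime, here is a rough sketch of the name-based, load-on-demand approach I asked about, using only the JDK. FastaOffsetIndex is a hypothetical class of my own, not part of biojava3, and I am not claiming this is how lazySequenceLoad works internally. It assumes single-byte (ASCII) data and that each header line is exactly ">SequenceName". It scans the file once, remembers the byte offset where each record's data starts, and later seeks to that offset to read a single sequence, which is why it needs a File rather than an InputStream.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of biojava): maps FASTA names to byte offsets. */
final class FastaOffsetIndex {
    private final File file;
    // sequence name -> offset of the first byte after that record's header line
    private final Map<String, Long> offsets = new LinkedHashMap<String, Long>();

    /** Scans the file once, keeping only names and offsets in memory. */
    FastaOffsetIndex(File file) throws IOException {
        this.file = file;
        InputStream in = new BufferedInputStream(new FileInputStream(file));
        try {
            long pos = 0;             // number of bytes consumed so far
            boolean lineStart = true; // true when the next byte begins a line
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (lineStart && b == '>') {
                    StringBuilder header = new StringBuilder();
                    while ((b = in.read()) != -1) {
                        pos++;
                        if (b == '\n') break;
                        if (b != '\r') header.append((char) b);
                    }
                    offsets.put(header.toString().trim(), Long.valueOf(pos));
                    lineStart = true;
                } else {
                    lineStart = (b == '\n');
                }
            }
        } finally {
            in.close();
        }
    }

    /** Reads only the named sequence; returns null if the name is unknown. */
    String load(String name) throws IOException {
        Long offset = offsets.get(name);
        if (offset == null) return null;
        FileInputStream raw = new FileInputStream(file);
        try {
            // Moving the channel position also moves the stream's read position.
            raw.getChannel().position(offset.longValue());
            InputStream in = new BufferedInputStream(raw);
            StringBuilder seq = new StringBuilder();
            boolean lineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (lineStart && b == '>') break; // next record begins here
                if (b == '\n') {
                    lineStart = true;
                } else if (b != '\r') {
                    seq.append((char) b);         // keep residues, drop line breaks
                    lineStart = false;
                }
            }
            return seq.toString();
        } finally {
            raw.close();
        }
    }

    /** Sequence names in file order. */
    Iterable<String> names() {
        return offsets.keySet();
    }
}
```

With this, new FastaOffsetIndex(genomeFile).load("SequenceName") would hold one sequence in memory at a time instead of the whole LinkedHashMap. It returns a plain String; turning that into a DNASequence is left out because I don't want to guess at the right biojava3 constructor.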

Best regards,
Dan

>-----Original Message-----
>From: Scooter Willis [mailto:HWillis at scripps.edu]
>Sent: Wednesday, February 09, 2011 2:48 PM
>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA Sequences
>
>Dan
>
>I usually work with 40MB DNA files with no problem. I will concatenate some
>together and test a 245MB version. What operating system and version of Java
>are you using? 32-bit or 64-bit? The GC should be able to keep up in
>lazySequenceLoad mode.
>
>You need to use File because an InputStream doesn't provide the ability to
>seek to an arbitrary offset, as it could be an HTTP stream etc.
>
>Thanks
>
>Scooter
>
>On 2/9/11 9:38 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk> wrote:
>
>>Hi Scooter,
>>
>>Thanks for the quick feedback.
>>
>>Unfortunately, memory isn't the issue.  I set my JVM to use a 2500MB max
>>heap (the most I can get away with on my machine) and still encountered
>>the same problem.  On a colleague's machine the max heap is
>>essentially unbounded and he still gets the same error.  It seems to be
>>related to the garbage collector removing temporary items from
>>memory rather than to the maximum available memory.
>>http://stackoverflow.com/questions/1393486/what-means-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee
>>
>>Also the FastaReaderHelper.readFastaDNASequence method ran into the same
>>problem.  I passed in an input stream rather than a file, but I don't
>>think that should cause the problem, should it?  Also I couldn't find an
>>overloaded variant with the lazySequenceLoad signature.  Maybe I'm using
>>an older version, but I couldn't find it in the biojava3 API docs either.
>>
>>Any other ideas?
>>
>>Best regards,
>>Dan
>>
>>
>>>-----Original Message-----
>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>Sent: Wednesday, February 09, 2011 1:31 PM
>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA Sequences
>>>
>>>Daniel
>>>
>>>You have two options. The first is to run java -Xmx2048m (the rest of your
>>>parameters) and the out of memory error will go away. The second is a Helper
>>>method that reads the fasta file and lazy loads each sequence when you
>>>request it. If you call this method
>>>
>>>FastaReaderHelper.readFastaDNASequence(File f, boolean lazySequenceLoad)
>>>
>>>you will be able to load the entire fasta file with minimal memory
>>>requirements.
>>>
>>>Even though your fasta file is X in size, when we load it into memory each
>>>sequence position is represented by a Java object, so the memory footprint
>>>will be larger.
>>>
>>>Let me know if you don't have that particular method in the jars you are
>>>using. I'm not sure what is in the latest release of the jars. If you look
>>>in the biojava3-genome module you will find examples of working with the DNA
>>>sequences to translate proteins etc., assuming you have CDS features to map
>>>onto your sequences.
>>>
>>>Thanks
>>>
>>>Scooter
>>>
>>>On 2/9/11 7:13 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk> wrote:
>>>
>>>>Hello,
>>>>
>>>>I'm trying to read a FASTA file that contains just over 4000 DNA
>>>>sequences and is around 270MB in size.  Each sequence starts like this:
>>>>">SequenceName" followed by a linefeed.  The actual DNA sequence data
>>>>contains a linefeed every 40 characters or so.
>>>>
>>>>I want to read the data into a LinkedHashMap object, similar to the
>>>>example you give in your cookbook:
>>>>
>>>>FileInputStream inStream = new FileInputStream(genomeFile);
>>>>FastaReader<DNASequence, NucleotideCompound> fastaReader =
>>>>        new FastaReader<DNASequence, NucleotideCompound>(
>>>>                inStream,
>>>>                new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
>>>>                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
>>>>
>>>>try {
>>>>    genomeData = fastaReader.process();
>>>>} catch (Exception ex) { }
>>>>
>>>>This works on some files but not the one containing the 4000 sequences.
>>>>I get an exception generated by the JVM:
>>>>
>>>>Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>>>        at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>>>        at org.biojava3.core.sequence.io.FastaReader.process(FastaReader.java:111)
>>>>        at workbench.Process_Hits_Patman.openFastaFile(Process_Hits_Patman.java:414)
>>>>        at workbench.Process_Hits_Patman.readFiles(Process_Hits_Patman.java:451)
>>>>        at workbench.MirCat.openFile(MirCat.java:283)
>>>>        at workbench.MainMDIWindow.openMenuItemActionPerformed(MainMDIWindow.java:252)
>>>>        at workbench.MainMDIWindow.access$000(MainMDIWindow.java:25)
>>>>        at workbench.MainMDIWindow$1.actionPerformed(MainMDIWindow.java:140)
>>>>        at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>>>>        at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>>>>        at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>>>>        at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>>>>        at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>>>>        at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1225)
>>>>        at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1266)
>>>>        at java.awt.Component.processMouseEvent(Component.java:6263)
>>>>        at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>>>>        at java.awt.Component.processEvent(Component.java:6028)
>>>>        at java.awt.Container.processEvent(Container.java:2041)
>>>>        at java.awt.Component.dispatchEventImpl(Component.java:4630)
>>>>        at java.awt.Container.dispatchEventImpl(Container.java:2099)
>>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>        at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4574)
>>>>        at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>>>>        at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>>>>        at java.awt.Container.dispatchEventImpl(Container.java:2085)
>>>>        at java.awt.Window.dispatchEventImpl(Window.java:2475)
>>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>        at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
>>>>        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>>>>        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>>>>
>>>>The amount of memory on the system isn't an issue.  I tried this on a
>>>>machine with 12GB of RAM.  It seems to be an issue with the garbage
>>>>collector getting tired of deleting temporary objects!  Also I noticed
>>>>that although the file is less than 300MB, the amount of heap space
>>>>used increases from 100MB to over 900MB inside FastaReader.process
>>>>before the exception occurs.
>>>>
>>>>Unfortunately I can't share the FASTA file that is causing the problem.
>>>>
>>>>Would it be possible for you guys to look into this and either produce a
>>>>fix or suggest a workaround?  Also do you think there is some way to
>>>>optimise the performance and memory usage of this process?
>>>>
>>>>Finally, I have a question about selectively loading sequences from a
>>>>FASTA file, the idea being to reduce memory usage.  Is it possible to
>>>>do this using biojava?  i.e. given a DNA sequence name, only load that
>>>>sequence into memory?  Or do we have to load the entire FASTA file into a
>>>>LinkedHashMap each time?
>>>>
>>>>Thanks in advance for your help on this one,
>>>>
>>>>Best regards,
>>>>Dr Daniel Mapleson (UEA)