[Biojava-dev] Bug when reading FASTA file with many DNA Sequences

Scooter Willis HWillis at scripps.edu
Wed Feb 9 16:27:06 UTC 2011


Dan

Glad that worked. If you only need to use each sequence once, you could
remove it from the hash map after use (allowing the GC to reclaim it),
which should keep your memory usage low.
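
A minimal sketch of that suggestion, under stated assumptions: the map values here are plain String objects standing in for the DNASequence values that the reader returns, and countGC is a hypothetical placeholder for whatever per-sequence processing is actually needed. The point is only the pattern of removing each entry through the iterator once it has been used.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class ProcessOnceSketch {

    /** Hypothetical per-sequence work; here it just counts G and C bases. */
    static int countGC(String sequence) {
        int count = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (c == 'G' || c == 'C') {
                count++;
            }
        }
        return count;
    }

    /**
     * Visits every entry once and removes it right after use, so the map
     * no longer references the sequence and the GC is free to reclaim it.
     */
    static int processAndRelease(LinkedHashMap<String, String> sequences) {
        int total = 0;
        Iterator<Map.Entry<String, String>> it = sequences.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            total += countGC(entry.getValue());
            it.remove(); // removing via the iterator is safe during iteration
        }
        return total;
    }
}
```

Removing through the iterator avoids the ConcurrentModificationException that calling remove on the map itself inside a for-each loop would risk.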

Scooter

On 2/9/11 11:22 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
wrote:

>Thanks Scooter.  That's great.  This jar fixes the problem.  The file is
>loaded really quickly and memory usage has decreased massively too.
>
>We will need to process most, if not all, of the sequences in the file at
>some stage, so the ability to free up memory containing sequences that
>aren't currently being processed/used will be useful going forward with our
>project.  It's not urgent though.  The main thing from my perspective is
>that we can actually run the program, which your fix allows us to do.
>
>Thanks again for the quick turnaround!  Much appreciated! :)
>
>Best regards,
>Dan
>
>>-----Original Message-----
>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>Sent: Wednesday, February 09, 2011 3:50 PM
>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>Sequences
>>
>>Dan
>>
>>I have attached a copy of my biojava3-core that has that method in it as
>>well as other memory/speed optimizations I worked on. Sounds like that
>>method (recently added) hasn't made its way into the current biojava3
>>jars.
>>You should see a dramatic reduction in memory if you only need to select
>>a subset of sequences. Trying to load a 245MB fasta file does take lots
>>of memory. If you plan on reading every sequence then you will eventually
>>run into a memory problem, as I am currently not freeing up the sequence
>>data that is loaded lazily. My plan is to add some optimization
>>hints/logic that the developer can control, so that every time you load a
>>new sequence and use more memory I will internally free up sequence data
>>that was allocated earlier. If you go back to a sequence that has had its
>>storage deallocated then I will simply reload it. This way you can work
>>with very large sequence files at a genome scale without running out of
>>memory or being forced to put them in a database.
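
The free-and-reload behaviour planned above could look roughly like the following sketch. It is not the BioJava implementation: the class name, the String payload, and the loader function are all assumptions made for illustration. It keeps at most a fixed number of sequences in memory, drops the least recently used one when a new one is loaded, and reloads on demand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class SequenceCacheSketch {
    private final Function<String, String> loader; // hypothetical: reads one sequence from disk
    private final LinkedHashMap<String, String> loaded;
    private int loadCount = 0;

    SequenceCacheSketch(final int maxLoaded, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder = true keeps entries ordered from least to most recently used
        this.loaded = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Called after each put: evict the least recently used entry
                // once more than maxLoaded sequences are held.
                return size() > maxLoaded;
            }
        };
    }

    /** Returns the sequence data, loading (or reloading) it if it is not in memory. */
    String get(String name) {
        String data = loaded.get(name);
        if (data == null) {
            data = loader.apply(name);
            loadCount++;
            loaded.put(name, data);
        }
        return data;
    }

    int getLoadCount() {
        return loadCount;
    }

    boolean isLoaded(String name) {
        return loaded.containsKey(name);
    }
}
```

A fixed entry count is the simplest policy to show; a real implementation might instead bound total sequence length or use soft references, which is a design decision this sketch does not try to settle.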
>>
>>Let me know if this works and whether you need to analyze every sequence,
>>and I will see if I can find some time to add the lazy-load memory
>>management features.
>>
>>Thanks
>>
>>Scooter
>>
>>
>>On 2/9/11 10:07 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>wrote:
>>
>>>I'm running Windows 7 Enterprise 64-bit and Java 1.6 64-bit.  I double
>>>checked I was using the 64-bit version of Java at runtime using JConsole.
>>>I'm also using biojava3, in case that makes a difference.
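
As an alternative to JConsole, the same facts can be checked from inside the program. This is a small sketch using only standard JDK calls; the class name is made up, and the sun.arch.data.model property is specific to HotSpot-style JVMs, so it may be null elsewhere.

```java
class HeapCheckSketch {

    /** Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx). */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("os.arch: " + System.getProperty("os.arch"));
        // HotSpot-specific property; may print null on other JVMs.
        System.out.println("data model: " + System.getProperty("sun.arch.data.model"));
    }
}
```

Running it with the same JVM options as the application shows whether a setting such as -Xmx actually took effect.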
>>>
>>>I tried the FastaReaderHelper.readFastaDNASequence(File f) version of
>>>the method, but I still haven't found the lazySequenceLoad version.
>>>Same problem.
>>>
>>>Best regards,
>>>Dan
>>>
>>>>-----Original Message-----
>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>Sent: Wednesday, February 09, 2011 2:48 PM
>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>>Sequences
>>>>
>>>>Dan
>>>>
>>>>I usually do 40MB DNA files with no problem. I will concat together
>>>>and test a 245MB version. What operating system and version of Java?
>>>>32bit or 64bit? The GC should be able to keep up in lazySequenceLoad
>>>>mode.
>>>>
>>>>You need to use File because an InputStream doesn't provide the
>>>>ability to seek to an arbitrary offset, as it could be an HTTP
>>>>stream etc.
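
To illustrate why a File matters here, the sketch below indexes header offsets in one pass and later seeks straight to a single record with RandomAccessFile. It is an illustration of the general technique, not the BioJava code: the class and method names are hypothetical, the file is assumed to be plain ASCII FASTA, and RandomAccessFile.readLine is unbuffered, so a real implementation would want buffered reads.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

class FastaOffsetIndexSketch {

    /** One pass over the file, recording the byte offset of each header line. */
    static Map<String, Long> indexHeaders(File fasta) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<String, Long>();
        RandomAccessFile raf = new RandomAccessFile(fasta, "r");
        try {
            long lineStart = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    offsets.put(line.substring(1).trim(), lineStart);
                }
                lineStart = raf.getFilePointer();
            }
        } finally {
            raf.close();
        }
        return offsets;
    }

    /** Seeks to one record and reads only that record's sequence lines. */
    static String loadSequence(File fasta, long headerOffset) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(fasta, "r");
        try {
            raf.seek(headerOffset); // not possible on a generic InputStream
            raf.readLine();         // skip the header line itself
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                sb.append(line.trim());
            }
            return sb.toString();
        } finally {
            raf.close();
        }
    }
}
```

Only the name-to-offset map stays in memory; sequence data is read when a caller asks for a specific record.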
>>>>
>>>>Thanks
>>>>
>>>>Scooter
>>>>
>>>>On 2/9/11 9:38 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>>wrote:
>>>>
>>>>>Hi Scooter,
>>>>>
>>>>>Thanks for the quick feedback.
>>>>>
>>>>>Unfortunately, the memory isn't the issue. I set my JVM to use 2500MB max
>>>>>heap (the most I can get away with on my machine), and still encountered
>>>>>the same problem.  On a colleague's machine the max heap is essentially
>>>>>unbounded and he still gets the same error.  It seems to be something to
>>>>>do with the garbage collector removing temporary items from memory rather
>>>>>than max available memory.
>>>>>http://stackoverflow.com/questions/1393486/what-means-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee
>>>>>
>>>>>Also the FastaReaderHelper.readFastaDNASequence method ran into the same
>>>>>problem.  I passed in an input stream rather than a file but I don't
>>>>>think that should cause the problem, should it?  Also I couldn't find an
>>>>>overloaded variant with the lazySequenceLoad signature.  Maybe I'm using
>>>>>an older version but I couldn't find it in the biojava3 API docs either.
>>>>>
>>>>>Any other ideas?
>>>>>
>>>>>Best regards,
>>>>>Dan
>>>>>
>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>>>Sent: Wednesday, February 09, 2011 1:31 PM
>>>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>>>>Sequences
>>>>>>
>>>>>>Daniel
>>>>>>
>>>>>>You have two options. The first is to run java -Xmx2048m (the rest of
>>>>>>your parameters) and the out of memory error will go away. The second:
>>>>>>I have a Helper method that will read the fasta file and lazy load when
>>>>>>you request a sequence. If you call this method
>>>>>>
>>>>>>FastaReaderHelper.readFastaDNASequence(File f, boolean lazySequenceLoad)
>>>>>>you will be able to load the entire fasta file with minimal memory
>>>>>>requirements.
>>>>>>
>>>>>>Even though your fasta file is X in size, when we load it into memory
>>>>>>each sequence position gets represented by a Java object, so the memory
>>>>>>footprint will be larger.
>>>>>>
>>>>>>Let me know if you don't have that particular method in the jars you are
>>>>>>using. Not sure of the latest release on jars. If you look in the
>>>>>>biojava3-genome module you will find examples of working with the
>>>>>>DNA sequences to translate proteins etc assuming you have CDS
>>>>>>features to map onto your sequences.
>>>>>>
>>>>>>Thanks
>>>>>>
>>>>>>Scooter
>>>>>>
>>>>>>On 2/9/11 7:13 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>>>>wrote:
>>>>>>
>>>>>>>Hello,
>>>>>>>
>>>>>>>I'm trying to read a FASTA file that contains just over 4000 DNA
>>>>>>>sequences and is around 270MB big.  Each sequence starts like this:
>>>>>>>">SequenceName" followed by a linefeed.  The actual DNA sequence
>>>>>>>data does contain a linefeed every 40 characters or so.
>>>>>>>
>>>>>>>I want to read in the data into a LinkedHashMap object, similar to the
>>>>>>>example you specify in your cookbook:
>>>>>>>
>>>>>>>FileInputStream inStream = new FileInputStream(genomeFile);
>>>>>>>FastaReader<DNASequence, NucleotideCompound> fastaReader =
>>>>>>>        new FastaReader<DNASequence, NucleotideCompound>(
>>>>>>>                inStream,
>>>>>>>                new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
>>>>>>>                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
>>>>>>>
>>>>>>>try {
>>>>>>>    genomeData = fastaReader.process();
>>>>>>>} catch (Exception ex) { }
>>>>>>>
>>>>>>>This works on some files but not the one containing the 4000 sequences.
>>>>>>>I get an exception generated by the JVM:
>>>>>>>
>>>>>>>Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>>        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>>>>>>        at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>>>>>>        at org.biojava3.core.sequence.io.FastaReader.process(FastaReader.java:111)
>>>>>>>        at workbench.Process_Hits_Patman.openFastaFile(Process_Hits_Patman.java:414)
>>>>>>>        at workbench.Process_Hits_Patman.readFiles(Process_Hits_Patman.java:451)
>>>>>>>        at workbench.MirCat.openFile(MirCat.java:283)
>>>>>>>        at workbench.MainMDIWindow.openMenuItemActionPerformed(MainMDIWindow.java:252)
>>>>>>>        at workbench.MainMDIWindow.access$000(MainMDIWindow.java:25)
>>>>>>>        at workbench.MainMDIWindow$1.actionPerformed(MainMDIWindow.java:140)
>>>>>>>        at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>>>>>>>        at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>>>>>>>        at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>>>>>>>        at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>>>>>>>        at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>>>>>>>        at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1225)
>>>>>>>        at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1266)
>>>>>>>        at java.awt.Component.processMouseEvent(Component.java:6263)
>>>>>>>        at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>>>>>>>        at java.awt.Component.processEvent(Component.java:6028)
>>>>>>>        at java.awt.Container.processEvent(Container.java:2041)
>>>>>>>        at java.awt.Component.dispatchEventImpl(Component.java:4630)
>>>>>>>        at java.awt.Container.dispatchEventImpl(Container.java:2099)
>>>>>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>>>>        at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4574)
>>>>>>>        at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>>>>>>>        at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>>>>>>>        at java.awt.Container.dispatchEventImpl(Container.java:2085)
>>>>>>>        at java.awt.Window.dispatchEventImpl(Window.java:2475)
>>>>>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>>>>        at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
>>>>>>>        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>>>>>>>        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>>>>>>>
>>>>>>>The amount of memory on the system isn't an issue.  I tried this on
>>>>>>>a machine with 12GB of RAM.  It seems to be an issue with the
>>>>>>>garbage collector getting tired of deleting temporary objects!
>>>>>>>Also I noticed that although the file is less than 300MB, the actual
>>>>>>>amount of heap space used increases from 100MB to over 900MB in
>>>>>>>FastaReader.process before the exception occurs.
>>>>>>>
>>>>>>>Unfortunately I can't share the FASTA file that is causing the problem.
>>>>>>>
>>>>>>>Would it be possible for you guys to look into this and either produce
>>>>>>>a fix or suggest a workaround?  Also do you think there is some way to
>>>>>>>optimise the performance and memory usage of this process?
>>>>>>>
>>>>>>>Finally, I have a question about selectively loading sequences from
>>>>>>>a FASTA file.  The idea being to reduce memory usage.  Is it possible
>>>>>>>to do this using biojava?  i.e. given a DNA sequence name, only load
>>>>>>>that sequence into memory?  Or do we have to load the entire FASTA file
>>>>>>>into a LinkedHashMap each time?
>>>>>>>
>>>>>>>Thanks in advance for your help on this one,
>>>>>>>
>>>>>>>Best regards,
>>>>>>>Dr Daniel Mapleson (UEA)
>>>>>>>
>>>>>>>_______________________________________________
>>>>>>>biojava-dev mailing list
>>>>>>>biojava-dev at lists.open-bio.org
>>>>>>>http://lists.open-bio.org/mailman/listinfo/biojava-dev