[Biojava-dev] Bug when reading FASTA file with many DNA Sequences

Scooter Willis HWillis at scripps.edu
Wed Feb 9 14:47:48 UTC 2011


Dan

I usually load 40MB DNA files with no problem. I will concatenate several
together and test a 245MB version. What operating system and Java version
are you on? 32-bit or 64-bit? The GC should be able to keep up in
lazySequenceLoad mode.

You need to pass a File because an InputStream cannot seek to an arbitrary
offset; it could be an HTTP stream, for example.
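
A minimal sketch of the idea, using only the JDK: record the byte offset of
each record once, then seek directly to a single sequence when it is
requested. FastaOffsetIndex and its methods are hypothetical names for
illustration. This is not BioJava's implementation of lazySequenceLoad, only
an example of why a seekable File is needed.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of BioJava: indexes a FASTA file by byte offset. */
class FastaOffsetIndex {

    /** Scans the file once and records where each record's sequence data begins. */
    static Map<String, Long> index(String path) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    // After readLine() the file pointer sits on the first sequence line.
                    offsets.put(line.substring(1).trim(), raf.getFilePointer());
                }
            }
        }
        return offsets;
    }

    /** Seeks straight to one record and reads only that sequence into memory. */
    static String load(String path, long offset) throws IOException {
        StringBuilder seq = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset); // the step a plain InputStream cannot do
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                seq.append(line.trim());
            }
        }
        return seq.toString();
    }
}
```

Only the name-to-offset map stays in memory; each sequence is read when
asked for. RandomAccessFile.readLine() is unbuffered, so a real
implementation would probably want a buffered scan for the indexing pass.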

Thanks

Scooter

On 2/9/11 9:38 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk> wrote:

>Hi Scooter,
>
>Thanks for the quick feedback.
>
>Unfortunately, memory isn't the issue. I set my JVM to use a 2500MB max
>heap (the most I can get away with on my machine) and still encountered
>the same problem. A colleague's machine has the max heap essentially
>unbounded, and he still gets the same error. It seems to be something to
>do with the garbage collector removing temporary items from memory rather
>than the maximum available memory.
>http://stackoverflow.com/questions/1393486/what-means-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee
>
>Also the FastaReaderHelper.readFastaDNASequence method ran into the same
>problem.  I passed in an input stream rather than a file but I don't
>think that should cause the problem, should it?  Also I couldn't find an
>overloaded variant with the lazySequenceLoad signature.  Maybe I'm using
>an older version but I couldn't find it in the biojava3 API docs either.
>
>Any other ideas?
>
>Best regards,
>Dan
>
>
>>-----Original Message-----
>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>Sent: Wednesday, February 09, 2011 1:31 PM
>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA Sequences
>>
>>Daniel
>>
>>You have two options. The first is to run java -Xmx2048m (plus the rest
>>of your parameters), and the out-of-memory error will go away. The second
>>is a helper method I have that reads the FASTA file and lazy-loads each
>>sequence when you request it. If you call this method
>>
>>FastaReaderHelper.readFastaDNASequence(File f, boolean lazySequenceLoad)
>>
>>you will be able to load the entire FASTA file with minimal memory
>>requirements.
>>
>>Even though your FASTA file is size X on disk, when we load it into
>>memory each sequence position is represented by a Java object, so the
>>memory footprint will be larger.
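
A rough back-of-the-envelope sketch of that growth. The per-position costs
are assumptions, not measurements of BioJava: about one base per byte of the
file (ignoring headers and linefeeds), 2 bytes per Java char while the text
is held in a StringBuilder, and a 4- or 8-byte reference per position if
each base is held as an object reference in a list. FootprintEstimate is a
hypothetical name.

```java
/** Back-of-the-envelope heap estimate; the per-position costs are assumptions. */
class FootprintEstimate {

    /** Bytes needed when every base costs bytesPerBase on the heap. */
    static long estimateBytes(long bases, int bytesPerBase) {
        return bases * bytesPerBase;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long bases = 270L * mb; // roughly one base per byte of a 270MB file
        System.out.println("as char (2 bytes):   " + estimateBytes(bases, 2) / mb + " MB");
        System.out.println("as 4-byte reference: " + estimateBytes(bases, 4) / mb + " MB");
        System.out.println("as 8-byte reference: " + estimateBytes(bases, 8) / mb + " MB");
    }
}
```

Under those assumptions the estimates come out at 540 MB, 1080 MB and
2160 MB, which is in line with heap use several times the size of the file.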
>>
>>Let me know if you don't have that particular method in the jars you are
>>using. I'm not sure what the latest jar release is. If you look in the
>>biojava3-genome module you will find examples of working with DNA
>>sequences, such as translating proteins, assuming you have CDS features
>>to map onto your sequences.
>>
>>Thanks
>>
>>Scooter
>>
>>On 2/9/11 7:13 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk> wrote:
>>
>>>Hello,
>>>
>>>I'm trying to read a FASTA file that contains just over 4000 DNA
>>>sequences and is around 270MB in size. Each sequence starts like this:
>>>">SequenceName" followed by a linefeed. The actual DNA sequence data
>>>contains a linefeed every 40 characters or so.
>>>
>>>I want to read in the data into a LinkedHashMap object, similar to the
>>>example you specify in your cookbook:
>>>
>>>FileInputStream inStream = new FileInputStream(genomeFile);
>>>FastaReader<DNASequence, NucleotideCompound> fastaReader =
>>>        new FastaReader<DNASequence, NucleotideCompound>(
>>>                inStream,
>>>                new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
>>>                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
>>>
>>>try {
>>>    genomeData = fastaReader.process();
>>>} catch (Exception ex) { }
>>>
>>>This works on some files but not the one containing the 4000 sequences.
>>>I get an exception generated by the JVM:
>>>
>>>Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>>        at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>>        at org.biojava3.core.sequence.io.FastaReader.process(FastaReader.java:111)
>>>        at workbench.Process_Hits_Patman.openFastaFile(Process_Hits_Patman.java:414)
>>>        at workbench.Process_Hits_Patman.readFiles(Process_Hits_Patman.java:451)
>>>        at workbench.MirCat.openFile(MirCat.java:283)
>>>        at workbench.MainMDIWindow.openMenuItemActionPerformed(MainMDIWindow.java:252)
>>>        at workbench.MainMDIWindow.access$000(MainMDIWindow.java:25)
>>>        at workbench.MainMDIWindow$1.actionPerformed(MainMDIWindow.java:140)
>>>        at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>>>        at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>>>        at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>>>        at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>>>        at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>>>        at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1225)
>>>        at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1266)
>>>        at java.awt.Component.processMouseEvent(Component.java:6263)
>>>        at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>>>        at java.awt.Component.processEvent(Component.java:6028)
>>>        at java.awt.Container.processEvent(Container.java:2041)
>>>        at java.awt.Component.dispatchEventImpl(Component.java:4630)
>>>        at java.awt.Container.dispatchEventImpl(Container.java:2099)
>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>        at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4574)
>>>        at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>>>        at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>>>        at java.awt.Container.dispatchEventImpl(Container.java:2085)
>>>        at java.awt.Window.dispatchEventImpl(Window.java:2475)
>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>        at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
>>>        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>>>        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>>>
>>>The amount of memory on the system isn't an issue. I tried this on a
>>>machine with 12GB of RAM. It seems to be an issue with the garbage
>>>collector getting tired of deleting temporary objects! Also, I noticed
>>>that although the file is less than 300MB, the amount of heap space used
>>>increases from 100MB to over 900MB inside FastaReader.process before the
>>>exception occurs.
>>>
>>>Unfortunately I can't share the FASTA file that is causing the problem.
>>>
>>>Would it be possible for you guys to look into this and either produce a
>>>fix or suggest a workaround? Also, do you think there is some way to
>>>optimise the performance and memory usage of this process?
>>>
>>>Finally, I have a question about selectively loading sequences from a
>>>FASTA file, the idea being to reduce memory usage. Is it possible to do
>>>this using biojava, i.e. given a DNA sequence name, load only that
>>>sequence into memory? Or do we have to load the entire FASTA file into a
>>>LinkedHashMap each time?
>>>
>>>Thanks in advance for your help on this one,
>>>
>>>Best regards,
>>>Dr Daniel Mapleson (UEA)
>>>
>>>_______________________________________________
>>>biojava-dev mailing list
>>>biojava-dev at lists.open-bio.org
>>>http://lists.open-bio.org/mailman/listinfo/biojava-dev
>