[Biojava-dev] Bug when reading FASTA file with many DNA Sequences

Fri Feb 11 16:34:07 UTC 2011

Thanks Scooter, that's done the trick.

Best regards,
Dan

>-----Original Message-----
>From: Scooter Willis [mailto:HWillis at scripps.edu]
>Sent: Friday, February 11, 2011 2:10 PM
>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>Sequences
>
>Dan
>
>The FastaReaderHelper lazy-load method is shown below; the Helper
>concept is to hide the confusion that comes with flexibility. You can
>substitute the appropriate CompoundSet in your own code.
>
>You can also replace the GenericFastaHeaderParser with your own header
>parser if you have metadata in the accession line. This allows you to
>parse the values and set them on the sequence if you need to access
>them as part of your code.
>
>Let me know if you have any other issues.
>
>Thanks
>
>Scooter
>
>
>        FastaReader<DNASequence, NucleotideCompound> fastaProxyReader =
>            new FastaReader<DNASequence, NucleotideCompound>(
>                file,
>                new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
>                new FileProxyDNASequenceCreator(file, DNACompoundSet.getDNACompoundSet()));
>        return fastaProxyReader.process();
>
>
>
>On 2/10/11 10:15 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>wrote:
>
>>Hi Scooter,
>>
>>Thanks for all your help yesterday, that was much appreciated.  Sorry to
>>trouble you again, but while the modified jar you provided worked great
>>for the file I was working with yesterday, I have to process some other
>>files that contain ambiguous DNA nucleotides (particularly "Y").  I
>>noticed that readFastaDNASequence hardcodes the use of DNACompoundSet
>>rather than AmbiguityDNACompoundSet.  Is there any chance of getting
>>an overloaded version of readFastaDNASequence that allows you to
>>choose between the ambiguous and unambiguous compound sets?
>>
>>Best regards,
>>Dan
>>
>>>-----Original Message-----
>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>Sent: Wednesday, February 09, 2011 4:47 PM
>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>Sequences
>>>
>>>Dan
>>>
>>>You can check out the code via Subversion to get the latest and
>>>greatest. Our goal is to have minimal changes in biojava3-core, but
>>>modules that depend on core will change more frequently. We use Maven
>>>for building. If you are not using Maven but use NetBeans, then it
>>>should be easy to set up. It is easy for Eclipse as well, but I am not
>>>sure how much configuration is required. This way, if you have
>>>requirements, it is easier for me to make changes and check in code
>>>that you can then test.
>>>
>>>Scooter
>>>
>>>On 2/9/11 11:40 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>wrote:
>>>
>>>>Scooter,
>>>>
>>>>Yup, I'll keep that in mind going forward, although I suspect our use
>>>>case will involve going back to the sequence more than once in some
>>>>instances.
>>>>
>>>>However, readFastaDNASequence is pretty swift now, so we could manage
>>>>this to some extent by removing the hashmap and calling
>>>>readFastaDNASequence with lazySequenceLoad again if required.  Not
>>>>ideal, but it's a workaround for the case where we had to remove a
>>>>particular sequence from the hashmap to save memory and then later
>>>>realised we need it again.
>>>>
>>>>Is there some place on your wiki where I can view the latest changes
>>>>to the biojava jars?  I'd like to keep an eye on new functionality
>>>>that gets added.
>>>>
>>>>Best regards,
>>>>Dan
>>>>
>>>>
>>>>
>>>>>-----Original Message-----
>>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>>Sent: Wednesday, February 09, 2011 4:27 PM
>>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>>>Sequences
>>>>>
>>>>>Dan
>>>>>
>>>>>Glad that worked. If you only need to use a sequence once, then you
>>>>>could remove it from the hashmap (allowing GC), and that should keep
>>>>>your memory low.
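
The remove-after-use idea above can be sketched with plain JDK collections. This is only an illustration, not code from the thread or from BioJava: the class name ProcessOnce is hypothetical, and String values stand in for the DNASequence objects that the reader would return.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; String values stand in for DNASequence. */
class ProcessOnce {

    /**
     * Visits each entry once and removes it from the map straight away,
     * so the value becomes unreachable and the garbage collector can
     * reclaim it. Returns the total number of residues seen.
     */
    static long processAndRelease(Map<String, String> sequences) {
        long totalLength = 0;
        Iterator<Map.Entry<String, String>> it = sequences.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            totalLength += entry.getValue().length(); // stand-in for the real work
            it.remove(); // Iterator.remove is the safe way to delete while iterating
        }
        return totalLength;
    }

    public static void main(String[] args) {
        Map<String, String> sequences = new LinkedHashMap<String, String>();
        sequences.put("seq1", "ACGTAC");
        sequences.put("seq2", "GGYT");
        System.out.println(processAndRelease(sequences) + " residues processed");
        System.out.println("entries left: " + sequences.size());
    }
}
```

Removing through the iterator avoids the ConcurrentModificationException that calling remove on the map itself would cause inside the loop. Memory only drops if nothing else still holds a reference to the removed value.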
>>>>>
>>>>>Scooter
>>>>>
>>>>>On 2/9/11 11:22 AM, "Mapleson Daniel Dr (CMP)"
>>>>><D.Mapleson at uea.ac.uk> wrote:
>>>>>
>>>>>>Thanks Scooter.  That's great.  This jar fixes the problem.  The
>>>>>>file is loaded really quickly and memory usage has decreased
>>>>>>massively too.
>>>>>>
>>>>>>We will need to process each (or most) of the sequences in the file
>>>>>>at some stage, so the ability to free up memory containing sequences
>>>>>>that aren't currently being processed/used will be useful going
>>>>>>forward with our project.  It's not urgent though.  The main thing
>>>>>>from my perspective is that we can actually run the program, which
>>>>>>your fix allows us to do.
>>>>>>
>>>>>>Thanks again for the quick turnaround!  Much appreciated! :)
>>>>>>
>>>>>>Best regards,
>>>>>>Dan
>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>>>>Sent: Wednesday, February 09, 2011 3:50 PM
>>>>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many
>>>>>>>DNA Sequences
>>>>>>>
>>>>>>>Dan
>>>>>>>
>>>>>>>I have attached a copy of my biojava3-core that has that method in
>>>>>>>it, as well as other memory/speed optimizations I worked on. It
>>>>>>>sounds like that method (recently added) hasn't made its way into
>>>>>>>the current biojava3 jars. You should see a dramatic reduction in
>>>>>>>memory if you only need to select a subset of sequences. Trying to
>>>>>>>load a 245MB fasta file does take lots of memory. If you plan on
>>>>>>>reading each sequence, then you will eventually run into a memory
>>>>>>>problem, as I am currently not freeing up the sequence data that is
>>>>>>>loaded lazily. My plan is to add some optimization hints/logic that
>>>>>>>the developer can control, so that every time you load a new
>>>>>>>sequence and use more memory, I will internally free up sequence
>>>>>>>data that has been allocated. If you go back to a sequence that has
>>>>>>>had its storage deallocated, then I will simply reload it. This way
>>>>>>>you can work with very large sequence files at a genome scale
>>>>>>>without running out of memory or being forced to put them in a
>>>>>>>database.
>>>>>>>
>>>>>>>Let me know if this works and if you need to analyze every sequence,
>>>>>>>and I will see if I can find some time to add in the lazyload memory
>>>>>>>management features.
>>>>>>>
>>>>>>>Thanks
>>>>>>>
>>>>>>>Scooter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>On 2/9/11 10:07 AM, "Mapleson Daniel Dr (CMP)"
>>>>>>><D.Mapleson at uea.ac.uk> wrote:
>>>>>>>
>>>>>>>>I'm running Windows 7 Enterprise 64-bit and Java 1.6 64-bit.  I
>>>>>>>>double checked I was using the 64-bit version of Java at runtime
>>>>>>>>using JConsole.  I'm also using biojava3, in case that makes a
>>>>>>>>difference.
>>>>>>>>
>>>>>>>>I tried the FastaReaderHelper.readFastaDNASequence(File f) version
>>>>>>>>of the method, but I still haven't found the lazySequenceLoad
>>>>>>>>version.  Same problem.
>>>>>>>>
>>>>>>>>Best regards,
>>>>>>>>Dan
>>>>>>>>
>>>>>>>>>-----Original Message-----
>>>>>>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>>>>>>Sent: Wednesday, February 09, 2011 2:48 PM
>>>>>>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>>>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many
>>>>>>>>>DNA Sequences
>>>>>>>>>
>>>>>>>>>Dan
>>>>>>>>>
>>>>>>>>>I usually do 40MB DNA files with no problem. I will concatenate
>>>>>>>>>some together and test a 245MB version. What operating system and
>>>>>>>>>version of Java? 32bit or 64bit? The GC should be able to keep up
>>>>>>>>>in lazySequenceLoad mode.
>>>>>>>>>
>>>>>>>>>You need to use a File because an InputStream doesn't provide the
>>>>>>>>>ability to seek randomly to an offset, since it could be an HTTP
>>>>>>>>>stream etc.
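
To illustrate why random access matters for lazy loading, here is a minimal sketch using only the JDK. It is not the BioJava implementation: the class FastaOffsetIndex and its methods are hypothetical names, and the code assumes Java 7 or later and an ASCII FASTA file. It scans the file once to record the byte offset of every header line, then seeks directly to one record on demand, which a plain InputStream cannot do.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of BioJava. */
class FastaOffsetIndex {

    /** Scans the file once and records the byte offset of each '>' header line. */
    static Map<String, Long> indexHeaders(String path) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long lineStart = raf.getFilePointer();
            String line;
            // RandomAccessFile.readLine is unbuffered; fine for a sketch,
            // slow for files of several hundred megabytes.
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    offsets.put(line.substring(1).trim(), lineStart);
                }
                lineStart = raf.getFilePointer();
            }
        }
        return offsets;
    }

    /** Seeks straight to one record and reads only that record's residues. */
    static String loadSequence(String path, long headerOffset) throws IOException {
        StringBuilder residues = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(headerOffset);
            raf.readLine(); // skip the header line itself
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                residues.append(line.trim());
            }
        }
        return residues.toString();
    }
}
```

With this shape, only the name-to-offset map stays in memory, and a sequence is read from disk when it is first requested. Calling FastaOffsetIndex.loadSequence(path, index.get("SequenceName")) would return the residues for that single record.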
>>>>>>>>>
>>>>>>>>>Thanks
>>>>>>>>>
>>>>>>>>>Scooter
>>>>>>>>>
>>>>>>>>>On 2/9/11 9:38 AM, "Mapleson Daniel Dr (CMP)"
>>>>>>>>><D.Mapleson at uea.ac.uk> wrote:
>>>>>>>>>
>>>>>>>>>>Hi Scooter,
>>>>>>>>>>
>>>>>>>>>>Thanks for the quick feedback.
>>>>>>>>>>
>>>>>>>>>>Unfortunately, the memory isn't the issue.  I set my JVM to use
>>>>>>>>>>2500MB max heap (the most I can get away with on my machine) and
>>>>>>>>>>still encountered the same problem.  On a colleague's machine the
>>>>>>>>>>max heap is essentially unbounded and he still gets the same
>>>>>>>>>>error.  It seems to be something to do with the garbage collector
>>>>>>>>>>removing temporary items from memory rather than max available
>>>>>>>>>>memory.
>>>>>>>>>>http://stackoverflow.com/questions/1393486/what-means-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee
>>>>>>>>>>
>>>>>>>>>>Also, the FastaReaderHelper.readFastaDNASequence method ran into
>>>>>>>>>>the same problem.  I passed in an input stream rather than a file,
>>>>>>>>>>but I don't think that should cause the problem, should it?  Also,
>>>>>>>>>>I couldn't find an overloaded variant with the lazySequenceLoad
>>>>>>>>>>signature.  Maybe I'm using an older version, but I couldn't find
>>>>>>>>>>it in the biojava3 API docs either.
>>>>>>>>>>
>>>>>>>>>>Any other ideas?
>>>>>>>>>>
>>>>>>>>>>Best regards,
>>>>>>>>>>Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>-----Original Message-----
>>>>>>>>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>>>>>>>>Sent: Wednesday, February 09, 2011 1:31 PM
>>>>>>>>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>>>>>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many
>>>>>>>>>>>DNA Sequences
>>>>>>>>>>>
>>>>>>>>>>>Daniel
>>>>>>>>>>>
>>>>>>>>>>>You have two options. The first is to run java -Xmx2048m (the
>>>>>>>>>>>rest of your parameters) and the out of memory error will go
>>>>>>>>>>>away. The second is a Helper method I have that will read the
>>>>>>>>>>>fasta file and lazy load when you request a sequence. If you
>>>>>>>>>>>call this method
>>>>>>>>>>>
>>>>>>>>>>>FastaReaderHelper.readFastaDNASequence(File f, boolean lazySequenceLoad)
>>>>>>>>>>>
>>>>>>>>>>>you will be able to load the entire fasta file with minimal
>>>>>>>>>>>memory requirements.
>>>>>>>>>>>
>>>>>>>>>>>Even though your fasta file is of size X on disk, when we load it
>>>>>>>>>>>into memory each sequence position gets represented by a Java
>>>>>>>>>>>object, so the memory footprint will be larger.
>>>>>>>>>>>
>>>>>>>>>>>Let me know if you don't have that particular method in the jars
>>>>>>>>>>>you are using. I am not sure of the latest release of the jars.
>>>>>>>>>>>If you look in the biojava3-genome module you will find examples
>>>>>>>>>>>of working with DNA sequences to translate proteins etc.,
>>>>>>>>>>>assuming you have CDS features to map onto your sequences.
>>>>>>>>>>>
>>>>>>>>>>>Thanks
>>>>>>>>>>>
>>>>>>>>>>>Scooter
>>>>>>>>>>>
>>>>>>>>>>>On 2/9/11 7:13 AM, "Mapleson Daniel Dr (CMP)"
>>>>>>>>>>><D.Mapleson at uea.ac.uk> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>Hello,
>>>>>>>>>>>>
>>>>>>>>>>>>I'm trying to read a FASTA file that contains just over 4000 DNA
>>>>>>>>>>>>sequences and is around 270MB in size.  Each sequence starts like
>>>>>>>>>>>>this: ">SequenceName" followed by a linefeed.  The actual DNA
>>>>>>>>>>>>sequence data does contain a linefeed every 40 characters or so.
>>>>>>>>>>>>
>>>>>>>>>>>>I want to read the data into a LinkedHashMap object, similar to
>>>>>>>>>>>>the example you specify in your cookbook:
>>>>>>>>>>>>
>>>>>>>>>>>>FileInputStream inStream = new FileInputStream(genomeFile);
>>>>>>>>>>>>FastaReader<DNASequence, NucleotideCompound> fastaReader =
>>>>>>>>>>>>    new FastaReader<DNASequence, NucleotideCompound>(
>>>>>>>>>>>>        inStream,
>>>>>>>>>>>>        new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
>>>>>>>>>>>>        new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
>>>>>>>>>>>>
>>>>>>>>>>>>        try {
>>>>>>>>>>>>            genomeData = fastaReader.process();
>>>>>>>>>>>>        } catch (Exception ex) { }
>>>>>>>>>>>>
>>>>>>>>>>>>This works on some files but not the one containing the 4000
>>>>>>>>>>>>sequences.  I get an exception generated by the JVM:
>>>>>>>>>>>>
>>>>>>>>>>>>Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>>>>>>>        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>>>>>>>>>>>        at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>>>>>>>>>>>        at org.biojava3.core.sequence.io.FastaReader.process(FastaReader.java:111)
>>>>>>>>>>>>        at workbench.Process_Hits_Patman.openFastaFile(Process_Hits_Patman.java:414)
>>>>>>>>>>>>        at workbench.Process_Hits_Patman.readFiles(Process_Hits_Patman.java:451)
>>>>>>>>>>>>        at workbench.MirCat.openFile(MirCat.java:283)
>>>>>>>>>>>>        at workbench.MainMDIWindow.openMenuItemActionPerformed(MainMDIWindow.java:252)
>>>>>>>>>>>>        at workbench.MainMDIWindow.access$000(MainMDIWindow.java:25)
>>>>>>>>>>>>        at workbench.MainMDIWindow$1.actionPerformed(MainMDIWindow.java:140)
>>>>>>>>>>>>        at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>>>>>>>>>>>>        at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>>>>>>>>>>>>        at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>>>>>>>>>>>>        at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>>>>>>>>>>>>        at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>>>>>>>>>>>>        at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1225)
>>>>>>>>>>>>        at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1266)
>>>>>>>>>>>>        at java.awt.Component.processMouseEvent(Component.java:6263)
>>>>>>>>>>>>        at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>>>>>>>>>>>>        at java.awt.Component.processEvent(Component.java:6028)
>>>>>>>>>>>>        at java.awt.Container.processEvent(Container.java:2041)
>>>>>>>>>>>>        at java.awt.Component.dispatchEventImpl(Component.java:4630)
>>>>>>>>>>>>        at java.awt.Container.dispatchEventImpl(Container.java:2099)
>>>>>>>>>>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>>>>>>>>>        at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4574)
>>>>>>>>>>>>        at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>>>>>>>>>>>>        at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>>>>>>>>>>>>        at java.awt.Container.dispatchEventImpl(Container.java:2085)
>>>>>>>>>>>>        at java.awt.Window.dispatchEventImpl(Window.java:2475)
>>>>>>>>>>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>>>>>>>>>        at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
>>>>>>>>>>>>        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>>>>>>>>>>>>        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>>>>>>>>>>>>
>>>>>>>>>>>>The amount of memory on the system isn't an issue.  I tried this
>>>>>>>>>>>>on a machine with 12GB of RAM.  It seems to be an issue with the
>>>>>>>>>>>>garbage collector getting tired of deleting temporary objects!
>>>>>>>>>>>>Also, I noticed that although the file is less than 300MB in
>>>>>>>>>>>>size, the actual amount of heap space used increases from 100MB
>>>>>>>>>>>>to over 900MB in FastaReader.process before the exception occurs.
>>>>>>>>>>>>
>>>>>>>>>>>>Unfortunately I can't share the FASTA file that is causing the
>>>>>>>>>>>>problem.
>>>>>>>>>>>>
>>>>>>>>>>>>Would it be possible for you guys to look into this and either
>>>>>>>>>>>>produce a fix or suggest a workaround?  Also, do you think there
>>>>>>>>>>>>is some way to optimise the performance and memory usage of this
>>>>>>>>>>>>process?
>>>>>>>>>>>>
>>>>>>>>>>>>Finally, I have a question about selectively loading sequences
>>>>>>>>>>>>from a FASTA file, the idea being to reduce memory usage.  Is it
>>>>>>>>>>>>possible to do this using biojava?  i.e. given a DNA sequence
>>>>>>>>>>>>name, only load that sequence into memory?  Or do we have to load
>>>>>>>>>>>>the entire FASTA file into a LinkedHashMap each time?
>>>>>>>>>>>>
>>>>>>>>>>>>Thanks in advance for your help on this one,
>>>>>>>>>>>>
>>>>>>>>>>>>Best regards,
>>>>>>>>>>>>Dr Daniel Mapleson (UEA)
>>>>>>>>>>>>
>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>biojava-dev mailing list
>>>>>>>>>>>>biojava-dev at lists.open-bio.org
>>>>>>>>>>>>http://lists.open-bio.org/mailman/listinfo/biojava-dev