[Biojava-dev] Bug when reading FASTA file with many DNA Sequences

Jolyon Holdstock jolyon.holdstock at ogt.co.uk
Wed Feb 9 12:49:21 UTC 2011


Hi,

Have you set a value for the maximum Java heap size (the -Xmx option)?

When I get an OutOfMemoryError, that is the first thing I increase.
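
To confirm that an -Xmx setting has actually taken effect, the JVM can report its own heap ceiling. The following is a minimal sketch using only the standard Runtime API; the class name HeapCheck and the 2g value in the comment are made up for illustration and are not a recommendation for this particular file:

```java
class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Example invocation (value is illustrative): java -Xmx2g HeapCheck
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

The option has to be passed to the JVM that launches the application itself, so for a GUI program it usually belongs in the launcher script or the IDE run configuration.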

Cheers,

Jolyon


-----Original Message-----
From: Mapleson Daniel Dr (CMP) [mailto:D.Mapleson at uea.ac.uk] 
Sent: 09 February 2011 12:13
To: biojava-dev at lists.open-bio.org
Subject: [Biojava-dev] Bug when reading FASTA file with many DNA Sequences

Hello,

I'm trying to read a FASTA file that contains just over 4000 DNA sequences and is around 270MB in size.  Each sequence starts with a header line of the form ">SequenceName" followed by a linefeed.  The DNA sequence data itself contains a linefeed every 40 characters or so.

I want to read the data into a LinkedHashMap object, similar to the example in your cookbook:

FileInputStream inStream = new FileInputStream(genomeFile);
FastaReader<DNASequence, NucleotideCompound> fastaReader =
        new FastaReader<DNASequence, NucleotideCompound>(
                inStream,
                new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));

try {
    // process() reads every record in the file into a LinkedHashMap
    genomeData = fastaReader.process();
} catch (Exception ex) {
    // Report the failure instead of silently swallowing it
    ex.printStackTrace();
}

This works on some files but not on the one containing the 4000 sequences.  The JVM throws the following error:

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
        at java.lang.StringBuilder.<init>(StringBuilder.java:80)
        at org.biojava3.core.sequence.io.FastaReader.process(FastaReader.java:111)
        at workbench.Process_Hits_Patman.openFastaFile(Process_Hits_Patman.java:414)
        at workbench.Process_Hits_Patman.readFiles(Process_Hits_Patman.java:451)
        at workbench.MirCat.openFile(MirCat.java:283)
        at workbench.MainMDIWindow.openMenuItemActionPerformed(MainMDIWindow.java:252)
        at workbench.MainMDIWindow.access$000(MainMDIWindow.java:25)
        at workbench.MainMDIWindow$1.actionPerformed(MainMDIWindow.java:140)
        at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
        at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
        at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
        at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
        at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
        at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1225)
        at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1266)
        at java.awt.Component.processMouseEvent(Component.java:6263)
        at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
        at java.awt.Component.processEvent(Component.java:6028)
        at java.awt.Container.processEvent(Container.java:2041)
        at java.awt.Component.dispatchEventImpl(Component.java:4630)
        at java.awt.Container.dispatchEventImpl(Container.java:2099)
        at java.awt.Component.dispatchEvent(Component.java:4460)
        at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4574)
        at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
        at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
        at java.awt.Container.dispatchEventImpl(Container.java:2085)
        at java.awt.Window.dispatchEventImpl(Window.java:2475)
        at java.awt.Component.dispatchEvent(Component.java:4460)
        at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)

The amount of memory on the system isn't an issue: I tried this on a machine with 12GB of RAM.  It seems to be an issue with the garbage collector getting tired of deleting temporary objects!  I also noticed that although the file is less than 300MB, the heap space in use grows from 100MB to over 900MB inside FastaReader.process() before the exception occurs.

Unfortunately I can't share the FASTA file that is causing the problem.

Would it be possible for you guys to look into this and either produce a fix or suggest a workaround?  Also, do you think there is some way to optimise the performance and memory usage of this process?

Finally, I have a question about selectively loading sequences from a FASTA file, the idea being to reduce memory usage.  Is it possible to do this using biojava?  That is, given a DNA sequence name, can we load only that sequence into memory?  Or do we have to load the entire FASTA file into a LinkedHashMap each time?
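
To make the idea concrete, here is a rough sketch of the kind of selective loading meant here, written in plain Java rather than with any biojava call. The class and method names (FastaLookup, findSequence) are hypothetical, and the sketch assumes the header line is exactly ">" followed by the sequence name, as in the file described above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class FastaLookup {
    /**
     * Scans FASTA text line by line and returns the residues of the first
     * record whose header (the text after '>') equals name, or null if no
     * such record exists. Only the matching record is held in memory.
     */
    static String findSequence(Reader source, String name) {
        try (BufferedReader in = new BufferedReader(source)) {
            StringBuilder seq = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (seq != null) {
                        break; // reached the record after the match
                    }
                    if (line.substring(1).trim().equals(name)) {
                        seq = new StringBuilder();
                    }
                } else if (seq != null) {
                    seq.append(line.trim()); // join the wrapped lines
                }
            }
            return seq == null ? null : seq.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String fasta = ">seqA\nACGT\nACGT\n>seqB\nTTGA\n";
        System.out.println(findSequence(new StringReader(fasta), "seqB"));
    }
}
```

For a real file the StringReader would be replaced by a java.io.FileReader. Each lookup still scans the file from the start, so this trades time for memory.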

Thanks in advance for your help on this one,

Best regards,
Dr Daniel Mapleson (UEA)
