[Biojava-l] For BioJava List: Possible solution to Hibernate and slow Atom storage.

tallpaulinjax at yahoo.com tallpaulinjax at yahoo.com
Mon Mar 23 13:13:12 UTC 2009



Hi Andreas,
 
I saw this post on slow Atom storage using BioJava and Hibernate:
http://www.biojava.org/wiki/BioJava:CookBook:PDB:hibernate
I may have an acceptable work-around. I had the same problem, but 'flattened' the object structure I am using with Hibernate and got a 20x to 25x performance improvement. The problem was the number of objects being held in memory as the file was parsed, similar to how BioJava can run out of heap space. In my design, a PdbMeta record can have hundreds (or more) of ModelChainResidue objects, and each ModelChainResidue object can have dozens of AtomNorm records. Just loading a 300 KB PDB file into BioJava and then into my database could take 25 minutes! And I have 4,000 of these files to load!
 
So what I did is explained in this post:
http://forum.hibernate.org/viewtopic.php?p=2409385#2409385
 
Basically, per PdbMeta row I only kept around the currently needed ModelChainResidue object and AtomNorm object, and let any others be garbage collected. This meant I couldn't use one big session/transaction and had to split the work into separate transactions, but loading into my database became about 25x faster. (Hibernate doesn't support nested transactions, and I couldn't see a way within a transaction to remove an object from memory without subsequently deleting it from the database.) My plan is to add a Boolean 'useFastLoad' parameter to the method calls, which will turn this feature on and off as needed. Even so, with 4,000 PDB files to load, each one taking a minute or so on average to download, parse into BioJava, and then dump to my database (on my laptop for testing, moving to my server soon), the full load will still take almost 3 days.
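
A minimal sketch of the strategy described above, in plain Java with no Hibernate dependency. Everything here is an assumption made for illustration: the ResidueStore interface, the CountingStore test double, and the {residueId, atomName} record layout are hypothetical stand-ins, not Hibernate or BioJava APIs, and the sketch keeps all atoms of the current residue in memory rather than a single atom. It only shows the shape of the idea: with useFastLoad on, each finished residue is saved in its own transaction and then dropped, so the number of objects held at once stays small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FastLoadSketch {

    /** Hypothetical persistence interface; a stand-in, not a Hibernate API. */
    interface ResidueStore {
        void begin();
        void saveResidue(String residueId, List<String> atomNames);
        void commit();
    }

    /** Test double that counts calls instead of touching a database. */
    static class CountingStore implements ResidueStore {
        int transactions;
        int residuesSaved;
        int atomsSaved;
        public void begin() { transactions++; }
        public void saveResidue(String residueId, List<String> atomNames) {
            residuesSaved++;
            atomsSaved += atomNames.size();
        }
        public void commit() { /* nothing to do in memory */ }
    }

    /**
     * Loads {residueId, atomName} records, assumed to arrive grouped by residue.
     * Returns the peak number of atoms held in memory at any one time.
     */
    static int load(List<String[]> records, ResidueStore store, boolean useFastLoad) {
        Map<String, List<String>> pending = new LinkedHashMap<>();
        String current = null;
        int held = 0;
        int peakHeld = 0;
        for (String[] rec : records) {
            String residueId = rec[0];
            // Fast path: when a new residue starts, save the finished one
            // in its own transaction and drop the references to it.
            if (useFastLoad && current != null && !current.equals(residueId)) {
                flush(pending, store);
                held = 0;
            }
            current = residueId;
            pending.computeIfAbsent(residueId, k -> new ArrayList<>()).add(rec[1]);
            held++;
            peakHeld = Math.max(peakHeld, held);
        }
        // Slow path saves everything here, in one big transaction.
        flush(pending, store);
        return peakHeld;
    }

    /** Saves all pending residues in one transaction, then forgets them. */
    private static void flush(Map<String, List<String>> pending, ResidueStore store) {
        if (pending.isEmpty()) {
            return;
        }
        store.begin();
        for (Map.Entry<String, List<String>> e : pending.entrySet()) {
            store.saveResidue(e.getKey(), e.getValue());
        }
        store.commit();
        pending.clear();
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"A1", "N"}, new String[] {"A1", "CA"},
                new String[] {"A2", "N"}, new String[] {"A2", "CA"}, new String[] {"A2", "C"});
        CountingStore store = new CountingStore();
        int peak = load(records, store, true);
        System.out.println("transactions=" + store.transactions + ", peak atoms held=" + peak);
    }
}
```

The trade-off is the one noted above: the fast path gives up a single all-or-nothing transaction per file in exchange for a much smaller working set, so a failure partway through can leave a partially loaded PdbMeta entry that the caller has to detect and clean up.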
 
Perhaps BioJava could use the same strategy with Hibernate?
 
Paul
 
PS: Here is some background on what I am using BioJava for:
 
I am using BioJava to parse PDB files, then converting from BioJava's object structure to one more specific to my needs. I am working with two chemists at the University of North Florida on this project, which supports my master's thesis in C.S. I have attached a preliminary schema (hopefully it won't inflame the spam filter :-) ). I just added the PeriodicTable table last night and have to adjust the AtomNorm and AtomDenorm tables accordingly. Basically, the schema is as follows:
 
1. We imported a high-level list of over 4,000 "representative sample" PDB files into the RepresentativeSample table. This will be used as part of the basis to start filling the PdbMeta, ModelChainResidue, and AtomNorm tables.
2. There may be more of these representative sample lists in the future, so each batch of imports has an entry in RepresentativeSampleMeta.
3. Each PdbMeta entry is unique by PDB Code, DepositionDate, and ModificationDate. 
4. A PdbMeta entry can have 0 or more child ModelChainResidue records (usually hundreds).
5. A ModelChainResidue record can have dozens of child AtomNorm records.
6. For data mining purposes and faster joins, pertinent info from PdbMeta, ModelChainResidue, and AtomNorm is dumped into AtomDenorm.
7. The "Lkp" tables are merely 'helper' tables whose number of records and field entries are expected to remain static.
8. The Error table is where any errors found in the data are dumped by the Java program, by table name and then primary key within that table.
9. MethylDonatedHydrogen table: one of the key areas of interest for the UNF Chemists, including Dr. Robert Vergenz.
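
The parent/child layout in the list above (PdbMeta, then ModelChainResidue, then AtomNorm, flattened into AtomDenorm) can be sketched roughly as follows. This is an illustration only: the real columns are in the attached schema, so every field name here (atomName, x, y, z, residueNumber, and so on), the use of strings for dates, and the uniqueKey() helper are assumptions, not the actual design.

```java
import java.util.ArrayList;
import java.util.List;

class DenormSketch {

    /** One atom; field names are hypothetical. */
    static class AtomNorm {
        final String atomName;
        final double x, y, z;
        AtomNorm(String atomName, double x, double y, double z) {
            this.atomName = atomName; this.x = x; this.y = y; this.z = z;
        }
    }

    /** One residue within a model and chain, with its child atoms. */
    static class ModelChainResidue {
        final int model;
        final String chain;
        final int residueNumber;
        final String residueName;
        final List<AtomNorm> atoms = new ArrayList<>();
        ModelChainResidue(int model, String chain, int residueNumber, String residueName) {
            this.model = model; this.chain = chain;
            this.residueNumber = residueNumber; this.residueName = residueName;
        }
    }

    /** One PDB entry, unique by code plus deposition and modification dates. */
    static class PdbMeta {
        final String pdbCode, depositionDate, modificationDate;
        final List<ModelChainResidue> residues = new ArrayList<>();
        PdbMeta(String pdbCode, String depositionDate, String modificationDate) {
            this.pdbCode = pdbCode;
            this.depositionDate = depositionDate;
            this.modificationDate = modificationDate;
        }
        String uniqueKey() {
            return pdbCode + "|" + depositionDate + "|" + modificationDate;
        }
    }

    /** Flat row: parent fields are copied down so queries need no joins. */
    static class AtomDenorm {
        final String pdbKey, chain, residueName, atomName;
        final int model, residueNumber;
        final double x, y, z;
        AtomDenorm(PdbMeta m, ModelChainResidue r, AtomNorm a) {
            this.pdbKey = m.uniqueKey();
            this.model = r.model; this.chain = r.chain;
            this.residueNumber = r.residueNumber; this.residueName = r.residueName;
            this.atomName = a.atomName; this.x = a.x; this.y = a.y; this.z = a.z;
        }
    }

    /** Produces one flat row per atom, in residue order then atom order. */
    static List<AtomDenorm> denormalize(PdbMeta meta) {
        List<AtomDenorm> rows = new ArrayList<>();
        for (ModelChainResidue residue : meta.residues) {
            for (AtomNorm atom : residue.atoms) {
                rows.add(new AtomDenorm(meta, residue, atom));
            }
        }
        return rows;
    }
}
```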
 
(BTW, I can't figure out how to get the RFactor out of BioJava, and apparently BioJava is removing 'Unknown amino acids' before I have a chance to parse them and add them to my Error table as well. Are there solutions to both of those problems? Does BioJava have a "hasErrors" field somewhere that is set during parsing?)
 
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OverviewSchema.pdf
Type: application/pdf
Size: 89251 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20090323/dcabd576/attachment-0002.pdf>

