[Biojava-l] For BioJava List: Possible solution to Hibernate and slow Atom storage.

Andreas Prlic andreas at sdsc.edu
Fri Oct 2 23:43:40 UTC 2009


Hi Paul,
Your mail only made it through to the mailing list today, since it contained
an attachment....

Given the IO constraints, it is probably easier to keep loading Atom records
from the flat files rather than serialising them to a database. It should
be a minor change to add a "header only" flag to the parsers... will do over
the weekend...
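
As a rough illustration of what a "header only" flag could mean, here is a minimal sketch that stops reading a PDB flat file at the first coordinate record (ATOM, HETATM or MODEL). It is not the BioJava parser and does not use its API; the class name HeaderOnlyPdbReader, the method readRecords and the headerOnly parameter are hypothetical names chosen for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not BioJava code: collects PDB records and, when
// headerOnly is set, stops before the coordinate section is read.
class HeaderOnlyPdbReader {

    /** True for record types that open the coordinate section of a PDB file. */
    static boolean isCoordinateRecord(String line) {
        return line.startsWith("ATOM")
                || line.startsWith("HETATM")
                || line.startsWith("MODEL");
    }

    /** Returns the records read; with headerOnly, reading ends at the first coordinate record. */
    static List<String> readRecords(Reader in, boolean headerOnly) {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (headerOnly && isCoordinateRecord(line)) {
                    break; // skip all Atom records and everything after them
                }
                records.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String pdb = "HEADER    EXAMPLE\nTITLE     TOY ENTRY\nATOM      1  N   MET A   1\nEND\n";
        System.out.println(readRecords(new StringReader(pdb), true).size() + " header records");
        System.out.println(readRecords(new StringReader(pdb), false).size() + " records in total");
    }
}
```

Because the reader breaks out of the loop instead of filtering, the bulk of a large file is never read or turned into objects when only the header is wanted.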

Andreas




On Mon, Mar 23, 2009 at 6:06 AM, <tallpaulinjax at yahoo.com> wrote:

>
>
> Hi Andreas,
>
> I saw this post on slow Atom storage using BioJava and Hibernate:
> http://www.biojava.org/wiki/BioJava:CookBook:PDB:hibernate
> I may have an acceptable work-around. I had the same problem, but have
> 'flattened' the object structure I am using with Hibernate and got a 20x to
> 25x performance improvement. The problem was the number of objects being
> held in memory as the file was parsed, similar to how BioJava can run out of
> heap space. In my design, a PdbMeta record can have hundreds or more
> ModelChainResidue objects, and each ModelChainResidue object can have
> dozens of AtomNorm records. Just loading a 300 KB PDB file into BioJava and
> then into my database could take 25 minutes! And I have 4,000 of these files
> to load!
>
> So what I did is explained in this post:
> http://forum.hibernate.org/viewtopic.php?p=2409385#2409385
>
> Basically, per PdbMeta row I only kept around the currently needed
> ModelChainResidue object and AtomNorm object, and let any others be garbage
> collected. This meant I couldn't use one big session/transaction and had to
> split the work into separate transactions, but loading into my database
> became 25x faster (Hibernate doesn't support nested transactions, and I
> couldn't see a way within a transaction to remove an object from memory
> without subsequently deleting it from the database). My plan is to include a
> Boolean 'useFastLoad' parameter in the method calls to turn this feature on
> and off as needed. Even so, with 4,000 PDB files to load, each taking a
> minute or so on average to download, parse into BioJava, and dump to my
> database (on my laptop for testing, moving to my server soon), the full load
> will still take almost 3 days.
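
The strategy described above (one small transaction per ModelChainResidue instead of one big one) can be sketched without any Hibernate dependency. Everything below is an assumption made for illustration: UnitOfWork is a hypothetical stand-in for a persistence session, not a Hibernate type, and atoms are plain strings. Only the 'useFastLoad' flag comes from the message. For what it is worth, Hibernate's Session does have flush(), clear() and evict(Object), which detach objects from the session without deleting their rows; whether they would remove the need for separate transactions here has not been tested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of chunked loading; no Hibernate classes are used.
class ChunkedLoader {

    /** Stand-in for a persistence session (an assumption, not a real API). */
    interface UnitOfWork {
        void begin();
        void save(Object entity);
        void commitAndClear(); // commit, then drop references to saved entities
    }

    /** In-memory implementation that records how many entities were live at once. */
    static class CountingUnitOfWork implements UnitOfWork {
        private final List<Object> pending = new ArrayList<>();
        int transactions = 0;
        int committed = 0;
        int peakPending = 0;

        public void begin() { transactions++; }

        public void save(Object entity) {
            pending.add(entity);
            peakPending = Math.max(peakPending, pending.size());
        }

        public void commitAndClear() {
            committed += pending.size();
            pending.clear();
        }
    }

    /** With useFastLoad, each residue's atoms get their own transaction. */
    static void load(List<List<String>> atomsByResidue, UnitOfWork uow, boolean useFastLoad) {
        if (!useFastLoad) uow.begin();
        for (List<String> atoms : atomsByResidue) {
            if (useFastLoad) uow.begin();
            for (String atom : atoms) uow.save(atom);
            if (useFastLoad) uow.commitAndClear();
        }
        if (!useFastLoad) uow.commitAndClear();
    }

    public static void main(String[] args) {
        List<List<String>> residues = Arrays.asList(
                Arrays.asList("N", "CA"), Arrays.asList("C", "O"), Arrays.asList("CB", "CG"));
        CountingUnitOfWork fast = new CountingUnitOfWork();
        load(residues, fast, true);
        System.out.println("fast load peak objects held: " + fast.peakPending);
        CountingUnitOfWork slow = new CountingUnitOfWork();
        load(residues, slow, false);
        System.out.println("single transaction peak objects held: " + slow.peakPending);
    }
}
```

The point of the sketch is the peak number of pending objects: with the flag on it is bounded by the size of one residue, with the flag off it grows with the whole file.
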
>
> Perhaps BioJava could use the same strategy with Hibernate?
>
> Paul
>
> PS: Here is some background on what I am using BioJava for:
>
> I am using BioJava to parse PDB files, then converting from BioJava's
> object structure to one more specific to my needs. I am working with two
> chemists at the University of North Florida on this project, which supports
> my master's thesis in C.S. I have attached a preliminary schema (hopefully
> it won't inflame the spam filter :-) ). I just added the PeriodicTable table
> last night and have to adjust the AtomNorm and AtomDenorm tables
> accordingly. Basically, the schema is as follows:
>
> 1. We imported a high-level list of over 4,000 "representative sample" PDB
> files into the RepresentativeSample table. This will be used as part of the
> basis to start filling the PdbMeta, ModelChainResidue, and AtomNorm tables.
> 2. There may be more of these representative sample lists in the future, so
> each batch of imports has an entry in RepresentativeSampleMeta.
> 3. Each PdbMeta entry is unique by PDB Code, DepositionDate, and
> ModificationDate.
> 4. A PdbMeta entry can have 0 or more child ModelChainResidue records
> (usually hundreds).
> 5. A ModelChainResidue record can have dozens of child AtomNorm records.
> 6. For data mining purposes and join improvements, pertinent info from
> PdbMeta, ModelChainResidue, and AtomNorm is dumped into AtomDenorm.
> 7. The "Lkp" tables are merely 'helper' lookup tables whose record counts
> and field values are expected to remain static.
> 8. The Error table is where any errors found in the data are dumped by the
> Java program, by table name and then primary key within that table.
> 9. MethylDonatedHydrogen table: one of the key areas of interest for the
> UNF chemists, including Dr. Robert Vergenz.
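
The parent/child part of the schema in points 3 to 5 can be sketched as plain Java classes. The attached schema itself is not reproduced in this message, so this is only an assumption about its shape: the class names and the three-part PdbMeta key come from the list above, while every field name, the use of strings for dates, and the helper class PdbSchemaSketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the PdbMeta -> ModelChainResidue -> AtomNorm hierarchy.
class PdbSchemaSketch {

    static class AtomNorm {
        final String atomName; // hypothetical field
        AtomNorm(String atomName) { this.atomName = atomName; }
    }

    static class ModelChainResidue {
        final String residueName; // hypothetical field
        final List<AtomNorm> atoms = new ArrayList<>(); // dozens per residue
        ModelChainResidue(String residueName) { this.residueName = residueName; }
    }

    static class PdbMeta {
        // Point 3: an entry is unique by PDB code, deposition date and modification date.
        final String pdbCode;
        final String depositionDate;
        final String modificationDate;
        final List<ModelChainResidue> residues = new ArrayList<>(); // 0 or more children

        PdbMeta(String pdbCode, String depositionDate, String modificationDate) {
            this.pdbCode = pdbCode;
            this.depositionDate = depositionDate;
            this.modificationDate = modificationDate;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof PdbMeta)) return false;
            PdbMeta that = (PdbMeta) other;
            return Objects.equals(pdbCode, that.pdbCode)
                    && Objects.equals(depositionDate, that.depositionDate)
                    && Objects.equals(modificationDate, that.modificationDate);
        }

        @Override
        public int hashCode() {
            return Objects.hash(pdbCode, depositionDate, modificationDate);
        }
    }

    public static void main(String[] args) {
        PdbMeta entry = new PdbMeta("1ABC", "2001-01-01", "2005-06-30");
        ModelChainResidue residue = new ModelChainResidue("MET");
        residue.atoms.add(new AtomNorm("CA"));
        entry.residues.add(residue);
        System.out.println(entry.pdbCode + " has " + entry.residues.size() + " residue(s)");
    }
}
```

Equality is defined on the three key fields only, so two entries that differ just in their child residues still count as the same PdbMeta row.
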
>
> (BTW, I can't figure out how to get the RFactor out of BioJava, and
> apparently BioJava is removing 'Unknown amino acids' before I have a chance
> to parse them and add them to my Error table as well... solutions to both
> those problems? Does BioJava somewhere have a "hasErrors" field based on
> parsing?)
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>


