[Biojava-l] New To BioJava.org, right in the question.

Dunarel Badescu dunarel at gmx.net
Sat Nov 18 02:48:41 UTC 2006


My name is Dunarel Badescu. I am a student in the Graduate Diploma in Bioinformatics program at UQAM in Montreal, Quebec, Canada.

Currently I am using BioJavaX and BioSQL for my term project.

I have parsed NCBI GenBank files using RichSequences.
I then inserted the sequences into the database.

Several problems arose:

1) A bug in the code pulled from CVS:
In the class BioSQLRichObjectBuilder, I had to add some code so that the program finds the right constructors:

// Get the results
Object result = this.uniqueResult.invoke(query, null);
// Return the found object, if found
if (result!=null) return result;
// Create, persist and return the new object otherwise
else {

if (SimpleDocRef.class.isAssignableFrom(clazz)) {
// convert String to List constructor 
// Load the class

2) Many memory problems: after inserting 800 sequences the program slows down dramatically.
I suspected a Hibernate cache problem and tried to turn the cache off by setting some parameters:
        <property name="hibernate.jdbc.batch_size">20</property>
        <property name="hibernate.cache.enabled">false</property>
        <property name="hibernate.cache.use_query_cache">false</property>
        <property name="hibernate.cache.use_second_level_cache">false</property>
        <property name="hibernate.connection.aggressive_release">false</property> 
        <property name="cache.provider_class">org.hibernate.cache.NoCacheProvider</property>
        <property name="cache.use_query_cache">false</property>
        <property name="cache.use_minimal_puts">false</property>
        <property name="max_fetch_depth">3</property> 
but without much benefit.

Then I observed a small performance gain by using:
  session.save("Sequence",rs);  // persist the sequence
and then:
  session.evict(rs);
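
For comparison, a minimal sketch of the periodic flush-and-clear pattern often suggested for Hibernate bulk inserts, as an alternative to evicting each object individually. Whether it cures the slowdown described here is untested. The class BatchedInsert, the method insertAll and its parameters are hypothetical names for illustration, and the save and flush steps are passed in as callbacks so the sketch does not depend on Hibernate itself.

```java
import java.util.function.Consumer;

public class BatchedInsert {

    /**
     * Saves every item and runs flushAndClear after each full batch,
     * plus once more at the end if a partial batch remains.
     * Returns the number of items saved.
     */
    public static <T> int insertAll(Iterable<T> items, int batchSize,
                                    Consumer<T> save, Runnable flushAndClear) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int count = 0;
        for (T item : items) {
            save.accept(item);          // e.g. session.save("Sequence", rs)
            count++;
            if (count % batchSize == 0) {
                flushAndClear.run();    // e.g. session.flush(); session.clear();
            }
        }
        if (count % batchSize != 0) {
            flushAndClear.run();        // flush the final, partial batch
        }
        return count;
    }
}
```

With a Hibernate session this could be called with a save callback that invokes session.save("Sequence", rs) and a flush callback that invokes session.flush() followed by session.clear(), using a batch size equal to hibernate.jdbc.batch_size (20 in the configuration above).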

Any attempt to deallocate memory by closing the session or the session factory either generates errors immediately or generates them on reopening.

So, as a last resort, I split the original file of approx. 130 MB, containing one taxon from NCBI, into 38 files of 1000 sequences each, and wrote a DOS batch script that runs the program from the command line once for each file.
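
A minimal sketch of how such a split might be done, assuming each GenBank record ends with a line containing only "//". The class GenBankSplitter, the method split, and the output naming scheme are hypothetical and are not the tool actually used here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.IntFunction;

public class GenBankSplitter {

    /**
     * Copies records from source into chunks of at most recordsPerChunk
     * records each. openChunk receives the zero-based chunk index and
     * returns the Writer for that chunk. Returns the number of chunks.
     */
    public static int split(Reader source, int recordsPerChunk,
                            IntFunction<Writer> openChunk) {
        if (recordsPerChunk <= 0) {
            throw new IllegalArgumentException("recordsPerChunk must be positive");
        }
        int chunks = 0;
        int recordsInChunk = 0;
        Writer out = null;
        try {
            BufferedReader in = new BufferedReader(source);
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null) {
                    // Skip blank lines before the first record of a chunk.
                    if (line.trim().isEmpty()) continue;
                    out = openChunk.apply(chunks);
                    chunks++;
                    recordsInChunk = 0;
                }
                out.write(line);
                out.write('\n');
                if (line.trim().equals("//")) {       // end of one record
                    recordsInChunk++;
                    if (recordsInChunk == recordsPerChunk) {
                        out.close();
                        out = null;
                    }
                }
            }
            if (out != null) {
                out.close();                          // last, partial chunk
                out = null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Only reached with an open writer if an error occurred above.
            if (out != null) {
                try { out.close(); } catch (IOException ignored) { }
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: input GenBank file; args[1]: prefix for the chunk files.
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]),
                                                 StandardCharsets.ISO_8859_1)) {
            int n = split(in, 1000, i -> {
                try {
                    return Files.newBufferedWriter(
                            Paths.get(args[1] + "_" + i + ".gb"),
                            StandardCharsets.ISO_8859_1);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            System.out.println("Wrote " + n + " chunk files");
        }
    }
}
```

Reading line by line keeps memory use independent of the input size, so the 130 MB file never has to be held in memory at once.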
That way it works, but:

3) Inserting rows sometimes generates exceptions in the references table.
After looking at it more closely, I found that disabling the unique constraint on dbxref_id in the references table solves all the remaining problems.

The comment about it in the original code is:
-- No two references can reference the same reference database entry 
-- (dbxref_id). This is where the MEDLINE id goes: PUBMED:123456. 
and the modification is:

--UNIQUE ( dbxref_id ) , 

I should mention that the script for creating the BioSQL schema is version 1.29 from CVS, the most recent I found.

I should also mention that, to run the script on PostgreSQL 8.1.5, I had to modify each CREATE TABLE statement by adding WITH OIDS at the end, since 8.1.5 does not create OIDs by default.

There must be a more elegant approach to all these problems, mustn't there?

At the very least, regarding the constraint: is it supposed to exist, or not, as it seems?

I wish you all the best, and thank you for your work, which is a most useful and scarce resource.
