[Biojava-l] BioSQL successes, failures and questions

Mon Jun 14 15:35:37 EDT 2004

Hi everyone,
I have finally got BioSQL 1.38 (MySQL) working with BioJava
(biojava-20040608.jar). I hope my frustrating learning experience helps other
new developers a little. I would deeply appreciate it if any experienced
developers could answer my new questions.

Before you start, you have to create a database and the user you will later
use to connect to it. Please refer to the MySQL manual if you don't know how
to do this.

  1. Run the standard BioSQL 1.38 schema to create an empty sequence
database. I changed the unique key on term.name (UNIQUE `name`
(`name`,`ontology_id`)) to a plain key (KEY `name` (`name`,`ontology_id`))
because I later got duplicate-value exceptions when I tried to upload
sequences. Any suggestions on this?

  2. Run the CREATE TABLE statement for the "term_relationship_term" table.
Without this table, the BioJava package throws exceptions. See ATTACHMENT 1
at the end of this message.

  3. Create the "biodatabase" entry. For example, if you are uploading Genbank
files, you need to create an entry in the biodatabase table with the value
"genbank" in the "name" column.

  4. If you are trying to upload flat files such as Genbank files, try
ATTACHMENT 2, which is derived from the demo code UploadFlat.java. If the new
BioSQLSequenceDB() line does not work, try the other constructor, which takes
a "dbDriver" parameter. I put all required jar files in
JDK_ROOT/jre/lib/ext so that the JVM loads them automatically. BioJava needs a
lot of RAM. I suggest setting -Xmx to about 10 times the size of the Genbank
file. For example, for a 40 MB Genbank file, use "java -Xmx400m UploadFlat
your_file_full_path".
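
The 10x rule of thumb in step 4 can be turned into a small helper. This is
only a sketch: the class name HeapSizeHint and the rounding up to whole
megabytes are my own assumptions, and the factor of 10 is just the estimate
given above, not a measured requirement of BioJava.

```java
import java.io.File;

// Hypothetical helper (not part of BioJava): suggests a -Xmx value
// from the size of the flat file to be uploaded.
class HeapSizeHint {
    // Rule of thumb from the message: heap of about 10x the file size.
    static final int HEAP_FACTOR = 10;

    /** Returns a suggested -Xmx value in megabytes for a file of the given size in bytes. */
    static long suggestedHeapMb(long fileSizeBytes) {
        long bytesPerMb = 1024L * 1024L;
        // Round the file size up to a whole number of megabytes.
        long fileMb = (fileSizeBytes + bytesPerMb - 1) / bytesPerMb;
        return fileMb * HEAP_FACTOR;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java HeapSizeHint <genbank_file>");
            return;
        }
        File f = new File(args[0]);
        // Print a command line for the UploadFlat class from ATTACHMENT 2.
        System.out.println("java -Xmx" + suggestedHeapMb(f.length())
            + "m UploadFlat " + f.getPath());
    }
}
```

For a file of exactly 40 MB this suggests -Xmx400m, matching the example in
step 4.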

I have just uploaded the 5 Genbank genome files of Arabidopsis thaliana and it
works pretty well. Although digging information out of the tables needs a lot
of joins (especially with the term table), it is still much easier than
writing your own code. So far, I'm happy with it.

Questions:

1. How can I quickly pull specific pieces of sequence from the biosequence
table? I tried joining the seqfeature, seqfeature_qualifier_value, location,
biosequence and term tables to retrieve all gene sequences. This turned out to
be impractical because the substring function is extremely slow when applied
to the 'seq' column. The following SQL takes about 20 seconds on a dual
2.6 GHz Xeon Dell PowerEdge 2650 server, and the time required varies with the
gene locations.

SELECT t1.seqfeature_id, t1.bioentry_id, t2.start_pos, t2.end_pos, t2.strand,
       t4.value AS locus_tag,
       -- SUBSTRING takes (string, start, length), so pass the feature length
       SUBSTRING(t6.seq, t2.start_pos, t2.end_pos - t2.start_pos + 1) AS seq
FROM seqfeature t1
INNER JOIN location t2 ON t1.seqfeature_id = t2.seqfeature_id
INNER JOIN term t3 ON t1.type_term_id = t3.term_id
INNER JOIN seqfeature_qualifier_value t4 ON t1.seqfeature_id = t4.seqfeature_id
INNER JOIN term t5 ON t4.term_id = t5.term_id
INNER JOIN biosequence t6 ON t1.bioentry_id = t6.bioentry_id
WHERE t3.name = 'gene' AND t5.name = 'locus_tag'
LIMIT 2
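
One possible workaround, which I have not tested against this database, is to
fetch each 'seq' value once per bioentry and cut out the features on the
client side. MySQL's SUBSTRING(str, pos, len) takes a length as its third
argument, so a feature spanning start_pos..end_pos has length
end_pos - start_pos + 1. The sketch below shows that arithmetic in Java. It
assumes that start_pos and end_pos are 1-based and inclusive and that a strand
of -1 means the reverse strand; the class FeatureSlice and its methods are
hypothetical names, not BioJava API.

```java
// Hypothetical helper (not part of BioJava): cuts a feature out of a
// sequence string that has already been read from biosequence.seq.
class FeatureSlice {

    /**
     * Returns the bases from startPos to endPos (1-based, inclusive).
     * For strand -1 the reverse complement is returned (an assumption
     * about how the strand column should be interpreted).
     */
    static String slice(String seq, int startPos, int endPos, int strand) {
        if (startPos < 1 || endPos < startPos || endPos > seq.length()) {
            throw new IllegalArgumentException(
                "Bad range " + startPos + ".." + endPos
                + " for sequence of length " + seq.length());
        }
        // Java substring is 0-based with an exclusive end index.
        String sub = seq.substring(startPos - 1, endPos);
        return strand == -1 ? reverseComplement(sub) : sub;
    }

    /** Reverses the string and complements each base. */
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }

    /** Complements A, C, G, T in either case; other symbols are kept as-is. */
    static char complement(char c) {
        switch (c) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:  return c;
        }
    }

    public static void main(String[] args) {
        String seq = "ACGTACGT";
        System.out.println(slice(seq, 2, 4, 1));   // forward strand
        System.out.println(slice(seq, 2, 4, -1));  // reverse strand
    }
}
```

Whether this is actually faster than doing the substring in SQL depends on how
many features share each bioentry, so treat it as something to measure rather
than a guaranteed fix.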

=========================================
ATTACHMENT 1
=========================================
CREATE TABLE `term_relationship_term` (
  `term_relationship_id` int(11) NOT NULL default '0',
  `term_id` int(11) NOT NULL default '0',
  PRIMARY KEY  (`term_relationship_id`,`term_id`),
  UNIQUE KEY `term_relationship_id` (`term_relationship_id`),
  UNIQUE KEY `term_id` (`term_id`)
) TYPE=InnoDB;
========================================

=========================================
ATTACHMENT 2
=========================================
import java.io.*;

import org.biojava.bio.*;
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.seq.io.*;
import org.biojava.bio.seq.db.*;
import org.biojava.bio.seq.db.biosql.*;
import org.biojava.bio.taxa.*;

public class UploadFlat {
    public static void main(String [] args) {
        try {
            String dbDriver = "com.mysql.jdbc.Driver";
            String dbURL =
                "jdbc:mysql://mysql_server_ip_address_or_domain_name/biosql_database_name";
            String dbUser = "database_user";
            String dbPass = "password";
            // Must match the "name" value of the biodatabase entry (step 3)
            String bioDB = "biodatabase.name";
            // You can change this or read it from the command line
            String format = "genbank";

            // Tell the JDBC DriverManager which driver class to load
            System.setProperty("jdbc.drivers", dbDriver);
            SequenceDB seqDB = new BioSQLSequenceDB(
            dbURL,
            dbUser,
            dbPass,
            bioDB,
            false
            );

            SequenceFormat sFormat;
            SequenceBuilderFactory sbFact;
            Alphabet alpha;

            if ("embl".equalsIgnoreCase(format)) {
                sFormat = new EmblLikeFormat();
                sbFact = new EmblProcessor.Factory(SimpleSequenceBuilder.FACTORY);
                alpha = DNATools.getDNA();
            } else if ("swissprot".equalsIgnoreCase(format)) {
                sFormat = new EmblLikeFormat();
                sbFact = new SwissprotProcessor.Factory(SimpleSequenceBuilder.FACTORY);
                alpha = ProteinTools.getAlphabet();
            } else if ("fasta".equalsIgnoreCase(format)) {
                sFormat = new FastaFormat();
                sbFact = new FastaDescriptionLineParser.Factory(SimpleSequenceBuilder.FACTORY);
                alpha = DNATools.getDNA();
            } else if ("fasta-protein".equalsIgnoreCase(format)) {
                sFormat = new FastaFormat();
                sbFact = new FastaDescriptionLineParser.Factory(SimpleSequenceBuilder.FACTORY);
                alpha = ProteinTools.getAlphabet();
            } else if ("genbank".equalsIgnoreCase(format)) {
                sFormat = new GenbankFormat();
                sbFact = new GenbankProcessor.Factory(SimpleSequenceBuilder.FACTORY);
                alpha = DNATools.getDNA();
            } else {
                System.err.println("Unknown format: " + format);
                return;
            }

            SymbolTokenization rParser = alpha.getTokenization("token");

            // The first command-line argument is the path of the flat file
            File inputFile = new File(args[0]);
            BufferedReader sReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile)));
            SequenceIterator seqI =
                new StreamReader(sReader, sFormat, rParser, sbFact);

            // Add each parsed sequence to the database, printing a dot as progress
            while (seqI.hasNext()) {
                try {
                    System.out.print(".");
                    Sequence seq = seqI.nextSequence();
                    seqDB.addSequence(seq);
                } catch (Throwable t) {
                    // Report the failure and carry on with the next sequence
                    t.printStackTrace(System.out);
                }
            }
        }
        catch (Throwable t) {
            t.printStackTrace();
            System.exit(1);
        }
    }
}
=========================================