[Bioperl-l] bioperl-db issues

Tue Feb 21 21:58:07 UTC 2006

Sorry about the huge delay in this response, got caught up with other
things.

> > Bad News:  There's a new problem now. I updated from CVS yesterday; I
> > walked
> > through the steps and ran 'nmake test', with everything passing fine.
> > However, load_seqdatabase.pl is extremely slow; it's loading a sequence
> > every 5 minutes or so.  I noticed (when using '-debug') that it is
> > hanging
> > up in Bio::DB::BioSQL::SpeciesAdaptor each time.  If I create a
> > database,
> > load the biosql schema, and load sequences w/o loading taxonomy, the
> > problem
> > goes away.
> >
> > Here's the debugging output (I cut it off at the point it hangs up):
> > [...]
> 
> > preparing UK select statement: SELECT taxon_name.taxon_id, NULL, NULL,
> > taxon.ncbi_taxon_id, taxon_name.name, NULL FROM taxon, taxon_name WHERE
> > taxon.taxon_id = taxon_name.taxon_id AND name_class = ? AND
> > ncbi_taxon_id =
> > ?
> > SpeciesAdaptor: binding UK column 1 to "scientific name" (name_class)
> > SpeciesAdaptor: binding UK column 2 to "208964" (ncbi_taxid)
> 
> I'm a bit surprised if this is the query where it hangs. Are the
> indexes all there? There should be a primary key index on
> taxon.taxon_id, unique indexes on taxon.ncbi_taxon_id and on taxon_name
> over (taxon_id,name,name_class). Also, there should be separate indexes
> on taxon_name.taxon_id and taxon_name.name. Are they all there? If you
> reinstantiated the schema from the DDL then it seems unlikely that
> somehow the indexes have vanished except if you messed with the schema
> or the DDL.

So far everything looks like you mentioned (see below for the ANALYZE
stuff).  The only thing that I wasn't sure about was that taxon_name indexes
were all primary keys.  That's really it.

> Putting an index on taxon_name.name_class really can't make sense, so
> let's assume it can't be that.
> 
> So really I suspect this has something to do with the state of the
> database and the version of MySQL. In particular, from some 4.x version
> of MySQL under certain circumstances you have to analyze the statistics
> of the tables in order to get the optimizer pick up the indexes
> properly. Are you on MySQL 4.x and if so, have you done that?
> 
> There's the ANALYZE TABLE command:
> http://dev.mysql.com/doc/refman/4.1/en/analyze-table.html
> 
> Note the comment: "This statement works with MyISAM, BDB, and (as of
> MySQL 4.0.13) InnoDB tables." Is your MySQL version 4.0.13 or higher?
>
> Also, you can check the execution plan for the query using EXPLAIN.
> http://dev.mysql.com/doc/refman/4.1/en/explain.html
> 
> This should show you whether the index would be picked up for the query
> or not. EXPLAIN as well as ANALYZE TABLE will need you to connect to
> the db using the mysql shell (mysql).
> 
> I believe something similarly strange was encountered by someone using
> DB::GFF (or Chado) under MySQL, and if I recall correctly the solution
> was to optimize (analyze) the tables. Maybe someone who was in that
> thread reads this and can comment?

I find it odd that it worked well back in December and doesn't work now.  I
updated bioperl and bioperl-db from CVS since then, so have there been any
changes that may have caused this?  I noticed a few changes here and there.

Here's what I have tried thus far:

1) I reinstalled MySQL.  I thought it might be that I had my database on a
partitioned drive, so I reinstalled on the main drive.

2) I rebuilt the database from scratch, loading taxonomy fresh, loaded the
schema, and got the same error when loading (hanging on SpeciesAdaptor.
Tried ANALYZE:
------------------------------------
mysql> ANALYZE TABLE taxon;
+----------------+---------+----------+----------+
| Table          | Op      | Msg_type | Msg_text |
+----------------+---------+----------+----------+
| bioseqdb.taxon | analyze | status   | OK       |
+----------------+---------+----------+----------+
1 row in set (0.42 sec)

mysql> ANALYZE TABLE taxon_name;
+---------------------+---------+----------+----------+
| Table               | Op      | Msg_type | Msg_text |
+---------------------+---------+----------+----------+
| bioseqdb.taxon_name | analyze | status   | OK       |
+---------------------+---------+----------+----------+
1 row in set (0.36 sec)

mysql>
------------------------------------
so that's fine.  

3) Using EXPLAIN table:
------------------------------------
mysql> EXPLAIN taxon;
+-------------------+---------------------+------+-----+---------+----------
------+
| Field             | Type                | Null | Key | Default | Extra
|
+-------------------+---------------------+------+-----+---------+----------
------+
| taxon_id          | int(10) unsigned    | NO   | PRI | NULL    |
auto_increment |
| ncbi_taxon_id     | int(10)             | YES  | UNI | NULL    |
|
| parent_taxon_id   | int(10) unsigned    | YES  | MUL | NULL    |
|
| node_rank         | varchar(32)         | YES  |     | NULL    |
|
| genetic_code      | tinyint(3) unsigned | YES  |     | NULL    |
|
| mito_genetic_code | tinyint(3) unsigned | YES  |     | NULL    |
|
| left_value        | int(10) unsigned    | YES  | UNI | NULL    |
|
| right_value       | int(10) unsigned    | YES  | UNI | NULL    |
|
+-------------------+---------------------+------+-----+---------+----------
------+
8 rows in set (0.02 sec)

mysql> EXPLAIN taxon_name;
+------------+------------------+------+-----+---------+-------+
| Field      | Type             | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+-------+
| taxon_id   | int(10) unsigned | NO   | PRI |         |       |
| name       | varchar(255)     | NO   | PRI |         |       |
| name_class | varchar(32)      | NO   | PRI |         |       |
+------------+------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

------------------------------------
Does taxon_name need three primary keys?

4) So I tried reloading the sequences:
------------------------------------
C:\Perl\src\bioperl\bioperl-db\scripts\biosql>load_seqdatabase.pl -format
genbank -dbname bioseqdb -dbuser root -dbpass ********** -testonly -safe
-debug NP_249092.gpt

And got this:

Loading NP_249092.gpt ...
attempting to load adaptor class for Bio::Seq::RichSeq
        attempting to load module Bio::DB::BioSQL::RichSeqAdaptor
......
SimpleValueAdaptor::add_assoc: binding column 4 to "1" (rank)
SimpleValueAdaptor::add_assoc: binding column 1 to "21" (FK to
Bio::SeqFeature::Generic)
SimpleValueAdaptor::add_assoc: binding column 2 to "34" (FK to
Bio::Annotation::SimpleValue)
SimpleValueAdaptor::add_assoc: binding column 3 to "11" (value)
SimpleValueAdaptor::add_assoc: binding column 4 to "1" (rank)
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
BioNamespaceAdaptor: binding UK column 1 to "bioperl" (namespace)
SpeciesAdaptor: binding UK column 1 to "scientific name" (name_class)
SpeciesAdaptor: binding UK column 2 to "208963" (ncbi_taxid)
------------------------------------
Which is where it hangs, as before, usually about 2 minutes for each
sequence.  It seems there's a timeout happening in there somewhere...  It
definitely has something to do with the lookup, but like I said it did run
much faster last Nov-Dec.

So I'm a bit lost now.  Any ideas?  

I may try re-optimizing tables to see if it helps any.

I'm also really thinking of giving postgresql a shot but I have used mysql
for a while now; I'd like to stay with it if I can.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign