[Bioperl-l] bioperl-db issues
Chris Fields
cjfields at uiuc.edu
Tue Feb 21 21:58:07 UTC 2006
Sorry about the huge delay in this response, got caught up with other
things.
> > Bad News: There's a new problem now. I updated from CVS yesterday; I
> > walked
> > through the steps and ran 'nmake test', with everything passing fine.
> > However, load_seqdatabase.pl is extremely slow; it's loading a sequence
> > every 5 minutes or so. I noticed (when using '-debug') that it is
> > hanging
> > up in Bio::DB::BioSQL::SpeciesAdaptor each time. If I create a
> > database,
> > load the biosql schema, and load sequences w/o loading taxonomy, the
> > problem
> > goes away.
> >
> > Here's the debugging output (I cut it off at the point it hangs up):
> > [...]
>
> > preparing UK select statement: SELECT taxon_name.taxon_id, NULL, NULL,
> > taxon.ncbi_taxon_id, taxon_name.name, NULL FROM taxon, taxon_name WHERE
> > taxon.taxon_id = taxon_name.taxon_id AND name_class = ? AND
> > ncbi_taxon_id =
> > ?
> > SpeciesAdaptor: binding UK column 1 to "scientific name" (name_class)
> > SpeciesAdaptor: binding UK column 2 to "208964" (ncbi_taxid)
>
> I'm a bit surprised if this is the query where it hangs. Are the
> indexes all there? There should be a primary key index on
> taxon.taxon_id, unique indexes on taxon.ncbi_taxon_id and on taxon_name
> over (taxon_id,name,name_class). Also, there should be separate indexes
> on taxon_name.taxon_id and taxon_name.name. Are they all there? If you
> reinstantiated the schema from the DDL then it seems unlikely that
> somehow the indexes have vanished except if you messed with the schema
> or the DDL.
So far everything looks like you mentioned (see below for the ANALYZE
stuff). The only thing that I wasn't sure about was that taxon_name indexes
were all primary keys. That's really it.
> Putting an index on taxon_name.name_class really can't make sense, so
> let's assume it can't be that.
>
> So really I suspect this has something to do with the state of the
> database and the version of MySQL. In particular, from some 4.x version
> of MySQL under certain circumstances you have to analyze the statistics
> of the tables in order to get the optimizer pick up the indexes
> properly. Are you on MySQL 4.x and if so, have you done that?
>
> There's the ANALYZE TABLE command:
> http://dev.mysql.com/doc/refman/4.1/en/analyze-table.html
>
> Note the comment: "This statement works with MyISAM, BDB, and (as of
> MySQL 4.0.13) InnoDB tables." Is your MySQL version 4.0.13 or higher?
>
> Also, you can check the execution plan for the query using EXPLAIN.
> http://dev.mysql.com/doc/refman/4.1/en/explain.html
>
> This should show you whether the index would be picked up for the query
> or not. EXPLAIN as well as ANALYZE TABLE will need you to connect to
> the db using the mysql shell (mysql).
>
> I believe something similarly strange was encountered by someone using
> DB::GFF (or Chado) under MySQL, and if I recall correctly the solution
> was to optimize (analyze) the tables. Maybe someone who was in that
> thread reads this and can comment?
I find it odd that it worked well back in December and doesn't work now. I
updated bioperl and bioperl-db from CVS since then, so have there been any
changes that may have caused this? I noticed a few changes here and there.
Here's what I have tried thus far:
1) I reinstalled MySQL. I thought it might be that I had my database on a
partitioned drive, so I reinstalled on the main drive.
2) I rebuilt the database from scratch, loading taxonomy fresh, loaded the
schema, and got the same error when loading (hanging on SpeciesAdaptor.
Tried ANALYZE:
------------------------------------
mysql> ANALYZE TABLE taxon;
+----------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------+---------+----------+----------+
| bioseqdb.taxon | analyze | status | OK |
+----------------+---------+----------+----------+
1 row in set (0.42 sec)
mysql> ANALYZE TABLE taxon_name;
+---------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+---------------------+---------+----------+----------+
| bioseqdb.taxon_name | analyze | status | OK |
+---------------------+---------+----------+----------+
1 row in set (0.36 sec)
mysql>
------------------------------------
so that's fine.
3) Using EXPLAIN table:
------------------------------------
mysql> EXPLAIN taxon;
+-------------------+---------------------+------+-----+---------+----------
------+
| Field | Type | Null | Key | Default | Extra
|
+-------------------+---------------------+------+-----+---------+----------
------+
| taxon_id | int(10) unsigned | NO | PRI | NULL |
auto_increment |
| ncbi_taxon_id | int(10) | YES | UNI | NULL |
|
| parent_taxon_id | int(10) unsigned | YES | MUL | NULL |
|
| node_rank | varchar(32) | YES | | NULL |
|
| genetic_code | tinyint(3) unsigned | YES | | NULL |
|
| mito_genetic_code | tinyint(3) unsigned | YES | | NULL |
|
| left_value | int(10) unsigned | YES | UNI | NULL |
|
| right_value | int(10) unsigned | YES | UNI | NULL |
|
+-------------------+---------------------+------+-----+---------+----------
------+
8 rows in set (0.02 sec)
mysql> EXPLAIN taxon_name;
+------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+-------+
| taxon_id | int(10) unsigned | NO | PRI | | |
| name | varchar(255) | NO | PRI | | |
| name_class | varchar(32) | NO | PRI | | |
+------------+------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
------------------------------------
Does taxon_name need three primary keys?
4) So I tried reloading the sequences:
------------------------------------
C:\Perl\src\bioperl\bioperl-db\scripts\biosql>load_seqdatabase.pl -format
genbank -dbname bioseqdb -dbuser root -dbpass ********** -testonly -safe
-debug NP_249092.gpt
And got this:
Loading NP_249092.gpt ...
attempting to load adaptor class for Bio::Seq::RichSeq
attempting to load module Bio::DB::BioSQL::RichSeqAdaptor
......
SimpleValueAdaptor::add_assoc: binding column 4 to "1" (rank)
SimpleValueAdaptor::add_assoc: binding column 1 to "21" (FK to
Bio::SeqFeature::Generic)
SimpleValueAdaptor::add_assoc: binding column 2 to "34" (FK to
Bio::Annotation::SimpleValue)
SimpleValueAdaptor::add_assoc: binding column 3 to "11" (value)
SimpleValueAdaptor::add_assoc: binding column 4 to "1" (rank)
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
no adaptor found for class Bio::Annotation::TypeManager
BioNamespaceAdaptor: binding UK column 1 to "bioperl" (namespace)
SpeciesAdaptor: binding UK column 1 to "scientific name" (name_class)
SpeciesAdaptor: binding UK column 2 to "208963" (ncbi_taxid)
------------------------------------
Which is where it hangs, as before, usually about 2 minutes for each
sequence. It seems there's a timeout happening in there somewhere... It
definitely has something to do with the lookup, but like I said it did run
much faster last Nov-Dec.
So I'm a bit lost now. Any ideas?
I may try re-optimizing tables to see if it helps any.
I'm also really thinking of giving postgresql a shot but I have used mysql
for a while now; I'd like to stay with it if I can.
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign
More information about the Bioperl-l
mailing list