[Bioperl-l] Problems with biosql

Fri Sep 19 14:36:29 EDT 2003

First off, sorry for the late reply. I've been on vacation.

On 9/7/03 2:04 PM, "Christopher Mason" <Mason.Christopher at mayo.edu> wrote:

> Howdy-
> 
> I'm using the latest CVS versions (as of 5 Sep 03) of bioperl-live,
> bioperl-db, and biosql-schema with the latest PostgreSQL (7.3.4) and Perl
> (5.8.0). I'm trying to load all of swiss-prot into a biosql database, and
> it's not going well.  Although my ultimate goal is to manipulate this
> database from java, I'm using perl because the various docs I've read seem
> to indicate this is the way to go for loading (if it's not, please tell
> me).

You should be able to use biojava for loading as well, but bioperl-db has
the much configurable script load_seqdatabase.pl.

First off, after you unpacked or downloaded bioperl-db, did you run  the
test suite? Did all tests pass?

> 
> There are some errors output when loading the schema (see below).  But in
> general, creating the database seems to work.
> 
> However, when trying to run:
> 
>> bioperl-db/scripts/biosql/load_seqdatabase.pl --dbname biosql
>>    --driver Pg --format swiss --dbuser cmason
>>    --namespace bioperl sprot.dat
> 
> I immediately get this error:
> 
>> Could not store P15711:
>> ------------- EXCEPTION  -------------
>> MSG: You're trying to lie about the length: is 102 but you say 924
> 
> (P15711 is the very first entry in the file.)
> (Full traceback below.)

This would mean that P15711 was already in the database, with an erroneous
(truncated?) sequence. Did you load that before, and did you manipulate the
seq column in biosequence?

> 
> which seems to be generated here:
> 
> Bio/PrimarySeq.pm:419
>>                        "You're trying to lie about the length: ".
>>                          "is $len but you say ".$val);
> 
> called from here:
> 
> Bio/DB/BioSQL/BiosequenceAdaptor.pm:252
>>         $obj->alphabet($rows->[3]) if $rows->[3];
>>         $obj->seq($rows->[4]) if $rows->[4];
>>         $obj->length($rows->[2]) if $rows->[2];  # <---- 252
>>         if($obj->isa("Bio::DB::PersistentObjectI") &&
> 
> $rows is
> 
>> [1, undef, 924, protein, undef, 1]
> 
> 
> Commenting out the indicated line seems to prevent this error message.
> However, then I get, about two days later, this message:
> 
>> Out of memory!

These two should be unrelated. If all goes correctly, the length column
should have a value identical to the length of the sequence, and therefore
there should be no exception.

In your case, the length of the sequence as it comes back from the database
is much shorter than the one in the flat file. Somewhere in between it must
have gotten truncated or updated.

> 
> The state of the database is odd:
> 
>> biosql=# select count(bioentry_id) from bioentry;
>>  count
>> -------
>>      1
>> (1 row)
> 
> but:
> 
>> biosql=# select count (seqfeature_id) from location;
>>  count
>> -------
>>   1329
>> (1 row)

This is indeed odd. There should be UK and FK constraints that should
completely prevent this from happening. Have you tinkered with the
constraints? What is the result of

    biosql=# select count(distinct bioentry_id) from seqfeature;

and

    biosql=# select count(distinct seqfeature_id) from location;

I suspect there is something wrong with either the installation of Pg or
with the instantiation of biosql. I'd start over with a fresh instance. Try
to insert something made up into seqfeature. It should fail. Same for
location. Try to cut out the first say 10 entries from sprot and load them
and see whether you succeed. You can supply --testonly to have everything
rolled back at the end automatically.

Also, note that it is advisable to pre-load the taxonomy database (see
load_taxonomy.pl in the biosql-schema module). Otherwise you'll see errors
from misparsed organisms.

> 
> and:
> 
>> # du -sk /home/postgres/
>> 739724  /home/postgres
> 
> (There are no other database besides biosql.)
> 
> (I tried VACUUMing the database which caused it to grow by about 100MB, but
> nothing else shows up.)

You mean shrink, not grow?

> 
> It's hard to tell how far it's gotten when it runs out of memory.  I sort
> of expected the size of the finished database to be somewhat larger than
> the size of the flat file.

It should be at least 3-4 times that size b/c of indexes.

> 
> But even if it's almost finished, it's incredibly slow (at least 1,300
> minutes of user time, not counting postgres).  Would mysql be much faster?
> Or should I simply be prepared to wait a long time?
> 
> Has anyone tried this recently (importing all of swiss prot into a biosql
> database) with any database (postgres, mysql, oracle, etc.)?  If so, can
> you give me (even approximate) performance numbers (for loading, selecting
> a sequence, etc.) and ultimate database size on disk?

I'm doing this on a regular basis for a biosql instance on Oracle. Swissprot
and TREMBL load (or update) over night when run in parallel.

The problem with Pg is that is relies a lot on index statistics. When you're
first building your database, you ramp up the content from zero to a couple
hundred thousand bioentries (and 3-5 times more features, annotation
associations, etc). So, a while into the upload run, the statistics are
grossly wrong and may lead Pg making the wrong decisions on the query plan.
You can try to run vacuumdb --analyze every hour or so while it's loading
(you can run this concurrently meanwhile) to try to remedy this problem.

> I'm trying to 
> determine if this is a viable way of architecting my application (which
> incidentally, will probably be written in java, not perl).

This is along our architecture here. We do the data loads in perl, whereas
the web-app for browsing and searching uses the J2EE stack. Works fine.

> 
> Also, why is this code spread out over three different CVS modules?

Because those are the components. Biosql is the schema (which, as you
noticed already, is not tied to perl or bioperl in any way), then you have
the bio* library according to your language preference (e.g., bioperl), and
the third component is the language bindings, which for bioperl has
traditionally been in its own module (bioperl-db).

> 
> Thanks,
> 
> -c
> 
> When loading the schema:
> 
>>> psql biosql < biosqldb-views-pg.sql

This hasn't been updated for the latest version of the schema. You don't
need this.

>> ERROR:  Relation "seqfeature_key" does not exist
>> ERROR:  view "gff" does not exist
>> ERROR:  Relation "ontology_term" does not exist
>> ERROR:  Relation "ontology_term" does not exist
>> ERROR:  Relation "fasta" does not exist
>> ERROR:  Relation "ontology_term" does not exist
>> ERROR:  parser: parse error at end of input
>> ERROR:  RemoveFunction: function compl(text) does not exist
>> CREATE FUNCTION
>> ERROR:  RemoveFunction: function reverse(text) does not exist
>> ERROR:  stat failed on file
>> '/home/cjm/cvs/biosql-schema/ext/biosqldb-funcs.so': No such file or
>> directory ERROR:  Function reverse("unknown") does not exist
>>         Unable to identify a function that satisfies the given argument
>> types         You may need to add explicit typecasts
>> ERROR:  RemoveFunction: function get_subseq(text, integer, integer,
>> integer) does not exist CREATE FUNCTION
>>  get_subseq
>> ------------
>>  bc
>> (1 row)
>> 
>> ERROR:  view "gffseq" does not exist
>> ERROR:  Relation "seqfeature_key" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
>> ERROR:  Relation "seqfeature_key_v" does not exist
> 
> and:
> 
>>> psql biosql < biosql-accelerators-pg.sql

Same applies here. Although I'm not sure whether biojava still depends on it
or not.

    -hilmar
>> ERROR:  RemoveFunction: function biosql_accelerators_level() does not
>> exist CREATE FUNCTION
>> ERROR:  RemoveFunction: function intern_ontology_term(text) does not exist
>> CREATE FUNCTION
>> ERROR:  RemoveFunction: function intern_seqfeature_source(text) does not
>> exist CREATE FUNCTION
>> ERROR:  RemoveFunction: function create_seqfeature(integer, text, text)
>> does not exist CREATE FUNCTION
>> ERROR:  RemoveFunction: function create_seqfeature_onespan(integer, text,
>> text, integer, integer, integer) does not exist CREATE FUNCTION
> 
> 
> Then when trying to load:
> 
> 
>> ------------- EXCEPTION  -------------
>> MSG: You're trying to lie about the length: is 102 but you say 924
>> STACK Bio::PrimarySeq::length
>> /usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:419 STACK
>> Bio::DB::Persistent::PersistentObject::AUTOLOAD
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:541
>> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.0/Bio/Seq.pm:612
>> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:541
>> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BiosequenceAdaptor.pm:254
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:12
>> 78 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:966
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:851
>> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
>> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/SeqAdaptor.pm:279 STACK
>> Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:13
>> 09 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:966
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:851
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:204
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:253
>> STACK Bio::DB::Persistent::PersistentObject::store
>> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:270
>> STACK (eval) ./load_seqdatabase.pl:446
>> STACK toplevel ./load_seqdatabase.pl:429
>> 
>> --------------------------------------
> 
> 
> 
> 

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------