[Bioperl-l] Problems with biosql

Christopher Mason Mason.Christopher at mayo.edu
Sun Sep 7 17:04:13 EDT 2003


Howdy-

I'm using the latest CVS versions (as of 5 Sep 03) of bioperl-live, 
bioperl-db, and biosql-schema with the latest PostgreSQL (7.3.4) and Perl 
(5.8.0). I'm trying to load all of Swiss-Prot into a biosql database, and 
it's not going well.  Although my ultimate goal is to manipulate this 
database from Java, I'm using Perl because the various docs I've read seem 
to indicate this is the way to go for loading (if it's not, please tell 
me).

Loading the schema prints some errors (see below), but in general, 
creating the database seems to work.

However, when trying to run:

> bioperl-db/scripts/biosql/load_seqdatabase.pl --dbname biosql
>    --driver Pg --format swiss --dbuser cmason
>    --namespace bioperl sprot.dat

I immediately get this error:

> Could not store P15711:
> ------------- EXCEPTION  -------------
> MSG: You're trying to lie about the length: is 102 but you say 924

(P15711 is the very first entry in the file.)
(Full traceback below.)

which seems to be generated here:

Bio/PrimarySeq.pm:419
>                        "You're trying to lie about the length: ".
>                          "is $len but you say ".$val);

called from here:

Bio/DB/BioSQL/BiosequenceAdaptor.pm:252
>         $obj->alphabet($rows->[3]) if $rows->[3];
>         $obj->seq($rows->[4]) if $rows->[4];
>         $obj->length($rows->[2]) if $rows->[2];  # <---- 252
>         if($obj->isa("Bio::DB::PersistentObjectI") &&

$rows is

> [1, undef, 924, protein, undef, 1]
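
To make the failure concrete, here is a minimal standalone sketch of what 
I think is happening. It is my own stand-in, not the actual 
Bio::PrimarySeq or BiosequenceAdaptor code: set_length and the hash-based 
object are hypothetical, and I'm assuming the setter rejects any claimed 
length that disagrees with a sequence already on the object, which is 
what the message suggests.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the length setter; NOT the Bio::PrimarySeq source.
# It refuses a claimed length that disagrees with a sequence already present.
sub set_length {
    my ($obj, $claimed) = @_;
    my $len = defined $obj->{seq} ? length $obj->{seq} : 0;
    if ($len && $claimed != $len) {
        die "You're trying to lie about the length: "
          . "is $len but you say $claimed\n";
    }
    $obj->{length} = $claimed;
    return $claimed;
}

# Mimics the three assignments quoted above, in the same order.
sub populate_from_row {
    my ($obj, $rows) = @_;
    $obj->{alphabet} = $rows->[3] if $rows->[3];
    $obj->{seq}      = $rows->[4] if $rows->[4];
    set_length($obj, $rows->[2])  if $rows->[2];
    return $obj;
}

# An in-memory entry with 102 residues, and the row I observed: it carries
# no sequence (index 4 is undef) but claims a length of 924.
my $obj  = { seq => 'A' x 102 };
my $rows = [1, undef, 924, 'protein', undef, 1];

my $ok = eval { populate_from_row($obj, $rows); 1 };
print $ok ? "populated\n" : "failed: $@";
```

Run as-is, this takes the failure branch and reports the same message as 
above. If the real code behaves like this sketch, the row would be 
describing something other than the 102-residue entry being stored, but 
that is only a guess on my part.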


Commenting out the indicated line seems to prevent this error message. 
However, about two days later, I then get this message:

> Out of memory!

The state of the database is odd:

> biosql=# select count(bioentry_id) from bioentry;
>  count
> -------
>      1
> (1 row)

but:

> biosql=# select count (seqfeature_id) from location;
>  count
> -------
>   1329
> (1 row)

and:

># du -sk /home/postgres/
> 739724  /home/postgres

(There are no other databases besides biosql.)

(I tried VACUUMing the database which caused it to grow by about 100MB, but 
nothing else shows up.)

It's hard to tell how far it's gotten when it runs out of memory.  I sort 
of expected the size of the finished database to be somewhat larger than 
the size of the flat file.

But even if it's almost finished, it's incredibly slow (at least 1,300 
minutes of user time, not counting Postgres).  Would MySQL be much faster? 
Or should I simply be prepared to wait a long time?

Has anyone tried this recently (importing all of Swiss-Prot into a biosql 
database) with any database (Postgres, MySQL, Oracle, etc.)?  If so, can 
you give me (even approximate) performance numbers (for loading, selecting 
a sequence, etc.) and ultimate database size on disk?  I'm trying to 
determine if this is a viable way of architecting my application (which, 
incidentally, will probably be written in Java, not Perl).

Also, why is this code spread out over three different CVS modules?

Thanks,

-c

When loading the schema:

>> psql biosql < biosqldb-views-pg.sql
> ERROR:  Relation "seqfeature_key" does not exist
> ERROR:  view "gff" does not exist
> ERROR:  Relation "ontology_term" does not exist
> ERROR:  Relation "ontology_term" does not exist
> ERROR:  Relation "fasta" does not exist
> ERROR:  Relation "ontology_term" does not exist
> ERROR:  parser: parse error at end of input
> ERROR:  RemoveFunction: function compl(text) does not exist
> CREATE FUNCTION
> ERROR:  RemoveFunction: function reverse(text) does not exist
> ERROR:  stat failed on file '/home/cjm/cvs/biosql-schema/ext/biosqldb-funcs.so': No such file or directory
> ERROR:  Function reverse("unknown") does not exist
>         Unable to identify a function that satisfies the given argument types
>         You may need to add explicit typecasts
> ERROR:  RemoveFunction: function get_subseq(text, integer, integer, integer) does not exist
> CREATE FUNCTION
>  get_subseq
> ------------
>  bc
> (1 row)
>
> ERROR:  view "gffseq" does not exist
> ERROR:  Relation "seqfeature_key" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist
> ERROR:  Relation "seqfeature_key_v" does not exist

and:

>> psql biosql < biosql-accelerators-pg.sql
> ERROR:  RemoveFunction: function biosql_accelerators_level() does not exist
> CREATE FUNCTION
> ERROR:  RemoveFunction: function intern_ontology_term(text) does not exist
> CREATE FUNCTION
> ERROR:  RemoveFunction: function intern_seqfeature_source(text) does not exist
> CREATE FUNCTION
> ERROR:  RemoveFunction: function create_seqfeature(integer, text, text) does not exist
> CREATE FUNCTION
> ERROR:  RemoveFunction: function create_seqfeature_onespan(integer, text, text, integer, integer, integer) does not exist
> CREATE FUNCTION


Then when trying to load:


> ------------- EXCEPTION  -------------
> MSG: You're trying to lie about the length: is 102 but you say 924
> STACK Bio::PrimarySeq::length /usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:419
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:541
> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.0/Bio/Seq.pm:612
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:541
> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BiosequenceAdaptor.pm:254
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1278
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:966
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:851
> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/SeqAdaptor.pm:279
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1309
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:966
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:851
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:204
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:253
> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:270
> STACK (eval) ./load_seqdatabase.pl:446
> STACK toplevel ./load_seqdatabase.pl:429
>
> --------------------------------------





-- 
[ Christopher Mason    MPRC Bioinformatics    cjm37 at mayo.edu ]


