[Bioperl-l] Problems with biosql
Christopher Mason
Mason.Christopher at mayo.edu
Sun Sep 7 17:04:13 EDT 2003
Howdy-
I'm using the latest CVS versions (as of 5 Sep 03) of bioperl-live,
bioperl-db, and biosql-schema with the latest PostgreSQL (7.3.4) and Perl
(5.8.0). I'm trying to load all of swiss-prot into a biosql database, and
it's not going well. Although my ultimate goal is to manipulate this
database from java, I'm using perl because the various docs I've read seem
to indicate this is the way to go for loading (if it's not, please tell
me).
Loading the schema produces some errors (see below), but in general,
creating the database seems to work.
However, when trying to run:
> bioperl-db/scripts/biosql/load_seqdatabase.pl --dbname biosql
> --driver Pg --format swiss --dbuser cmason
> --namespace bioperl sprot.dat
I immediately get this error:
> Could not store P15711:
> ------------- EXCEPTION -------------
> MSG: You're trying to lie about the length: is 102 but you say 924
(P15711 is the very first entry in the file.)
(Full traceback below.)
which seems to be generated here:
Bio/PrimarySeq.pm:419
> "You're trying to lie about the length: ".
> "is $len but you say ".$val);
called from here:
Bio/DB/BioSQL/BiosequenceAdaptor.pm:252
> $obj->alphabet($rows->[3]) if $rows->[3];
> $obj->seq($rows->[4]) if $rows->[4];
> $obj->length($rows->[2]) if $rows->[2]; # <---- 252
> if($obj->isa("Bio::DB::PersistentObjectI") &&
$rows is
> [1, undef, 924, protein, undef, 1]
Commenting out the indicated line seems to prevent this error message.
However, about two days later, I then get this message:
> Out of memory!
The state of the database is odd:
> biosql=# select count(bioentry_id) from bioentry;
> count
> -------
> 1
> (1 row)
but:
> biosql=# select count (seqfeature_id) from location;
> count
> -------
> 1329
> (1 row)
and:
># du -sk /home/postgres/
> 739724 /home/postgres
(There are no other databases besides biosql.)
(I tried VACUUMing the database, which caused it to grow by about 100MB, but
nothing else shows up.)
It's hard to tell how far it's gotten when it runs out of memory. I sort
of expected the size of the finished database to be somewhat larger than
the size of the flat file.
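One rough way to gauge progress, offered as a sketch rather than a tested procedure: every Swiss-Prot flat-file entry ends with a line starting with `//`, so counting those lines gives the total number of entries in the file, which can then be compared with the row count in `bioentry`. The function name `count_entries` is made up for this illustration and is not part of bioperl-db.

```shell
#!/bin/sh
# Count the entries in a Swiss-Prot flat file.
# Each entry is terminated by a line that begins with "//".
# count_entries is a hypothetical helper name, not part of bioperl-db.
count_entries() {
    grep -c '^//' "$1"
}

# Example: total number of entries in the file being loaded.
# count_entries sprot.dat
```

Dividing the `bioentry` count by this total gives an approximate fraction loaded, assuming the loader commits entries as it goes.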
But even if it's almost finished, it's incredibly slow (at least 1,300
minutes of user time, not counting postgres). Would mysql be much faster?
Or should I simply be prepared to wait a long time?
Has anyone tried this recently (importing all of swiss-prot into a biosql
database) with any database (postgres, mysql, oracle, etc.)? If so, can
you give me (even approximate) performance numbers (for loading, selecting
a sequence, etc.) and ultimate database size on disk? I'm trying to
determine if this is a viable way of architecting my application (which
incidentally, will probably be written in java, not perl).
Also, why is this code spread out over three different CVS modules?
Thanks,
-c
When loading the schema:
>> psql biosql < biosqldb-views-pg.sql
> ERROR: Relation "seqfeature_key" does not exist
> ERROR: view "gff" does not exist
> ERROR: Relation "ontology_term" does not exist
> ERROR: Relation "ontology_term" does not exist
> ERROR: Relation "fasta" does not exist
> ERROR: Relation "ontology_term" does not exist
> ERROR: parser: parse error at end of input
> ERROR: RemoveFunction: function compl(text) does not exist
> CREATE FUNCTION
> ERROR: RemoveFunction: function reverse(text) does not exist
> ERROR: stat failed on file '/home/cjm/cvs/biosql-schema/ext/biosqldb-funcs.so': No such file or directory
> ERROR: Function reverse("unknown") does not exist
> Unable to identify a function that satisfies the given argument types
> You may need to add explicit typecasts
> ERROR: RemoveFunction: function get_subseq(text, integer, integer, integer) does not exist
> CREATE FUNCTION
> get_subseq
> ------------
> bc
> (1 row)
>
> ERROR: view "gffseq" does not exist
> ERROR: Relation "seqfeature_key" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
> ERROR: Relation "seqfeature_key_v" does not exist
and:
>> psql biosql < biosql-accelerators-pg.sql
> ERROR: RemoveFunction: function biosql_accelerators_level() does not exist
> CREATE FUNCTION
> ERROR: RemoveFunction: function intern_ontology_term(text) does not exist
> CREATE FUNCTION
> ERROR: RemoveFunction: function intern_seqfeature_source(text) does not exist
> CREATE FUNCTION
> ERROR: RemoveFunction: function create_seqfeature(integer, text, text) does not exist
> CREATE FUNCTION
> ERROR: RemoveFunction: function create_seqfeature_onespan(integer, text, text, integer, integer, integer) does not exist
> CREATE FUNCTION
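Since many of the errors above repeat, a tally of the distinct messages makes it easier to see which missing relations or functions account for most of the failures. The following is a minimal sketch, assuming the psql output has been saved to a log file and that each error starts its own line with `ERROR:`. The helper name `summarize_errors` and the log file name `views.log` are hypothetical.

```shell
#!/bin/sh
# Tally distinct ERROR lines from a saved psql log, most frequent first.
# summarize_errors is a hypothetical helper name.
summarize_errors() {
    grep '^ERROR:' "$1" | sort | uniq -c | sort -rn
}

# Example (the log file name is an assumption):
# psql biosql < biosqldb-views-pg.sql > views.log 2>&1
# summarize_errors views.log
```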
Then when trying to load:
> ------------- EXCEPTION -------------
> MSG: You're trying to lie about the length: is 102 but you say 924
> STACK Bio::PrimarySeq::length /usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:419
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:541
> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.0/Bio/Seq.pm:612
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:541
> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BiosequenceAdaptor.pm:254
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1278
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:966
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:851
> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/SeqAdaptor.pm:279
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1309
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:966
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:851
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:204
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:253
> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:270
> STACK (eval) ./load_seqdatabase.pl:446
> STACK toplevel ./load_seqdatabase.pl:429
>
> --------------------------------------
--
[ Christopher Mason MPRC Bioinformatics cjm37 at mayo.edu ]