[Bioperl-l] Indexing large databases / BioSQL
Hilmar Lapp
hlapp at gmx.net
Wed Apr 2 02:30:06 UTC 2008
On Apr 1, 2008, at 8:31 AM, Bánk Beszteri wrote:
> [...] So next we started to test BioSQL, by trying to load just
> Swissprot in a MySQL DB first, like:
>
> load_seqdatabase.pl --host mysql.awi.de --dbname biosql2 --dbuser
> xyz --dbpass abc --driver mysql --namespace uniprot_sprot --format
> swiss uniprot_sprot.dat
>
> Here we get an error message
>
> ###########################################
>
> Loading /biodb/spinkern/uniprot_sprot.dat ...
> Could not store Q6DAH5:
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: The supplied lineage does not start near 'Erwinia carotovora
> subsp. atroseptica' (I was supplied 'Erwinia carotovora subsp. |
> Pectobacterium | Enterobacteriaceae | Enterobacteriales |
> Gammaproteobacteria | Proteobacteria | Bacteria')
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /biodb/spinkern/bioperl-1.5/
> bioperl-1.5.2_102/Bio/Root/Root.pm:359
> STACK: Bio::Species::classification /biodb/spinkern/bioperl-1.5/
> bioperl-1.5.2_102/Bio/Species.pm:174
> STACK: Bio::DB::Persistent::PersistentObject::AUTOLOAD /biodb/
> spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:
> 552
> STACK: Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row /biodb/
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SpeciesAdaptor.pm:281
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /
> biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:1305
> STACK:
> Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /biodb/
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:973
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /
> biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:852
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:182
> STACK: Bio::DB::Persistent::PersistentObject::create /biodb/
> spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:
> 244
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:169
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /biodb/
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:251
> STACK: Bio::DB::Persistent::PersistentObject::store /biodb/spinkern/
> bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
> STACK: load_seqdatabase.pl:622
> -----------------------------------------------------------
>
> at load_seqdatabase.pl line 635
>
> ############################################
>
> or similar, depending on whether we use a pre-loaded ncbi taxonomy
> or not
I recommend always using a pre-loaded NCBI taxonomy unless you know
there are only a few organisms that are straightforward (for the
parser, that is).
> , and which Swissprot release we are trying to load. It often seems
> to come from something like, as here, a subsp. or other special
> addition to the species line; but alternative genus names and other
> curious things also appear. It looks like Species.pm tries to
> validate the species name against the lineage info already there in
> the BioSQL DB, and in several cases, it finds inconsistencies.
It actually happens upon a successful lookup when the species object
is populated from the database.
> [...]
> The only workaround we have found so far was to comment out line
> 174 in Species.pm:
>
> $self->throw("The supplied lineage does not start near '$name' (I
> was supplied '".join(" | ", @vals)."')");
That should be OK if you work with a pre-loaded taxonomy. It's sort
of a sanity check that should catch a parser having messed up a
species. If you use a pre-loaded NCBI taxonomy the results of the
species parsing don't matter in all details so long as the NCBI
taxonID is parsed out correctly, and then found in the database.
Note that this is actually a warn() in the main trunk version of
BioPerl, so you might want to upgrade to that (or change throw() to
warn() in your version). The affected records still get flagged with
a warning, but no exception is thrown.
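
A minimal sketch of the throw()-to-warn() change as a shell helper,
assuming your copy of Bio/Species.pm still contains the statement
quoted above. The function name downgrade_lineage_check is made up for
illustration; it keeps the original file as a .bak copy and refuses to
touch a file that does not contain the expected statement.

```shell
#!/bin/sh
# Hypothetical helper: downgrade the lineage sanity check in a copy of
# Bio/Species.pm from throw() to warn(). Pass the path to the file.
downgrade_lineage_check() {
    species_pm=$1
    # Stop if the statement we expect to patch is not in this file.
    grep -q 'self->throw("The supplied lineage' "$species_pm" || {
        echo "expected throw() statement not found in $species_pm" >&2
        return 1
    }
    # -i.bak edits in place and keeps the original as <file>.bak.
    sed -i.bak \
        's/self->throw("The supplied lineage/self->warn("The supplied lineage/' \
        "$species_pm"
}
```

With the install path from the stack trace above, the call would be
downgrade_lineage_check /biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Species.pm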
>
> After doing so, load_seqdatabase.pl runs for several hours (until
> it eventually crashes; I haven't found out yet why), but proceeds
> really slowly.
It should certainly *not* crash. Note also that you can supply --safe
on the command line, in which case the script will continue with the
next record if one fails to load for whatever reason.
You will want to adjust the width constraint of dbxref.accession, for
example to 128 chars. This will also be fixed for BioSQL 1.0.1.
See http://bugzilla.open-bio.org/show_bug.cgi?id=2474
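
A sketch of how the column could be widened on MySQL. It only prints
the statement so that it can be reviewed first. The BINARY NOT NULL
attributes are an assumption about how the column is declared in your
schema; check with SHOW CREATE TABLE dbxref before running it. The
function name widen_accession_sql is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the MySQL statement that widens
# dbxref.accession to the given width (default 128).
# ASSUMPTION: the column is declared BINARY NOT NULL in your schema;
# adjust the attributes if SHOW CREATE TABLE dbxref says otherwise.
widen_accession_sql() {
    width=${1:-128}
    printf 'ALTER TABLE dbxref MODIFY accession VARCHAR(%s) BINARY NOT NULL;\n' "$width"
}

# Print the statement for review.
widen_accession_sql 128
```

Once the statement looks right, it can be piped to the mysql client,
for example: widen_accession_sql 128 | mysql --host=mysql.awi.de --user=xyz -p biosql2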
> I also found some info on this for Pg and Oracle in the mailing
> list, but does anyone have approximate numbers for MySQL? How long
> should a first Swissprot load take?
Possibly around 20 hours according to Erik Rijkers:
See http://lists.open-bio.org/pipermail/bioperl-l/2008-March/027427.html
You can use the --logchunks N option to have it print out performance
statistics every N records.
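
Putting the two options together with your original command line
could look like the sketch below. It prints the command rather than
running it; the chunk size of 1000 is an arbitrary example value, and
the function name loader_command is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the original load command
# with --safe and --logchunks added. The chunk size defaults to 1000,
# which is only an example value.
loader_command() {
    chunk=${1:-1000}
    echo "load_seqdatabase.pl --host mysql.awi.de --dbname biosql2" \
         "--dbuser xyz --dbpass abc --driver mysql" \
         "--namespace uniprot_sprot --format swiss" \
         "--safe --logchunks $chunk uniprot_sprot.dat"
}

# Show the command for review before running it by hand.
loader_command 1000
```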
Hope this helps,
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================