[Bioperl-l] Indexing large databases / BioSQL
Bánk Beszteri
Bank.Beszteri at awi.de
Tue Apr 1 12:31:49 UTC 2008
Dear list,
we have recently started looking for a way to index large sequence
databases / flat files for a Java project. Because we ran into problems
using biojava, and because both the OBDA and BioSQL approaches seem to
be compatible across the Bio* projects, we also started to experiment
with bioperl. It looks like this should work fine, but we had a couple
of problems here, too. Perhaps some of you can give me a hint as to
what we are doing wrong!
The first thing we tried was to use Bio::DB::Flat for indexing a TrEMBL
flat file (~ 12 GB), but it seems we haven't got a machine with enough
memory to handle this. (Perhaps you would use the "bdb" style index in
such a case in bioperl, but this apparently doesn't work with biojava,
so we had to stick with "flat".) So next we started to test BioSQL, by
trying to load just Swissprot into a MySQL DB first, like:
load_seqdatabase.pl --host mysql.awi.de --dbname biosql2 --dbuser xyz \
    --dbpass abc --driver mysql --namespace uniprot_sprot --format swiss \
    uniprot_sprot.dat
Here we get an error message:
###########################################
Loading /biodb/spinkern/uniprot_sprot.dat ...
Could not store Q6DAH5:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: The supplied lineage does not start near 'Erwinia carotovora subsp.
atroseptica' (I was supplied 'Erwinia carotovora subsp. | Pectobacterium
| Enterobacteriaceae | Enterobacteriales | Gammaproteobacteria |
Proteobacteria | Bacteria')
STACK: Error::throw
STACK: Bio::Root::Root::throw
/biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Root/Root.pm:359
STACK: Bio::Species::classification
/biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Species.pm:174
STACK: Bio::DB::Persistent::PersistentObject::AUTOLOAD
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:552
STACK: Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SpeciesAdaptor.pm:281
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1305
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:973
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:852
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:182
STACK: Bio::DB::Persistent::PersistentObject::create
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:244
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK: Bio::DB::Persistent::PersistentObject::store
/biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK: load_seqdatabase.pl:622
-----------------------------------------------------------
at load_seqdatabase.pl line 635
############################################
or similar, depending on whether we use a pre-loaded NCBI taxonomy or
not, and on which Swissprot release we are trying to load. It often seems
to be triggered by something like the "subsp." here, or another special
addition to the species line; but alternative genus names and other
curious things also appear. It looks like Species.pm tries to validate
the species name against the lineage info already in the BioSQL DB, and
in several cases it finds inconsistencies. If we start with the NCBI
taxonomy already loaded in the database, the first error comes much earlier.
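To compare the species name with the lineage for a failing record, it can help to look at the taxonomy lines of that entry directly in the flat file. The following is only a minimal sketch: it assumes the standard Swiss-Prot layout (two-letter line codes such as AC, OS and OC, entries terminated by a "//" line), and show_taxonomy_lines is a hypothetical helper name, not part of bioperl.

```shell
#!/bin/sh
# Hypothetical helper: print the AC, OS and OC lines of the entry whose
# AC line lists the given accession.
# Assumes every entry ends with a "//" line, as in Swiss-Prot flat files.
show_taxonomy_lines() {
    acc="$1"
    file="$2"
    awk -v acc="$acc" '
        /^\/\// {                       # end of entry: emit buffer on a hit
            if (hit) printf "%s", buf
            buf = ""; hit = 0
            next
        }
        /^AC / {                        # accessions look like "Q6DAH5; P12345;"
            line = $0
            sub(/^AC +/, "", line)
            n = split(line, ids, /; */)
            for (i = 1; i <= n; i++) {
                id = ids[i]
                sub(/;$/, "", id)
                if (id == acc) hit = 1
            }
            buf = buf $0 "\n"
            next
        }
        /^(OS|OC) / { buf = buf $0 "\n" }   # species and lineage lines
    ' "$file"
}

# Example call (file name taken from the load command above):
# show_taxonomy_lines Q6DAH5 uniprot_sprot.dat
```

The OS line holds the species name and the OC lines hold the lineage, so the two values that the exception message complains about can be read side by side.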
I found a thread on the same problem from ~ two years ago
(http://thread.gmane.org/gmane.comp.lang.perl.bio.general/13766/focus=13788),
where the solution recommended was to update bioperl, so I was quite
surprised to find the problem with the version you can see above
(1.5.2_102 bioperl core, 1.5.2_100 bioperl-db). Can someone give me any
hints as to what is going wrong here?
The only workaround we have found so far was to comment out line 174 in
Species.pm:
$self->throw("The supplied lineage does not start near '$name' (I was supplied '".join(" | ", @vals)."')");
After doing so, load_seqdatabase.pl runs for several hours (until it
eventually crashes; I haven't found out why yet), but proceeds really
slowly. I found some info on load times for Pg and Oracle in the mailing
list, but does anyone have approximate numbers for MySQL? How long should
a first Swissprot load take?
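One way to get comparable numbers without waiting for a full load would be to time the loader on a fixed number of entries. This is only a sketch under assumptions: it relies on every Swiss-Prot entry ending with a "//" line, first_entries is a hypothetical helper name, and load time will not necessarily scale linearly with the number of entries.

```shell
#!/bin/sh
# Hypothetical helper: write the first N entries of a Swiss-Prot flat file
# to stdout. An entry is everything up to and including its "//" line.
first_entries() {
    n="$1"
    file="$2"
    awk -v n="$n" '
        count >= n { exit }       # stop once N complete entries were printed
        { print }
        /^\/\// { count++ }       # "//" closes the current entry
    ' "$file"
}

# Example: build a 1000-entry subset, then time the load of that subset
# with the same options as in the load command above.
# first_entries 1000 uniprot_sprot.dat > sprot_1000.dat
# time load_seqdatabase.pl <same options as above> sprot_1000.dat
```

Reporting the time for such a subset together with the entry count would make figures from different machines and database back ends easier to compare.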
Would be grateful to hear about your ideas / experiences on these issues!
Bank Beszteri
Bioinformatics / Scientific Computing
Alfred Wegener Institute for Polar and Marine Research
Am Handelshafen 12.
27570 Bremerhaven
Germany