[Bioperl-l] Indexing large databases / BioSQL

Tue Apr 1 12:31:49 UTC 2008

Dear list,

We have recently started looking for a way to index large sequence 
databases / flat files for a java project. Because we ran into problems 
using biojava, and because both the OBDA and BioSQL approaches seem to 
be compatible across the bio~ projects, we also started to experiment 
with bioperl. It looks like this should work fine, but we ran into a 
couple of problems here, too. Perhaps some of you can give me a hint as 
to what we are doing wrong!

The first thing we tried was to use Bio::DB::Flat to index a TrEMBL 
flat file (~ 12 GB), but it seems we haven't got a machine with enough 
memory to handle this. (Perhaps one would use the "bdb" style index in 
such a case in bioperl, but this apparently doesn't work with biojava, 
so we had to stick with "flat".) Next we started to test BioSQL by 
trying to load just Swissprot into a MySQL DB first, like this:

load_seqdatabase.pl --host mysql.awi.de --dbname biosql2 --dbuser xyz 
--dbpass abc --driver mysql --namespace uniprot_sprot --format swiss 
uniprot_sprot.dat

Here we get an error message:

###########################################

Loading /biodb/spinkern/uniprot_sprot.dat ...
Could not store Q6DAH5:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: The supplied lineage does not start near 'Erwinia carotovora subsp. atroseptica' (I was supplied 'Erwinia carotovora subsp. | Pectobacterium | Enterobacteriaceae | Enterobacteriales | Gammaproteobacteria | Proteobacteria | Bacteria')
STACK: Error::throw
STACK: Bio::Root::Root::throw /biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Root/Root.pm:359
STACK: Bio::Species::classification /biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Species.pm:174
STACK: Bio::DB::Persistent::PersistentObject::AUTOLOAD /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:552
STACK: Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SpeciesAdaptor.pm:281
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1305
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:973
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:852
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:182
STACK: Bio::DB::Persistent::PersistentObject::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:244
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK: Bio::DB::Persistent::PersistentObject::store /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK: load_seqdatabase.pl:622
-----------------------------------------------------------

at load_seqdatabase.pl line 635

############################################

or similar, depending on whether we use a pre-loaded ncbi taxonomy or 
not, and on which Swissprot release we are trying to load. It often 
seems to be triggered by something like the case here, a "subsp." or 
other special addition to the species line; but alternative genus names 
and other oddities also appear. It looks like Species.pm tries to 
validate the species name against the lineage info already present in 
the BioSQL DB, and in several cases it finds inconsistencies. If we 
start with the ncbi taxonomy already loaded in the database, the first 
error comes much earlier.
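
To get a list of candidate entries before a load, one could scan the 
species (OS) lines of the flat file for such additions. Below is a 
minimal standalone Python sketch of that idea; it does not use bioperl 
and does not reproduce the check in Species.pm. The function name 
suspicious_species and the list of markers (subsp., var., pv., sp.) are 
my own assumptions about what might trigger the error, not something 
confirmed by the bioperl code.

```python
import re
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumed markers for "special additions" to the species name.
# This list is a guess for illustration, not taken from Species.pm.
SUSPECT = re.compile(r"\b(?:subsp|var|pv|sp)\.")


def suspicious_species(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (primary accession, species text) for entries whose OS
    line(s) contain one of the assumed markers."""
    accession: Optional[str] = None
    os_parts: List[str] = []
    for line in lines:
        # Swiss-Prot lines: two-letter code, three spaces, then the data.
        code, data = line[:2], line[5:].strip()
        if code == "AC" and accession is None:
            # The first accession on the first AC line is the primary one.
            accession = data.split(";")[0].strip()
        elif code == "OS":
            # OS may continue over several lines; collect them all.
            os_parts.append(data)
        elif code == "//":
            # End of entry: test the joined species text, then reset.
            species = " ".join(os_parts)
            if accession and SUSPECT.search(species):
                yield accession, species
            accession, os_parts = None, []
```

Called on an open file handle for uniprot_sprot.dat, this reads the 
file line by line, so memory use stays small even for large releases.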

I found a thread on the same problem from ~ two years ago 
(http://thread.gmane.org/gmane.comp.lang.perl.bio.general/13766/focus=13788), 
where the solution recommended was to update bioperl, so I was quite 
surprised to find the problem with the version you can see above 
(1.5.2_102 bioperl core, 1.5.2_100 bioperl-db). Can someone give me any 
hints as to what is going wrong here?

The only workaround we have found so far was to comment out line 174 in 
Species.pm:

$self->throw("The supplied lineage does not start near '$name' (I was supplied '".join(" | ", @vals)."')");

After doing so, load_seqdatabase.pl runs for several hours, but 
proceeds really slowly, until it eventually crashes (I haven't found out 
why yet). I also found some info on this for Pg and Oracle on the 
mailing list, but does anyone have approximate numbers for MySQL? How 
long should a first Swissprot load take?
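
One thing we could try in order to narrow down the crash and to time 
the load is to split the flat file into smaller pieces at the "//" 
record terminators and load them one by one. A minimal Python sketch of 
the splitting step follows; the function name chunk_records and the 
chunk size are hypothetical choices of mine, and writing the chunks to 
files and feeding them to load_seqdatabase.pl is left out.

```python
from typing import Iterable, Iterator, List


def chunk_records(lines: Iterable[str],
                  records_per_chunk: int) -> Iterator[List[str]]:
    """Group flat-file lines into chunks of whole records.

    A record is taken to end at a line starting with "//", as in
    Swiss-Prot flat files. Records are never split across chunks.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk: List[str] = []
    count = 0
    for line in lines:
        chunk.append(line)
        if line.startswith("//"):
            count += 1
            if count == records_per_chunk:
                yield chunk
                chunk, count = [], 0
    if chunk:
        # Trailing records (or an unterminated last record).
        yield chunk
```

Each chunk can then be written to its own file, so that a failing load 
points to a small set of records rather than to the whole release.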

Would be grateful to hear about your ideas / experiences on these issues!

Bank Beszteri

Bioinformatics / Scientific Computing
Alfred Wegener Institute for Polar and Marine Research
Am Handelshafen 12.
27570 Bremerhaven
Germany