[Bioperl-l] Indexing large databases / BioSQL

Bánk Beszteri Bank.Beszteri at awi.de
Mon Apr 28 12:18:20 UTC 2008


Dear BioSQL / bioperl-db-ists,

I would like to share my experiences with trying to load uniprot_trembl 
into a BioSQL db, and also to ask a couple of questions; perhaps some of 
you have seen the problems I encountered. I used bioperl-live and 
bioperl-db-live as of 2008-04-03 and uniprot_trembl.dat as of 
2008-04-04. The command looked like this:

load_seqdatabase.pl --safe --logchunk 1000 --host dbserv --dbname abc 
--dbuser efg --dbpass xyz --driver mysql --namespace uniprot_trembl 
--format embl uniprot_trembl.dat

although I split the dat file into 10 chunks and ran them in parallel 
to make it faster. This did not go quite as smoothly as Swissprot did. 
In the end, it seems to have loaded 5022284 of the 5443284 entries that 
appear to be in the input file (counted with grep -c "ID   ").
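
For anyone who wants to repeat the count and the split, here is a minimal
sketch in Python (the loader itself is Perl; Python is used only for
illustration). It assumes that each record starts with a line beginning
with "ID   " and ends with a "//" line, as in EMBL-style flat files. The
function names count_records and split_records are made up for this
example. Note that grep -c "ID   " without a ^ anchor also counts lines
that merely contain that string, so anchoring at the line start may give
a slightly different number.

```python
def count_records(lines):
    """Count records by their header line, like grep -c '^ID   '."""
    return sum(1 for line in lines if line.startswith("ID   "))


def split_records(lines, n_chunks):
    """Distribute whole records round-robin over n_chunks lists of lines.

    A record is everything up to and including its '//' terminator line,
    so no record is ever cut in half between two chunks.
    """
    chunks = [[] for _ in range(n_chunks)]
    record, index = [], 0
    for line in lines:
        record.append(line)
        if line.rstrip("\n") == "//":
            chunks[index % n_chunks].extend(record)
            record, index = [], index + 1
    if record:  # keep trailing lines that lack a terminator
        chunks[index % n_chunks].extend(record)
    return chunks


# Toy input with three records; the second contains "ID   " mid-line,
# which an unanchored grep would count as a fourth hit.
sample = [
    "ID   X1\n", "SQ   ...\n", "//\n",
    "ID   X2\n", "CC   see ID   X1\n", "//\n",
    "ID   X3\n", "//\n",
]
chunks = split_records(sample, 2)
print(count_records(sample), [count_records(c) for c in chunks])
```

For a file the size of uniprot_trembl.dat one would stream the lines and
write each chunk to its own file instead of holding lists in memory.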

Besides the harmless taxonomy warnings which also appear with Swissprot 
(and which were discussed here a couple of weeks ago and also earlier), 
there were a few more serious errors. Perhaps some of you know them 
already:

First of all, the error below seems to lead to a crash, in spite of --safe:

 >>>
------------- EXCEPTION -------------
MSG: A1XDT7 seems to have an invalid species classification.
STACK Bio::SeqIO::embl::_read_EMBL_Species 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-live/Bio/SeqIO/embl.pm:1087
STACK Bio::SeqIO::embl::next_seq 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-live/Bio/SeqIO/embl.pm:320
STACK toplevel 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:634
-------------------------------------

Command exited with non-zero status 255
<<<

This concerns NCBI Tax_ID:435 (Acetobacter aceti; it also has some 30 
synonyms in my DB), which looks like a completely normal taxon to me: I 
could follow its taxonomy up to the root in the NCBI taxonomy loaded in 
my BioSQL DB. Has anyone else seen this or can reproduce it, or should I 
suspect a problem with my taxonomy db? Also, is load_seqdatabase.pl 
expected to die on this error even with --safe?

###################

The other problems did not lead to a crash, only to a failure to load 
the sequence, which is what I'd expect with --safe. The first type of 
error looks like this:

 >>>
Could not store Q49I36:
------------- EXCEPTION -------------
MSG: Unique key query in Bio::DB::BioSQL::SpeciesAdaptor returned 2 rows 
instead of 1. Query was [name_class="scientific 
name",binomial="Onchocerca volvulus"]
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:958
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:854
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:182
STACK Bio::DB::Persistent::PersistentObject::create 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/Persistent/PersistentObject.pm:244
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:630
STACK toplevel 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:612
-------------------------------------
<<<

In this particular case, "Onchocerca volvulus" does indeed have two 
taxon_ids in my DB (6282 and 563188, of which only the first is 
returned by a web search at NCBI taxonomy); but the same thing happened 
with a number of other taxa (each followed by the number of times it 
caused the above error):

Wolbachia pipientis           64
Hemerocallis sp.              1
Hypsiglena torquata           3
Salmonella enterica           1211
Burkholderia sp.              31
Streptococcus sp.             4
Rhizobium sp.                 600
Nostoc sp.                    19
Drosophila sp.                18
Onchocerca volvulus           62
Atlapetes schistaceus         4
Symbiodinium sp.              3
Escherichia coli              7421
Hieraaetus fasciatus          4
Borrelia burgdorferi group    1
Pseudomonas sp.               29
Rotavirus A                   1076
Gorilla gorilla               746
Rana plancyi                  14
unclassified sequences        1

(This should be 11312 cases altogether, but the list might be incomplete 
because I accidentally removed one of my logs, which contained STDOUT 
and STDERR for about 10 % of the entries.)

Again, is this a known problem for some of you, or could there be a 
problem with my copy of the NCBI taxonomy? I don't remember having 
updated it after the initial upload, so I'm quite surprised by such 
duplicate entries.
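
One way to check a taxonomy copy for such duplicates is a GROUP BY query
on the taxon names. The sketch below is only an illustration: it uses a
toy in-memory SQLite table from Python instead of the real MySQL
database, and it assumes a taxon_name table with the columns taxon_id,
name and name_class, which is how I understand the BioSQL schema; please
check the names against your own schema version. The taxon_id values
are invented toy keys, not NCBI IDs, and the synonym row is a
placeholder.

```python
import sqlite3

# Scientific names that are attached to more than one taxon.
DUPLICATE_NAMES_SQL = """
SELECT name, COUNT(DISTINCT taxon_id) AS n_taxa
FROM taxon_name
WHERE name_class = 'scientific name'
GROUP BY name
HAVING COUNT(DISTINCT taxon_id) > 1
ORDER BY name
"""


def find_duplicate_scientific_names(conn):
    """Return (name, number_of_taxa) rows for ambiguous scientific names."""
    return conn.execute(DUPLICATE_NAMES_SQL).fetchall()


def demo():
    """Build a toy taxon_name table and run the duplicate check on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT)"
    )
    conn.executemany(
        "INSERT INTO taxon_name VALUES (?, ?, ?)",
        [
            (1, "Onchocerca volvulus", "scientific name"),
            (2, "Onchocerca volvulus", "scientific name"),
            (3, "Acetobacter aceti", "scientific name"),
            (3, "toy synonym", "synonym"),
        ],
    )
    rows = find_duplicate_scientific_names(conn)
    conn.close()
    return rows


print(demo())
```

The SELECT statement itself is plain SQL and should run unchanged in
the mysql client if the table and column names match.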

###################

Second type of error without a crash:

 >>>
Could not store A5HU09:
------------- EXCEPTION -------------
MSG: create: object (Bio::Species) failed to insert or to be found by 
unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::Persistent::PersistentObject::create 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/Persistent/PersistentObject.pm:244
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:630
STACK toplevel 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:612
<<<

This particular record has the NCBI_TaxID 44271, which looks completely 
normal in the NCBI taxonomy loaded in my BioSQL DB, but the same problem 
appeared in 53 further cases (I have not yet been able to look into them 
in detail to see whether they all involve the same species). On the other 
hand, 7 records which were successfully loaded have this taxonomy ID 
(44271) in the DB.

###################

Third type of error without a crash:

 >>>
Could not store Q6T859: Unmatched ( in regex; marked by <-- HERE in 
m/Camelina microcarpa (Littlepod false flax) ( <-- HERE microcarpa 
subsp.\s+/ at 
/home/biocl/bbeszter/lib/bioperl-live/bioperl-live/Bio/Species.pm line 
466, <GEN0> line 357048.
<<<

This happens in the sub binomial in Species.pm when it is called with 
the option "FULL", which asks for the subspecies to be returned as well. 
I have not looked much deeper into this yet, but is it possible that 
there is a parsing problem with multi-line species strings? In the above 
case the OS field in uniprot_trembl.dat looks like this:

OS   Camelina microcarpa (Littlepod false flax) (Camelina microcarpa subsp.
OS   sylvestris).
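
The error message suggests that text taken from the OS line ends up
inside a regular expression as pattern syntax rather than as literal
text, so an unbalanced "(" in the name breaks compilation. In Perl the
usual remedy for this kind of problem is quotemeta or \Q...\E. The
sketch below reproduces the failure mode in Python, where re.escape
plays the analogous role. It is not the code from Species.pm, whether
that module should be changed this way is an assumption on my part, and
the helper names are invented.

```python
import re


def compile_name_pattern(name, escape=True):
    """Build a pattern that matches `name` followed by whitespace.

    With escape=False the name is used as raw pattern syntax, which is
    what appears to happen in the failing case.
    """
    literal = re.escape(name) if escape else name
    return re.compile(literal + r"\s+")


def unescaped_fails(name):
    """Return True if the unescaped name is not a valid pattern."""
    try:
        compile_name_pattern(name, escape=False)
    except re.error:
        return True
    return False


# First OS line from the record above, without the "OS   " prefix;
# its second "(" is only closed on the continuation line.
fragment = "Camelina microcarpa (Littlepod false flax) (Camelina microcarpa subsp."

print(unescaped_fails(fragment))
print(bool(compile_name_pattern(fragment).match(fragment + " sylvestris).")))
```

Escaping treats the parentheses as ordinary characters, so the pattern
compiles whether or not the brackets in the name are balanced.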

###################

I'm still trying to find out where the remaining records went: of the 
421000 records not showing up in the DB, I could account for these:

crasher (Tax_ID=435):   45 entries
problem 1 ("MSG: Unique key query in Bio::DB::BioSQL::SpeciesAdaptor 
returned 2 rows instead of 1."): 11312 entries
problem 2 ("MSG: create: object (Bio::Species) failed to insert or to be 
found by unique key"): 54 entries
problem 3 ("Unmatched ( in regex"): 28241 entries

381348 still remain. These could in principle come from the first 
10 %, for which I don't have the output, but they don't seem to: after 
restarting that chunk, I get about 30 "Could not store" errors.
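
For reference, the bookkeeping behind these numbers can be written out
as a few lines of Python. All figures are the ones quoted in this
message; only the variable names are new.

```python
total_in_file = 5443284   # entries counted in the input file
loaded = 5022284          # entries that ended up in the DB

# Failures that could be matched to a specific error message.
explained = {
    "crasher (Tax_ID=435)": 45,
    "problem 1 (unique key query returned 2 rows)": 11312,
    "problem 2 (failed to insert or to be found)": 54,
    "problem 3 (unmatched ( in regex)": 28241,
}

missing = total_in_file - loaded
unexplained = missing - sum(explained.values())
print(missing, sum(explained.values()), unexplained)
```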

So the last question: are there any error messages I should expect which 
don't contain "Could not store" and which I therefore missed here?
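
One way to look for such messages in the surviving logs is to collect
every exception block that is not introduced by a "Could not store"
line and tally the MSG lines. The following Python sketch assumes the
log layout shown in the excerpts above (an "EXCEPTION" banner followed
by a "MSG:" line). The accession pattern is a simplified guess at the
six-character UniProt format, and the function name is made up. Errors
printed in any other layout would not be caught.

```python
import re
from collections import Counter

# Simplified six-character accession, e.g. Q49I36 or A5HU09 (assumption).
ACCESSION = re.compile(r"\b[A-Z][0-9][A-Z0-9]{3}[0-9]\b")


def exceptions_outside_store_errors(lines):
    """Tally MSG lines of exception blocks not preceded by 'Could not store'."""
    found = Counter()
    previous = ""          # last non-blank line seen
    unlabelled = False     # True while inside a block with no store error
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-") and "EXCEPTION" in line:
            unlabelled = not previous.startswith("Could not store")
        elif line.startswith("MSG:") and unlabelled:
            # Replace the accession so identical errors are grouped.
            found[ACCESSION.sub("<ACC>", line)] += 1
            unlabelled = False
        previous = line
    return found


log = [
    "Could not store Q49I36:",
    "------------- EXCEPTION -------------",
    "MSG: Unique key query in Bio::DB::BioSQL::SpeciesAdaptor returned 2 rows",
    "-------------------------------------",
    "------------- EXCEPTION -------------",
    "MSG: A1XDT7 seems to have an invalid species classification.",
    "-------------------------------------",
]
for message, count in exceptions_outside_store_errors(log).items():
    print(count, message)
```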


Bank Beszteri



Bioinformatics
Alfred Wegener Institute for Polar and Marine Research
Am Handelshafen 12
27570 Bremerhaven


