[Bioperl-l] Indexing large databases / BioSQL
Bánk Beszteri
Bank.Beszteri at awi.de
Mon Apr 28 12:18:20 UTC 2008
Dear BioSQL / bioperl-db-ists,
I would like to share my experiences with trying to load uniprot_trembl
into a BioSQL db, and also to ask a couple of questions; perhaps some of
you know the problems I encountered. I used bioperl-live and
bioperl-db-live as of 2008-04-03 and uniprot_trembl.dat as of
2008-04-04. The command was along the lines of
load_seqdatabase.pl --safe --logchunk 1000 --host dbserv --dbname abc
--dbuser efg --dbpass xyz --driver mysql --namespace uniprot_trembl
--format embl uniprot_trembl.dat
although I split the dat file into 10 chunks and ran them in parallel
to make it faster. This did not go quite as smoothly as the Swissprot load did.
In the end, it seems to have loaded 5022284 entries of the 5443284 which
appear to be there in the input file (when counting with grep -c "ID ").
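One caveat on that count: grep -c "ID " counts every line that contains "ID " anywhere, whereas each entry in the flat file starts with an ID line at the beginning of a line, so an anchored match (grep -c "^ID ") should be the safer count. A minimal Python sketch of the difference follows; the sample lines are made up and the function names are only for illustration.

```python
def count_entries(lines):
    """Count lines that begin an entry: 'ID' at column 1 followed by a space."""
    return sum(1 for line in lines if line.startswith("ID "))


def count_substring_hits(lines):
    """Mimic an unanchored grep -c "ID ": count lines containing 'ID ' anywhere."""
    return sum(1 for line in lines if "ID " in line)


# Made-up sample lines, NOT taken from uniprot_trembl.dat.
sample = [
    "ID   FAKE1_TEST   Unreviewed;   10 AA.",
    "DE   Hypothetical protein with ID tag.",
    "//",
    "ID   FAKE2_TEST   Unreviewed;   12 AA.",
    "//",
]

print(count_entries(sample))         # 2
print(count_substring_hits(sample))  # 3
```

For a real file, an open file handle can be passed in place of the list, since both functions only iterate over lines.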
Besides the harmless taxonomy warnings which also appear with Swissprot
(and which were discussed here a couple of weeks ago and also
earlier), there were a few more serious errors. Perhaps some of
you know them already:
First of all, the error below seems to lead to a crash, in spite of --safe:
>>>
------------- EXCEPTION -------------
MSG: A1XDT7 seems to have an invalid species classification.
STACK Bio::SeqIO::embl::_read_EMBL_Species
/home/biocl/bbeszter/lib/bioperl-live/bioperl-live/Bio/SeqIO/embl.pm:1087
STACK Bio::SeqIO::embl::next_seq
/home/biocl/bbeszter/lib/bioperl-live/bioperl-live/Bio/SeqIO/embl.pm:320
STACK toplevel
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:634
-------------------------------------
Command exited with non-zero status 255
<<<
The record in question has NCBI Tax_ID 435 (Acetobacter aceti; it also has
some 30 synonyms in my DB), which, to me, looks like a completely normal
taxon: I could follow its lineage up to the root in the NCBI taxonomy
loaded in the BioSQL DB I used. I don't know whether someone else has seen
or can reproduce the problem, or whether I should suspect a problem with my
taxonomy DB. Besides, is it expected behaviour for
load_seqdatabase.pl to die on this error?
###################
The other problems did not lead to a crash, only to a failure to load
the sequence, which is what I'd expect with --safe. The first type
of error looks like
>>>
Could not store Q49I36:
------------- EXCEPTION -------------
MSG: Unique key query in Bio::DB::BioSQL::SpeciesAdaptor returned 2 rows
instead of 1. Query was [name_class="scientific
name",binomial="Onchocerca volvulus"]
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:958
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:854
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:182
STACK Bio::DB::Persistent::PersistentObject::create
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/Persistent/PersistentObject.pm:244
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval)
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:630
STACK toplevel
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:612
-------------------------------------
<<<
In this particular case, "Onchocerca volvulus" does indeed have two
taxon_ids in my DB (6282 and 563188, of which only the first is
returned by a web search at NCBI taxonomy); but the same thing happened
with a number of other taxa (each followed by the number of times it
caused the above error):
Wolbachia pipientis 64
Hemerocallis sp. 1
Hypsiglena torquata 3
Salmonella enterica 1211
Burkholderia sp. 31
Streptococcus sp. 4
Rhizobium sp. 600
Nostoc sp. 19
Drosophila sp. 18
Onchocerca volvulus 62
Atlapetes schistaceus 4
Symbiodinium sp. 3
Escherichia coli 7421
Hieraaetus fasciatus 4
Borrelia burgdorferi group 1
Pseudomonas sp. 29
Rotavirus A 1076
Gorilla gorilla 746
Rana plancyi 14
unclassified sequences 1
(This should be 11312 cases altogether, but the list might be incomplete
because I accidentally removed one of my logs, which contained STDOUT
& STDERR for ~10% of the entries.)
Again, is this a known problem for some of you, or could there be a
problem with my copy of the NCBI taxonomy? I don't remember having updated
it after the initial upload, so I'm quite surprised by such duplicate
entries.
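One way to check the taxonomy tables for such duplicates is to list every scientific name that is attached to more than one taxon. The sketch below is only an illustration: it uses an in-memory SQLite database with a cut-down version of the BioSQL taxon and taxon_name tables (the real DB here is MySQL), the internal taxon_id keys 1 to 3 are made up, and it assumes that the two IDs quoted above are NCBI taxon IDs. The function name is hypothetical.

```python
import sqlite3

# Scientific names that occur for more than one taxon, with their NCBI IDs.
DUPLICATE_NAMES_SQL = """
SELECT tn.name, t.ncbi_taxon_id
FROM taxon_name AS tn
JOIN taxon AS t ON t.taxon_id = tn.taxon_id
WHERE tn.name_class = 'scientific name'
  AND tn.name IN (
      SELECT name
      FROM taxon_name
      WHERE name_class = 'scientific name'
      GROUP BY name
      HAVING COUNT(*) > 1
  )
ORDER BY tn.name, t.ncbi_taxon_id
"""


def find_duplicate_scientific_names(conn):
    """Return (name, ncbi_taxon_id) rows for names shared by several taxa."""
    return conn.execute(DUPLICATE_NAMES_SQL).fetchall()


# Toy stand-in for the real tables; only the columns used by the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER);
CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT);
""")
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?)",
    [(1, 435), (2, 6282), (3, 563188)],  # internal keys 1-3 are invented
)
conn.executemany(
    "INSERT INTO taxon_name VALUES (?, ?, ?)",
    [
        (1, "Acetobacter aceti", "scientific name"),
        (2, "Onchocerca volvulus", "scientific name"),
        (3, "Onchocerca volvulus", "scientific name"),
    ],
)

duplicates = find_duplicate_scientific_names(conn)
for name, ncbi_id in duplicates:
    print(name, ncbi_id)
```

The SELECT itself uses only standard SQL, so it should also be usable against the MySQL instance, but I have not verified that here.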
###################
Error type 2, without a crash:
>>>
Could not store A5HU09:
------------- EXCEPTION -------------
MSG: create: object (Bio::Species) failed to insert or to be found by
unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::Persistent::PersistentObject::create
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/Persistent/PersistentObject.pm:244
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval)
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:630
STACK toplevel
/home/biocl/bbeszter/lib/bioperl-live/bioperl-db/scripts/biosql/load_seqdatabase.pl:612
<<<
This particular record has NCBI_TaxID 44271, which looks completely
normal in the NCBI taxonomy loaded in my BioSQL DB, but the same problem
appeared in 53 further cases (I have not yet been able to check in detail
whether they all belong to the same species). On the other hand, 7
records which were successfully loaded have this taxonomy ID (44271)
in the DB.
###################
Error type 3, without a crash:
>>>
Could not store Q6T859: Unmatched ( in regex; marked by <-- HERE in
m/Camelina microcarpa (Littlepod false flax) ( <-- HERE microcarpa
subsp.\s+/ at
/home/biocl/bbeszter/lib/bioperl-live/bioperl-live/Bio/Species.pm line
466, <GEN0> line 357048.
<<<
This happens in the binomial sub of Species.pm when it is called with the
option "FULL", which requests that the subspecies be returned as well. I
have not looked much deeper into this yet, but could there be a parsing
problem with multi-line species strings? In the above case, the OS field in
uniprot_trembl.dat looks like
OS Camelina microcarpa (Littlepod false flax) (Camelina microcarpa subsp.
OS sylvestris).
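The message itself suggests that name text containing parentheses ends up inside a regular expression without being escaped, so the second "(" is read as a group that is never closed. The following Python sketch reproduces that class of failure and the usual remedy (escaping the text before building the pattern; the Perl counterpart would be quotemeta or \Q...\E). It is not the Species.pm code, and strip_leading is a hypothetical helper.

```python
import re

# Pattern text as quoted in the error message above, with the "<-- HERE"
# marker and the trailing \s+ removed. The second "(" is never closed.
name_fragment = "Camelina microcarpa (Littlepod false flax) (microcarpa subsp."


def strip_leading(fragment, text, escape=True):
    """Remove `fragment` plus following whitespace from the start of `text`."""
    literal = re.escape(fragment) if escape else fragment
    return re.sub("^" + literal + r"\s+", "", text)


text = name_fragment + " sylvestris"

try:
    strip_leading(name_fragment, text, escape=False)
    unescaped_ok = True
except re.error:
    # The raw "(" starts a group that never ends, so compilation fails.
    unescaped_ok = False

print(unescaped_ok)                        # False
print(strip_leading(name_fragment, text))  # sylvestris
```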
###################
I'm still looking for where the remaining records disappeared: of the
421000 records not showing up in the DB, I could account for these:
crasher (Tax_ID=435): 45 entries
problem 1 ("MSG: Unique key query in Bio::DB::BioSQL::SpeciesAdaptor
returned 2 rows instead of 1."): 11312 entries
problem 2 ("MSG: create: object (Bio::Species) failed to insert or to be
found by unique key"): 54 entries
problem 3 ("Unmatched ( in regex"): 28241 entries
381348 still remain... These could in principle come from the
first 10%, for which I don't have the output, but they don't seem to:
after restarting that chunk, I get only ~30 "Could not store" errors.
So the last question: are there any error messages I can expect which
don't contain "Could not store" and which I thus missed here?
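To look for such messages in the logs, one option is to count exception blocks that are not introduced by a "Could not store" line, as with the crash reported first. This is a rough sketch, assuming the log layout matches the excerpts quoted in this mail; tally_log and the counter keys are hypothetical names, and messages in any other format would not be caught.

```python
from collections import Counter


def tally_log(lines):
    """Count 'Could not store' lines and exception blocks not preceded by one."""
    counts = Counter()
    previous = ""
    for raw in lines:
        line = raw.strip()
        if line.startswith("Could not store"):
            counts["could_not_store"] += 1
        elif line.startswith("-") and "EXCEPTION" in line:
            # An exception header directly after "Could not store" belongs to
            # that failure; anything else was reported outside that wrapper.
            if not previous.startswith("Could not store"):
                counts["other_exception"] += 1
        if line:
            previous = line
    return counts


# Abbreviated lines taken from the excerpts above.
sample_log = [
    "Could not store Q49I36:",
    "------------- EXCEPTION -------------",
    "MSG: Unique key query in Bio::DB::BioSQL::SpeciesAdaptor returned 2 rows",
    "-------------------------------------",
    "------------- EXCEPTION -------------",
    "MSG: A1XDT7 seems to have an invalid species classification.",
    "-------------------------------------",
    "Could not store Q6T859: Unmatched ( in regex; marked by <-- HERE in",
]

result = tally_log(sample_log)
print(result["could_not_store"], result["other_exception"])  # 2 1
```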
Bank Beszteri
Bioinformatics
Alfred Wegener Institute for Polar and Marine Research
Am Handelshafen 12
27570 Bremerhaven