[Bioperl-l] Indexing large databases / BioSQL
Bánk Beszteri
Bank.Beszteri at awi.de
Mon Apr 7 11:46:43 UTC 2008
Hi Hilmar,
it was important to understand that the inconsistency in taxon names is
apparently only between the Swissprot entries with "non-standard" names
and the contents of the taxonomy tables and that it is best to use a
pre-loaded taxonomy, thanks for that! We have now updated to
bioperl-live (and bp-db-live, too) and load_seqdatabase.pl seems to have
loaded everything OK in ~26 hours (with many of the "The supplied
lineage does not start near..." warnings, but no other problems). Our
next test is to load TrEMBL (we will try to do this in parallel, in
multiple chunks); I hope it will work just as nicely!
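A minimal sketch of the chunking step, assuming the TrEMBL flat file uses
the same "//" record terminator line as Swiss-Prot. The function name, the
chunk size and the output file naming are made up for illustration; none of
this is part of bioperl-db.

```python
def split_flatfile(path, records_per_chunk, out_prefix):
    """Split a Swiss-Prot style flat file into chunk files of whole records.

    A record is assumed to end with a line that contains only "//".
    Returns the list of chunk file paths in the order they were written.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")

    chunk_paths = []
    out = None
    records_in_chunk = 0

    # Binary mode avoids any assumption about the file's text encoding.
    with open(path, "rb") as src:
        for line in src:
            if out is None:
                # Open the next chunk lazily so no empty trailing file appears.
                chunk_path = f"{out_prefix}.{len(chunk_paths) + 1:04d}.dat"
                out = open(chunk_path, "wb")
                chunk_paths.append(chunk_path)
                records_in_chunk = 0
            out.write(line)
            if line.rstrip(b"\r\n") == b"//":
                records_in_chunk += 1
                if records_in_chunk == records_per_chunk:
                    out.close()
                    out = None

    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk file could then be passed to its own load_seqdatabase.pl call
with --format swiss. How well several loaders run against the same BioSQL
database at once is something we still have to test.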
Thanks for your tips & insights!
Bank
Hilmar Lapp wrote:
>
> On Apr 1, 2008, at 8:31 AM, Bánk Beszteri wrote:
>
>> [...] So next we started to test BioSQL, by trying to load just
>> Swissprot in a MySQL DB first, like:
>>
>> load_seqdatabase.pl --host mysql.awi.de --dbname biosql2 --dbuser
>> xyz --dbpass abc --driver mysql --namespace uniprot_sprot --format
>> swiss uniprot_sprot.dat
>>
>> Here we get an error message
>>
>> ###########################################
>>
>> Loading /biodb/spinkern/uniprot_sprot.dat ...
>> Could not store Q6DAH5:
>> ------------- EXCEPTION: Bio::Root::Exception -------------
>> MSG: The supplied lineage does not start near 'Erwinia carotovora subsp. atroseptica' (I was supplied 'Erwinia carotovora subsp. | Pectobacterium | Enterobacteriaceae | Enterobacteriales | Gammaproteobacteria | Proteobacteria | Bacteria')
>> STACK: Error::throw
>> STACK: Bio::Root::Root::throw /biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Root/Root.pm:359
>> STACK: Bio::Species::classification /biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Species.pm:174
>> STACK: Bio::DB::Persistent::PersistentObject::AUTOLOAD /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:552
>> STACK: Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SpeciesAdaptor.pm:281
>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1305
>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:973
>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:852
>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:182
>> STACK: Bio::DB::Persistent::PersistentObject::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:244
>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
>> STACK: Bio::DB::Persistent::PersistentObject::store /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
>> STACK: load_seqdatabase.pl:622
>> -----------------------------------------------------------
>>
>> at load_seqdatabase.pl line 635
>>
>> ############################################
>>
>> or similar, depending on whether we use a pre-loaded ncbi taxonomy
>> or not
>
>
> I recommend always using a pre-loaded NCBI taxonomy unless you know
> there are only a few organisms that are straightforward (for the
> parser, that is).
>
>> , and which Swissprot release we are trying to load. It often seems
>> to come from something like, as here, a "subsp." or other special
>> addition to the species line, but alternative genus names and other
>> curious things also appear. It looks like Species.pm tries to
>> validate the species name against the lineage info already in the
>> BioSQL DB, and in several cases it finds inconsistencies.
>
>
> It actually happens upon a successful lookup when the species object
> is populated from the database.
>
>> [...]
>> The only workaround we have found so far was to comment out line 174
>> in Species.pm:
>>
>> $self->throw("The supplied lineage does not start near '$name' (I
>> was supplied '".join(" | ", @vals)."')");
>
>
> That should be OK if you work with a pre-loaded taxonomy. It's sort
> of a sanity check that should catch a parser having messed up a
> species. If you use a pre-loaded NCBI taxonomy the results of the
> species parsing don't matter in all details so long as the NCBI
> taxonID is parsed out correctly, and then found in the database.
>
> Note that this is actually a warn() in the main trunk version of
> BioPerl, so you might want to upgrade to that (or change throw() to
> warn() in your version). You still get the records flagged with that,
> but it isn't an exception.
>
>>
>> After doing so, load_seqdatabase.pl runs for several hours (until it
>> eventually crashes; I haven't found out yet why), but proceeds really
>> slowly.
>
>
> It should certainly *not* crash. Note also that you can supply --safe
> on the command line, in which case the script will continue with the
> next record if one fails to load for whatever reason.
>
> You will want to adjust the width constraint of dbxref.accession, for
> example to 128 chars. This will also be fixed for BioSQL 1.0.1.
> See http://bugzilla.open-bio.org/show_bug.cgi?id=2474
>
>
>> I also found some info on this for Pg and Oracle in the mailing
>> list, but does anyone have approximate numbers for MySQL: how long
>> should a first Swissprot load take?
>
>
> Possibly around 20 hours according to Erik Rijkers:
> See http://lists.open-bio.org/pipermail/bioperl-l/2008-March/027427.html
>
> You can use the --logchunks N option to have it print out performance
> statistics every N records.
>
> Hope this helps,
>
> -hilmar