[Bioperl-l] Indexing large databases / BioSQL
Chris Fields
cjfields at uiuc.edu
Mon Apr 7 12:32:45 UTC 2008
The warnings are something that we still need to resolve, but the only
fix I can think of likely breaks backward compatibility with older
bioperl-db installations (i.e. storing the given scientific name
instead of the binomial name, which is used as a fallback when no
taxid is found). There is a full explanation here:
http://bugzilla.open-bio.org/show_bug.cgi?id=2092
Anyway, I think it needs further testing when someone, likely Hilmar
or I, has time.
chris
On Apr 7, 2008, at 6:46 AM, Bánk Beszteri wrote:
> Hi Hilmar,
>
> it was important to understand that the inconsistency in taxon names
> is apparently only between the Swissprot entries with "non-standard"
> names and the contents of the taxonomy tables and that it is best to
> use a pre-loaded taxonomy, thanks for that! We have now updated to
> bioperl-live (and bp-db-live, too) and load_seqdatabase.pl seems to
> have loaded everything OK in ~26 hours (with many of the "The
> supplied lineage does not start near..." warnings, but no other
> problems). Our next test is to try to load trembl (will try to do
> this in parallel in multiple chunks), hope it will work just as
> nicely!
>
> Thanks for your tips & insights!
>
> Bank
>
> Hilmar Lapp wrote:
>
>>
>> On Apr 1, 2008, at 8:31 AM, Bánk Beszteri wrote:
>>
>>> [...] So next we started to test BioSQL, by trying to load just
>>> Swissprot in a MySQL DB first, like:
>>>
>>> load_seqdatabase.pl --host mysql.awi.de --dbname biosql2 --dbuser
>>> xyz --dbpass abc --driver mysql --namespace uniprot_sprot
>>> --format swiss uniprot_sprot.dat
>>>
>>> Here we get an error message
>>>
>>> ###########################################
>>>
>>> Loading /biodb/spinkern/uniprot_sprot.dat ...
>>> Could not store Q6DAH5:
>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>> MSG: The supplied lineage does not start near 'Erwinia carotovora
>>> subsp. atroseptica' (I was supplied 'Erwinia carotovora subsp. |
>>> Pectobacterium | Enterobacteriaceae | Enterobacteriales |
>>> Gammaproteobacteria | Proteobacteria | Bacteria')
>>> STACK: Error::throw
>>> STACK: Bio::Root::Root::throw /biodb/spinkern/bioperl-1.5/
>>> bioperl-1.5.2_102/Bio/Root/Root.pm:359
>>> STACK: Bio::Species::classification /biodb/spinkern/bioperl-1.5/
>>> bioperl-1.5.2_102/Bio/Species.pm:174
>>> STACK: Bio::DB::Persistent::PersistentObject::AUTOLOAD /biodb/
>>> spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/
>>> PersistentObject.pm:552
>>> STACK: Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row /biodb/
>>> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SpeciesAdaptor.pm:281
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /
>>> biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
>>> BasePersistenceAdaptor.pm:1305
>>> STACK:
>>> Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /
>>> biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
>>> BasePersistenceAdaptor.pm:973
>>> STACK:
>>> Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /
>>> biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
>>> BasePersistenceAdaptor.pm:852
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/
>>> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
>>> BasePersistenceAdaptor.pm:182
>>> STACK: Bio::DB::Persistent::PersistentObject::create /biodb/
>>> spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/
>>> PersistentObject.pm:244
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/
>>> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
>>> BasePersistenceAdaptor.pm:169
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /biodb/
>>> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/
>>> BasePersistenceAdaptor.pm:251
>>> STACK: Bio::DB::Persistent::PersistentObject::store /biodb/
>>> spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/
>>> PersistentObject.pm:271
>>> STACK: load_seqdatabase.pl:622
>>> -----------------------------------------------------------
>>>
>>> at load_seqdatabase.pl line 635
>>>
>>> ############################################
>>>
>>> or similar, depending on whether we use a pre-loaded ncbi
>>> taxonomy or not
>>
>>
>> I recommend always using a pre-loaded NCBI taxonomy unless you
>> know there are only a few organisms that are straightforward (for
>> the parser, that is).
>>
>>> , and which Swissprot release we are trying to load. It often
>>> seems to come from something like, as here, a subsp. or other special
>>> addition to the species line; but alternative genus names and
>>> other curious things also appear. It looks like Species.pm
>>> tries to validate the species name against the lineage info
>>> already there in the BioSQL DB, and in several cases, it finds
>>> inconsistencies.
>>
>>
>> It actually happens upon a successful lookup when the species
>> object is populated from the database.
>>
>>> [...]
>>> The only workaround we have found so far was to comment out line
>>> 174 in Species.pm:
>>>
>>> $self->throw("The supplied lineage does not start near '$name' (I
>>> was supplied '".join(" | ", @vals)."')");
>>
>>
>> That should be OK if you work with a pre-loaded taxonomy. It's
>> sort of a sanity check that should catch a parser having messed up
>> a species. If you use a pre-loaded NCBI taxonomy the results of
>> the species parsing don't matter in all details so long as the
>> NCBI taxonID is parsed out correctly, and then found in the
>> database.
>>
>> Note that this is actually a warn() in the main trunk version of
>> BioPerl, so you might want to upgrade to that (or change throw()
>> to warn() in your version). You still get the records flagged with
>> that, but it isn't an exception.
>>
>>>
>>> After doing so, load_seqdatabase.pl runs for several hours (until
>>> it eventually crashes; I haven't found out yet why), but proceeds
>>> really slowly.
>>
>>
>> It should certainly *not* crash. Note also that you can supply
>> --safe on the command line, in which case the script will continue
>> with the next record if one fails to load for whatever reason.
>>
>> You will want to adjust the width constraint of dbxref.accession,
>> for example to 128 chars. This will also be fixed for BioSQL 1.0.1.
>> See http://bugzilla.open-bio.org/show_bug.cgi?id=2474
>>
>>
>>> I also found some info on this for Pg and Oracle in the mailing
>>> list, but does anyone have approximate numbers for MySQL: how long
>>> should a first Swissprot load take?
>>
>>
>> Possibly around 20 hours according to Erik Rijkers:
>> See http://lists.open-bio.org/pipermail/bioperl-l/2008-March/027427.html
>>
>> You can use the --logchunks N option to have it print out
>> performance statistics every N records.
>>
>> Hope this helps,
>>
>> -hilmar
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
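
For reference, the two options Hilmar mentions in the quoted message
(--safe and --logchunks N) could be combined with the original
invocation roughly as follows. This is only a dry-run sketch: it prints
the command instead of running it. The connection values are the
placeholders from the thread, the flag spellings are taken as written
there and have not been checked against any particular bioperl-db
release, and the build_load_cmd helper and the chunk size of 1000 are
purely illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch: prints a load_seqdatabase.pl command line, does not run it.
# LOGCHUNKS is an arbitrary illustrative value (statistics every N records).
LOGCHUNKS=1000

# build_load_cmd is a hypothetical helper, not part of bioperl-db.
# $1 = flat file to load
build_load_cmd() {
    # Host, database, user and password are the placeholders quoted in
    # the thread; --safe and --logchunks are spelled as in the thread.
    printf '%s ' load_seqdatabase.pl \
        --host mysql.awi.de --dbname biosql2 \
        --dbuser xyz --dbpass abc --driver mysql \
        --namespace uniprot_sprot --format swiss \
        --safe --logchunks "$LOGCHUNKS"
    printf '%s\n' "$1"
}

build_load_cmd uniprot_sprot.dat
```

Printing the command first makes it easy to check the options before
committing to a load that, going by the numbers in this thread, may
run for a day or so.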