[Bioperl-l] Indexing large databases / BioSQL

Mon Apr 7 12:32:45 UTC 2008

The warnings are something that we still need to resolve, but the only  
fix I can think of likely breaks backward compatibility with older  
bioperl-db installations (i.e. storing the given scientific name  
instead of the binomial name, which is used as a fallback when no  
taxid is found).  There is a full explanation here:

http://bugzilla.open-bio.org/show_bug.cgi?id=2092

Anyway, I think it needs further testing when someone, likely Hilmar
or me, has time.

chris

On Apr 7, 2008, at 6:46 AM, Bánk Beszteri wrote:

> Hi Hilmar,
>
> it was important to understand that the inconsistency in taxon names  
> is apparently only between the Swissprot entries with "non-standard"  
> names and the contents of the taxonomy tables and that it is best to  
> use a pre-loaded taxonomy, thanks for that! We have now updated to  
> bioperl-live (and bp-db-live, too) and load_seqdatabase.pl seems to  
> have loaded everything OK in ~26 hours (with many of the "The  
> supplied lineage does not start near..." warnings, but no other  
> problems). Our next test is to try to load trembl (will try to do  
> this in parallel in multiple chunks), hope it will work just as  
> nicely!
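A minimal sketch of how a Swiss-Prot/TrEMBL flat file could be cut into chunks for separate loader runs. It relies only on the fact that each record in this format ends with a line containing just `//`. The function name `split_swiss`, the chunk size, and the output prefix are hypothetical, and whether several loader processes can safely run against the same BioSQL database at once is not settled in this thread.

```shell
#!/bin/sh
# split_swiss FILE N PREFIX
# Writes FILE into PREFIX_0000.dat, PREFIX_0001.dat, ... with N records each.
# A record is everything up to and including a line that is exactly "//".
split_swiss() {
    awk -v n="$2" -v prefix="$3" '
        BEGIN { out = sprintf("%s_%04d.dat", prefix, 0) }
        { print > out }                      # copy every line to the current chunk
        /^\/\/$/ {                           # end of one record
            if (++count % n == 0) {          # chunk is full: switch to the next file
                close(out)
                out = sprintf("%s_%04d.dat", prefix, count / n)
            }
        }
    ' "$1"
}

# Example (file name and chunk size are illustrative only):
#   split_swiss uniprot_trembl.dat 500000 trembl_chunk
```

Each resulting chunk is itself a valid flat file, so it can be passed to load_seqdatabase.pl in place of the full file.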
>
> Thanks for your tips & insights!
>
> Bank
>
> Hilmar Lapp wrote:
>
>>
>> On Apr 1, 2008, at 8:31 AM, Bánk Beszteri wrote:
>>
>>> [...] So next we started to test BioSQL, by trying to load just   
>>> Swissprot in a MySQL DB first, like:
>>>
>>> load_seqdatabase.pl --host mysql.awi.de --dbname biosql2 --dbuser xyz \
>>>   --dbpass abc --driver mysql --namespace uniprot_sprot \
>>>   --format swiss uniprot_sprot.dat
>>>
>>> Here we get an error message
>>>
>>> ###########################################
>>>
>>> Loading /biodb/spinkern/uniprot_sprot.dat ...
>>> Could not store Q6DAH5:
>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>> MSG: The supplied lineage does not start near 'Erwinia carotovora   
>>> subsp. atroseptica' (I was supplied 'Erwinia carotovora subsp. |   
>>> Pectobacterium | Enterobacteriaceae | Enterobacteriales |   
>>> Gammaproteobacteria | Proteobacteria | Bacteria')
>>> STACK: Error::throw
>>> STACK: Bio::Root::Root::throw /biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Root/Root.pm:359
>>> STACK: Bio::Species::classification /biodb/spinkern/bioperl-1.5/bioperl-1.5.2_102/Bio/Species.pm:174
>>> STACK: Bio::DB::Persistent::PersistentObject::AUTOLOAD /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:552
>>> STACK: Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SpeciesAdaptor.pm:281
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1305
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:973
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:852
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:182
>>> STACK: Bio::DB::Persistent::PersistentObject::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:244
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
>>> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
>>> STACK: Bio::DB::Persistent::PersistentObject::store /biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
>>> STACK: load_seqdatabase.pl:622
>>> -----------------------------------------------------------
>>>
>>> at load_seqdatabase.pl line 635
>>>
>>> ############################################
>>>
>>> or similar, depending on whether we use a pre-loaded ncbi  
>>> taxonomy  or not
>>
>>
>> I recommend always using a pre-loaded NCBI taxonomy unless you
>> know there are only a few organisms that are straightforward (for
>> the parser, that is).
>>
>>> , and which Swissprot release we are trying to load. It often
>>> seems to come from something like, as here, a subsp. or other special
>>> addition to the species line, but alternative genus names and
>>> other curious things also appear. It looks like Species.pm
>>> tries to validate the species name against the lineage info
>>> already there in the BioSQL DB, and in several cases it finds
>>> inconsistencies.
>>
>>
>> It actually happens upon a successful lookup when the species  
>> object  is populated from the database.
>>
>>> [...]
>>> The only workaround we have found so far was to comment out line   
>>> 174 in Species.pm:
>>>
>>> $self->throw("The supplied lineage does not start near '$name' (I was supplied '".join(" | ", @vals)."')");
>>
>>
>> That should be OK if you work with a pre-loaded taxonomy. It's  
>> sort  of a sanity check that should catch a parser having messed up  
>> a  species. If you use a pre-loaded NCBI taxonomy the results of  
>> the  species parsing don't matter in all details so long as the  
>> NCBI  taxonID is parsed out correctly, and then found in the  
>> database.
>>
>> Note that this is actually a warn() in the main trunk version of
>> BioPerl, so you might want to upgrade to that (or change throw()
>> to warn() in your version). You still get the records flagged with
>> that, but it isn't an exception.
>>
>>>
>>> After doing so, load_seqdatabase.pl runs for several hours (until
>>> it eventually crashes; I haven't found out yet why), but proceeds
>>> really slowly.
>>
>>
>> It should certainly *not* crash. Note also that you can supply
>> --safe on the command line, in which case the script will continue
>> with the next record if one fails to load for whatever reason.
>>
>> You will want to adjust the width constraint of dbxref.accession,  
>> for  example to 128 chars. This will also be fixed for BioSQL 1.0.1.
>> See http://bugzilla.open-bio.org/show_bug.cgi?id=2474
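A minimal sketch of the column change suggested above, for MySQL. It assumes that dbxref.accession is declared as a NOT NULL VARCHAR ... BINARY column, as in the stock BioSQL MySQL schema; since MODIFY replaces the whole column definition, it is worth checking the current one with `SHOW CREATE TABLE dbxref` first. The host, user and database name are the placeholder values from the load command earlier in the thread, and the `RUN` switch is a hypothetical convenience so that the script only prints the statement by default.

```shell
#!/bin/sh
# Widen dbxref.accession to 128 characters (see bug 2474).
# Assumption: the column is currently VARCHAR(n) BINARY NOT NULL.
SQL='ALTER TABLE dbxref MODIFY accession VARCHAR(128) BINARY NOT NULL;'

if [ "${RUN:-0}" = "1" ]; then
    # --password without a value makes the client prompt for it
    mysql --host=mysql.awi.de --user=xyz --password biosql2 -e "$SQL"
else
    # Dry run: just show the statement that would be executed
    printf '%s\n' "$SQL"
fi
```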
>>
>>
>>> I also found some info on this for Pg and Oracle in the mailing
>>> list, but does anyone have approximate numbers for MySQL? How long
>>> should a first Swissprot load take?
>>
>>
>> Possibly around 20 hours according to Erik Rijkers:
>> See http://lists.open-bio.org/pipermail/bioperl-l/2008-March/027427.html
>>
>> You can use the --logchunks N option to have it print out  
>> performance  statistics every N records.
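A sketch of the original load command with the two options mentioned in this reply added. `--safe` and `--logchunks` are taken from the text above; the value 1000 is an arbitrary illustration, and the connection settings are the placeholders from the original command. The `RUN` switch is hypothetical and only there so the script prints the command unless asked to execute it.

```shell
#!/bin/sh
# Arbitrary example value: report performance statistics every 1000 records
LOGCHUNKS=1000

# Build the argument list; --safe skips records that fail to load
set -- load_seqdatabase.pl \
    --host mysql.awi.de --dbname biosql2 --dbuser xyz --dbpass abc \
    --driver mysql --namespace uniprot_sprot --format swiss \
    --safe --logchunks "$LOGCHUNKS" \
    uniprot_sprot.dat

if [ "${RUN:-0}" = "1" ]; then
    "$@"                      # actually start the load
else
    printf '%s\n' "$*"        # dry run: print the command line
fi
```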
>>
>> Hope this helps,
>>
>>    -hilmar

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign