[BioSQL-l] load_seqdatabase fails when loading refseq plant files

Mike Muratet muratem at eng.uah.edu
Mon Aug 14 16:55:45 UTC 2006



On Fri, 11 Aug 2006, Angel Pizarro wrote:

> Date: Fri, 11 Aug 2006 14:57:35 -0400
> From: Angel Pizarro <angel at mail.med.upenn.edu>
> To: BioSQL <biosql-l at lists.open-bio.org>, Bioperl <bioperl-l at bioperl.org>
> Subject: Re: [BioSQL-l] load_seqdatabase fails when loading refseq plant files
> 
> Glad I am not the only one that ran into this problem! Mike, I had
> reported this issue a few emails back and have provided the list with an
> example file for testing, so it should be resolved soon.
>

I must have missed it. Sorry.

> FYI, you are correct that CRC is computed on load to determine if two
> pub references are in fact the same. This is a feature to save database
> space. The expected behaviour is that subsequent entries with the same
> CRC get an FK to the originating reference entry rather than inserting
> a duplicate row into the reference table.
>
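
The look-up-or-insert behaviour described above can be sketched as follows. This is only an illustration, not the bioperl-db code: the two-table schema is a cut-down stand-in for the real one, the function names are hypothetical, and zlib.crc32 over title plus location is a placeholder assumption for whatever checksum the loader actually computes.

```python
import sqlite3
import zlib


def make_db():
    """Create an in-memory database with a cut-down, illustrative schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE reference (
            reference_id INTEGER PRIMARY KEY,
            title        TEXT,
            location     TEXT,
            crc          TEXT NOT NULL UNIQUE
        );
        CREATE TABLE bioentry_reference (
            accession    TEXT NOT NULL,
            reference_id INTEGER NOT NULL REFERENCES reference(reference_id)
        );
        """
    )
    return db


def reference_crc(title, location):
    """Placeholder checksum (assumption): CRC-32 of title plus location."""
    data = (title + "\x00" + location).encode("utf-8")
    return "CRC-%08X" % zlib.crc32(data)


def store_reference(db, accession, title, location):
    """Link an accession to a reference row, reusing the row if its CRC exists."""
    crc = reference_crc(title, location)
    row = db.execute(
        "SELECT reference_id FROM reference WHERE crc = ?", (crc,)
    ).fetchone()
    if row is None:
        # First time this reference is seen: insert it once.
        cur = db.execute(
            "INSERT INTO reference (title, location, crc) VALUES (?, ?, ?)",
            (title, location, crc),
        )
        ref_id = cur.lastrowid
    else:
        # Same CRC already stored: reuse the existing row instead of inserting.
        ref_id = row[0]
    db.execute(
        "INSERT INTO bioentry_reference (accession, reference_id) VALUES (?, ?)",
        (accession, ref_id),
    )
    return ref_id
```

With this pattern, two records that carry an identical "Direct Submission" reference end up pointing at one row in the reference table, and the unique constraint on crc is never violated.
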
> FYI #2, the --safe option explicitly states that it will continue to
> process records after errors BUT do a roll-back at the end of the run.
> This is to gather all of your errors in one shot, as opposed to fixing a
> record, restarting, hitting the next error, fixing that, etc.
>
> If you are impatient and do not care about references, you have three
> choices.
> 1) drop the unique constraint on reference.crc (this will cause dups in
> reference, and you cannot go back to a unique CRC without a major SQL
> data migration to fix FKs and delete the dups).
>
> 2) filter your records to not contain reference information
>
> 3) alter load_seqdatabase to not enter reference information. The
> references are held in the sequence's Bio::Annotation::Collection object:
>
>   $seq->annotation()->remove_Annotations('reference');
>
> The above command, inserted somewhere around line 575 of the script,
> should do the trick. Obviously this means that no reference information
> is loaded into the DB at all.
>
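
For choice 2 above, a pre-filter over the flat files might look like the sketch below. It assumes the standard GenBank layout, where a REFERENCE block starts in column 1 and its AUTHORS/TITLE/JOURNAL lines are indented. The function names are hypothetical, and whether the loader accepts records stripped this way has not been verified here.

```python
import gzip


def strip_references(lines):
    """Yield GenBank flat-file lines with every REFERENCE block removed.

    A block starts at a line beginning with REFERENCE in column 1 and runs
    until the next line that does not start with whitespace (the next
    top-level keyword or the // record terminator).
    """
    skipping = False
    for line in lines:
        if line.startswith("REFERENCE"):
            skipping = True
        elif skipping and not line[:1].isspace():
            skipping = False
        if not skipping:
            yield line


def filter_gzip(src_path, dst_path):
    """Copy a gzipped GenBank file, dropping the REFERENCE blocks."""
    # latin-1 maps every byte to one character, so the copy is lossless.
    with gzip.open(src_path, "rt", encoding="latin-1") as src, \
         gzip.open(dst_path, "wt", encoding="latin-1") as dst:
        dst.writelines(strip_references(src))
```

The filtered copies can then be passed to the loader in place of the original plant*.rna.gbff.gz files.
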

I do need to get something working, and the references are not critical to 
the application, so I will probably alter load_seqdatabase.

Thanks for the help!

Cheers

Mike


> -angel
>
> On Fri, 2006-08-11 at 11:10 -0500, Mike Muratet wrote:
>> Hello all
>>
>> I am using biosql-schema/bioperl-db to load Refseq entries into a biosql
>> database. I don't see any version info in the files, but I downloaded
>> everything in the last month or so and everything passed all the tests
>> when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1,
>> DBD-mysql-3.006. I was loading plant files from RefSeq release 18:
>>
>> load_seqdatabase.pl  --dbname biosql
>> --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz
>>
>> and it crashed after about 30K of 60K records:
>>
>> at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl
>> line 633
>>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
>> were ("","Direct Submission","Submitted (01-JUL-2004) National Center for
>> Biotechnology Information, National Institutes of Health, Bethesda 20894,
>> United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs
>> (<NULL>)
>> Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3
>> ---------------------------------------------------
>> Could not store XM_472403:
>> ------------- EXCEPTION  -------------
>> MSG: create: object (Bio::Annotation::Reference) failed to insert or to be
>> found by unique key
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
>> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
>> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
>> STACK Bio::DB::Persistent::PersistentObject::store
>> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
>> STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children
>> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
>> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216
>> t
>>
>> I traced the error back through the source and database and found that
>> XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type,
>> but only the last one crashed the script (in spite of --safe).
>>
>> Should there be more info included in the CRC field? I am weak when
>> it comes to RDBMSs, but looking at the schema, I would guess that the CRC field
>> was added to make an otherwise degenerate key unique. Would it help to add
>> more fields to the CRC, or another key? The former might be done without
>> having to change a lot of code.
>>
>> Thanks
>>
>> Mike
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l