[Bioperl-l] loading data into bioperl-db

Hilmar Lapp hlapp at gnf.org
Thu Jun 5 09:49:04 EDT 2003

On Thursday, June 5, 2003, at 03:03  AM, Michael Thon wrote:

> into genbank format using SeqIO, the file looks like:
> LOCUS       NCU10032.1                31 aa            linear   UNK
> DEFINITION  NCU10032.1 hypothetical protein (301 - 1378)
> ACCESSION   unknown

This placeholder accession is what causes the trouble.

> FEATURES             Location/Qualifiers
>         1 mtrqsiqsyr nrglggtrkm flyyffnylg *
> //
> and when I try to load into the database I get errors like:
> -- WARNING --
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
> ("NCU09800.1","","unknown","NCU09800.1 hypothetical protein (5834 -
> 4583)","0","linear") FKs (1,<NULL>)
> Duplicate entry 'unknown-1-0' for key 2
> --
> DBD::mysql::st execute failed: Duplicate entry 'unknown-1-0' for key 2
> at /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BaseDriver.pm line 922,
> <GEN0> line 14741.
> If, during conversion of this file to genbank format, I specify an
> accession number, then the sequences will load. It looks like an
> accession number is required by the database and/or loading script.

Accession is a mandatory field in the schema. The trouble here is that
the conversion assigned 'unknown' as the accession number of every
sequence, so every record after the first one collides with it (hence
the 'Duplicate entry' error).

>   It also looks to me like when sequences are read by 
> Bio::SeqIO::fasta the
> accession number is set to 'unknown'

Seems so. We need to check this. If it's true I'd consider it a bug.
Leaving it undefined wouldn't solve your problem though, because the
schema still requires an accession.

>  Is there ever a case where
> Bio::SeqIO::fasta will parse a sequence header like :
> >gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea
> and read the namespace, accession, version etc from it?

No. Bioperl itself does not interpret the identifier token. People pack
information into it in too many different ways, and it is relatively
simple to apply whatever extraction suits your data in 1 or 2 lines of
perl.
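
As a minimal sketch of such an extraction, assuming the identifier
always follows the gi|<gi>|<db>|<accession>.<version>|<name> layout of
the header quoted above: parse_ncbi_id is a made-up helper name, not a
Bioperl function, and the fallback version of 0 is likewise only an
assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): split an identifier token of
# the form gi|<gi>|<db>|<accession>.<version>|<name> into its parts.
# Returns a hash reference, or nothing if the token has another layout.
sub parse_ncbi_id {
    my ($id) = @_;
    my ($gi, $db, $acc, $version) =
        $id =~ /^gi\|(\d+)\|(\w+)\|([^.|]+)(?:\.(\d+))?\|/
        or return;
    return {
        gi        => $gi,
        namespace => $db,
        accession => $acc,
        # Assumption: treat a missing version suffix as version 0.
        version   => defined $version ? $version : 0,
    };
}

my $info = parse_ncbi_id('gi|30419336|gb|CD037498.1|CD037498');
print "$info->{namespace} $info->{accession} $info->{version}\n" if $info;
# prints "gb CD037498 1"
```

You would then copy the extracted values onto the sequence object
before loading it, for instance by passing $info->{accession} to
accession_number() on the object, with display_id() as the input token.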

> So, I've been able to load my sequences by making sure they have an
> accession number.  Eventually I'll write a BaseSeqProcessor module to
> error-check my sequences at loading time.

Right. The reason I wrote this framework in the first place is that I
use it to massage various attributes after the object comes out of the
parser and before it reaches bioperl-db.

> Next things for me to figure out are the query system and
> updating/changing sequences that are already in the database.
> Thanks for your help

Sure, you're welcome. Good luck :-)


> ...I'll be back!
> Mike
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
