[Bioperl-l] loading data into bioperl-db
Hilmar Lapp
hlapp at gnf.org
Thu Jun 5 09:49:04 EDT 2003
On Thursday, June 5, 2003, at 03:03 AM, Michael Thon wrote:
>
> into genbank format using SeqIO, the file looks like:
>
> LOCUS NCU10032.1 31 aa linear UNK
> DEFINITION NCU10032.1 hypothetical protein (301 - 1378)
> ACCESSION unknown
            ^^^^^^^
This will give rise to trouble.
> FEATURES Location/Qualifiers
> ORIGIN
> 1 mtrqsiqsyr nrglggtrkm flyyffnylg *
> //
>
> and when I try to load into the database I get errors like:
>
> -- WARNING --
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
> ("NCU09800.1","","unknown","NCU09800.1 hypothetical protein (5834 -
> 4583)","0","linear") FKs (1,<NULL>)
> Duplicate entry 'unknown-1-0' for key 2
> --
> DBD::mysql::st execute failed: Duplicate entry 'unknown-1-0' for key 2
> at /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BaseDriver.pm line 922,
> <GEN0> line 14741.
>
> If, during conversion of this file to genbank format, I specify an
> accession number, then the sequences will load. It looks like an
> accession number is required by the database and/or loading script.
Accession is a mandatory field in the schema. The trouble here is that
the conversion assigned 'unknown' as the accession number of every
sequence, so every entry after the first collides with it on the unique
key named in the error message.
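To catch this before loading, the records can be checked for colliding keys. The sketch below is in Python purely for illustration. It assumes the unique key is the triple (accession, namespace id, version), which is only inferred from the 'unknown-1-0' value in the error message. The function name duplicate_keys and the sample records are hypothetical.

```python
from collections import Counter

def duplicate_keys(records, namespace="1"):
    """Return the (accession, namespace, version) keys used by more than one record.

    `records` is an iterable of (accession, version) pairs. The key layout is
    an assumption read off the 'unknown-1-0' value in the error message.
    """
    counts = Counter((accession, namespace, version) for accession, version in records)
    return sorted(key for key, n in counts.items() if n > 1)

# Two sequences that both ended up with the accession 'unknown', plus one
# that was given a real accession during conversion.
records = [("unknown", 0), ("unknown", 0), ("NCU10032.1", 0)]
print(duplicate_keys(records))
```

An empty result means no two records share a key, so a duplicate-entry error of this kind should not occur for that batch.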
> It also looks to me like when sequences are read by
> Bio::SeqIO::fasta the
> accession number is set to 'unknown'
Seems so. We need to check this. If it's true I'd consider it a bug.
Leaving it undefined wouldn't solve your problem though.
> Is there ever a case where
> Bio::SeqIO::fasta will parse a sequence header like :
>
> >gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea
>
> and read the namespace, accession, version etc from it?
No. Bioperl itself does not interpret the identifier token, especially
since people pack information into it in many different ways, and since
it is relatively simple to apply whatever extraction is suitable in
1 or 2 lines of perl.
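As an illustration of such an extraction, here is a minimal sketch for the NCBI-style token quoted above. It is written in Python for the sake of a self-contained example; the same regular expression carries over to a single Perl match. The field layout gi|number|namespace|accession.version|locus is an assumption based on that one header, and the names NCBI_ID and parse_ncbi_id are hypothetical.

```python
import re

# Assumed layout: gi|<gi number>|<namespace>|<accession>[.<version>]|<locus>
NCBI_ID = re.compile(r"^gi\|(\d+)\|(\w+)\|([^.|]+)(?:\.(\d+))?\|(\S*)$")

def parse_ncbi_id(token):
    """Split an NCBI-style identifier token into its parts, or return None."""
    m = NCBI_ID.match(token)
    if m is None:
        return None  # not in the assumed layout; leave the identifier alone
    gi, namespace, accession, version, locus = m.groups()
    return {
        "gi": gi,
        "namespace": namespace,
        "accession": accession,
        "version": int(version) if version else None,
        "locus": locus,
    }

header = "gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea"
token = header.split(None, 1)[0]  # identifier token is everything up to the first whitespace
print(parse_ncbi_id(token))
```

Identifiers that do not follow this layout, such as NCU10032.1, return None, so the caller can fall back to using the whole token as the accession.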
>
> So, I've been able to load my sequences by making sure they have an
> accession number. Eventually I'll write a BaseSeqProcessor module to
> error-check my sequences at loading time.
Right. The reason I wrote this framework in the first place is that I
use it to massage various attributes between when the object comes out
of the parser and when it reaches bioperl-db.
>
>
> Next things for me to figure out are the query system and
> updating/changing sequences that are already in the database.
> Thanks for your help
Sure, you're welcome. Good luck :-)
-hilmar
> ...I'll be back!
> Mike
>
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------