[Bioperl-l] loading data into bioperl-db

Wed Jun 4 10:44:37 EDT 2003

On Wednesday, June 4, 2003, at 04:17  AM, Michael Thon wrote:

> Greetings - I've been tinkering with bioperl-db and I'm having trouble
> loading data into the database.  Sequences that I download from genbank
> in genbank format load ok but sequences in fasta format do not.  I
> cannot load my own sequences in fasta format or genbank format.  An
> example session with load_seqdatabase.pl where I try to load my own
> data in genbank format is shown below.  I first converted my sequences to
> genbank format from fasta format prior to running load_seqdatabase.pl.

How did you do that? Using bioperl SeqIO?

>  I suspect that:
>
> 1) FASTA-formatted sequences are not supported.

Every format that is supported by bioperl Bio::SeqIO is theoretically  
also supported by bioperl-db, because it serializes objects returned by  
the SeqIO parser selected by the user.

You can supply the input format to load_seqdatabase.pl using the  
--format commandline option. (--format fasta obviously selects FASTA  
format)
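
Since the right value for --format depends on what the file actually
contains rather than on its name, a quick look at the first non-blank
line can help before loading. The following is only a sketch:
guess_seq_format is a hypothetical helper, not part of bioperl or
bioperl-db, and it only distinguishes FASTA (first line starts with '>')
from GenBank flat files (first line starts with LOCUS).

```shell
# guess_seq_format FILE
# Print "fasta", "genbank", or "unknown" based on the first non-blank line.
# Hypothetical helper for illustration; not part of bioperl-db.
guess_seq_format() {
    # Print the first line that contains a non-whitespace character, then stop.
    first=$(sed -n '/[^[:space:]]/{p;q;}' "$1")
    case "$first" in
        '>'*)   echo fasta ;;
        LOCUS*) echo genbank ;;
        *)      echo unknown ;;
    esac
}

# Demo on a throwaway sample file.
sample=$(mktemp)
printf '>gnl|NCSU_FGL|NCU10032.1\nDFGDHGDYLILPA\n' > "$sample"
guess_seq_format "$sample"    # prints: fasta
rm -f "$sample"
```

The result could then be passed on, e.g. --format "$(guess_seq_format
myfile)", once it is neither empty nor "unknown".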

>
> 2) my own genbank formatted files are not properly populated.

This may easily be an issue with how you created those files. There are  
two things to keep in mind if you produce formats in a 'home-grown'  
way. First, the input is going to be parsed with a Bio::SeqIO parser  
and hence is subject to any constraints that parser is under, and  
second, biosql (and therefore also bioperl-db) makes certain  
assumptions about the uniqueness of identifiers and accession numbers.

In particular, for your case the combination of  
(accession,version,namespace) is assumed to be unique. Also the  
primary_id is assumed to be unique. (There was a discussion on biosql-l  
a while ago on whether or not primary_id should be unique only within a  
namespace or by itself. This distinction is irrelevant to your problem,  
because you didn't use different namespaces anyway.)

These assumptions are normally met with 'standard' input parsed by  
Bio::SeqIO. Most SeqIO parsers, including fasta, don't set primary_id()  
unless they have a good guess for it, like the GI number assigned by  
the genbank parser.
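
Because of these uniqueness assumptions, it can be worth checking a
home-grown FASTA file for repeated identifiers before loading it. The
sketch below is only a rough pre-check under the assumption that the
first token after '>' is what ends up as the identifier; it does not
replicate how bioperl-db derives its keys, and fasta_duplicate_ids is a
hypothetical helper name.

```shell
# fasta_duplicate_ids FILE
# Print every identifier (first token after '>') that occurs more than once.
# Hypothetical helper for illustration; not part of bioperl-db.
fasta_duplicate_ids() {
    # Keep only header lines, reduce each to its first token, then report
    # the tokens that appear at least twice.
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1" | sort | uniq -d
}

# Demo on a throwaway sample file with one repeated identifier.
sample=$(mktemp)
printf '>seq1 first\nACGT\n>seq2\nGGCC\n>seq1 again\nTTAA\n' > "$sample"
fasta_duplicate_ids "$sample"    # prints: seq1
rm -f "$sample"
```

Empty output means no identifier is repeated within that file; it says
nothing about entries already present in the database.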

>
> Is anyone else out there using bioperl-db in their research?
>
> Thanks
> Mike
>
>
>
> $load_seqdatabase.pl --host localhost --format fasta --dbname myseqdb
> --dbuser biosql N_crassa_3_protein.genbank
> Loading N_crassa_3_protein.genbank ...
> DBD::mysql::st execute failed: Duplicate entry 'unknown-1-0' for key 2
> at /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BaseDriver.pm line 922,
> <GEN0> line 1.
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
> ("LOCUS","LOCUS","unknown","gnl|NCSU_FGL|NCU10032.1           31
> aa            linear
> UNK ","0","") FKs (1,<NULL>)
> Duplicate entry 'unknown-1-0' for key 2
> ---------------------------------------------------

This means (unknown,1,0) for (accession,namespace,version) was already  
in the database, and a look-up for "LOCUS" as the identifier failed.

The values printed in between the parentheses map to the following  
attributes, in the same order:

"display_id", "primary_id", "accession_number", "description",
"version", "division"

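(The FKs mean namespace id 1 and no taxon.) This means that all your  
identifier/accession information ended up in the description instead  
of in accession_number, which is how it came out of the SeqIO parser  
with your file as input. Note that the command line you quoted passes  
--format fasta for a file named .genbank; if the file really is in  
GenBank format, that mismatch alone could explain why "LOCUS" became  
the identifier. Could you send me one entry of that file? It'd be  
interesting to see what made the parser produce such results.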

Here is what you want to do.

	- If your FASTA-formatted files contain a token after the '>'  
character, you should be able to load them right away. Supply --format  
fasta and send the error output if there is an error.

	- I'd guess your files look like the following:

	>gnl|NCSU_FGL|NCU10032.1
	DFGDHGDYLILPA ...

	I assume you don't really want 'gnl|NCSU_FGL|NCU10032.1' as your  
accession number, but rather 'NCU10032', with version being 1. You  
could a) give up on the version and reformat the file using sed (or  
perl for that matter):

	    sed -e 's/^>.*|\([^[:space:]]*\)/>\1/' < file.fasta > new.fasta

	or b), you write a sequence processor and supply it to  
load_seqdatabase.pl. Check out the POD of Bio::Seq::BaseSeqProcessor,  
and the --pipeline commandline option of load_seqdatabase.pl. You could  
then also automatically set the namespace to NCSU_FGL, and attach a  
species object. Since the BaseSeqProcessor implements the whole  
framework, you don't need to implement much in your module, just  
process_seq().
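
For option a), a slightly more defensive variant of the sed approach
might look like the sketch below. It assumes headers of the form shown
in the guess above, uses the POSIX class [[:space:]] for whitespace, and
confines the match to the first token so that a '|' in the description
is left alone. strip_id_prefix is a hypothetical helper name.

```shell
# strip_id_prefix FILE
# Rewrite FASTA headers so that only the last '|'-separated field of the
# first token remains as the identifier; any description is kept as is.
# Hypothetical helper for illustration.
strip_id_prefix() {
    sed -e 's/^>[^[:space:]]*|\([^[:space:]]*\)/>\1/' "$1"
}

# Demo on a throwaway sample file.
sample=$(mktemp)
printf '>gnl|NCSU_FGL|NCU10032.1\nDFGDHGDYLILPA\n' > "$sample"
strip_id_prefix "$sample"
# prints:
# >NCU10032.1
# DFGDHGDYLILPA
rm -f "$sample"
```

Headers without a '|' and all sequence lines pass through unchanged. As
with the one-liner, the version stays part of the identifier.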

Hope this helps you get further.

	-hilmar

> Could not store unknown:
> ------------- EXCEPTION  -------------
> MSG: create: object (Bio::Seq) failed to insert or to be found by unique key
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:249
> STACK Bio::DB::Persistent::PersistentObject::store
> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:266
> STACK (eval) /home/mthon/bin/load_seqdatabase.pl:437
> STACK toplevel /home/mthon/bin/load_seqdatabase.pl:421
>
> --------------------------------------
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------