[Bioperl-l] loading data into bioperl-db
Hilmar Lapp
hlapp at gnf.org
Wed Jun 4 10:44:37 EDT 2003
On Wednesday, June 4, 2003, at 04:17 AM, Michael Thon wrote:
> Greetings - I've been tinkering with bioperl-db and I'm having trouble
> loading data into the database. Sequences that I download from genbank
> in genbank format load ok but sequences in fasta format do not. I
> cannot load my own sequences in fasta format or genbank format. An
> example session with load_seqdatabase.pl where I try to load my own
> data in genbank format is shown below. I first converted my sequences to
> genbank format from fasta format prior to running load_seqdatabase.pl.
How did you do that? Using bioperl SeqIO?
> I suspect that:
>
> 1) Fasta formatted sequences are not supported.
Every format that is supported by bioperl Bio::SeqIO is theoretically
also supported by bioperl-db, because it serializes objects returned by
the SeqIO parser selected by the user.
You can supply the input format to load_seqdatabase.pl using the
--format command-line option (for example, --format fasta selects FASTA
format).
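As a quick sanity check before loading, something like the following
shell sketch reports what a file's first record actually looks like, so
that it can be compared against the value given to --format. The
function name guess_seq_format is made up for illustration, and the
test is deliberately crude: it only assumes that FASTA records start
with '>' and GenBank records start with LOCUS.

```shell
# Hypothetical helper: guess the flat-file format from the first
# non-blank line of the file given as $1.
# Runs in a subshell (note the parentheses) so no variables leak out.
guess_seq_format() (
    first=$(grep -v '^[[:space:]]*$' "$1" | head -n 1)
    case $first in
        '>'*)   echo fasta ;;     # FASTA header line
        LOCUS*) echo genbank ;;   # GenBank LOCUS line
        *)      echo unknown ;;   # anything else: inspect by hand
    esac
)
```

If this prints fasta for a file you are about to load with --format
genbank (or the other way around), the parser will likely not find the
identifiers where it expects them. An "unknown" result should not be
passed on to --format.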
>
> 2) my own genbank formatted files are not properly populated.
This may easily be an issue with how you created those files. There are
two things to keep in mind if you produce formats in a 'home-grown'
way. First, the input is going to be parsed with a Bio::SeqIO parser
and hence is subject to any constraints that parser is under, and
second, biosql (and therefore also bioperl-db) makes certain
assumptions about the uniqueness of identifiers and accession numbers.
In particular, for your case the combination of
(accession,version,namespace) is assumed to be unique. Also the
primary_id is assumed to be unique. (There was a discussion on biosql-l
a while ago on whether or not primary_id should be unique only within a
namespace or by itself. This distinction is irrelevant to your problem,
because you didn't use different namespaces anyway.)
These assumptions are normally met with 'standard' input parsed by
Bio::SeqIO. Most SeqIO parsers, including fasta, don't set primary_id()
unless they have a good guess for it, like the GI number assigned by
the genbank parser.
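Since these uniqueness assumptions are easy to violate with home-grown
files, it may be worth checking for repeated identifiers before
loading. A minimal shell sketch, assuming the first
whitespace-delimited token after '>' is what ends up identifying the
entry (the function name fasta_duplicate_ids is hypothetical):

```shell
# Hypothetical helper: print every FASTA identifier (the first token
# after '>') that occurs more than once in the file given as $1.
fasta_duplicate_ids() {
    grep '^>' "$1" |
        sed -e 's/^>//' -e 's/[[:space:]].*$//' |  # keep only the first token
        sort |
        uniq -d                                     # one line per repeated identifier
}
```

No output means no identifier is repeated within that file. It says
nothing about entries already in the database under the same
namespace.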
>
> Is anyone else out there using bioperl-db in their research?
>
> Thanks
> Mike
>
>
>
> $load_seqdatabase.pl --host localhost --format fasta --dbname myseqdb
> --dbuser biosql N_crassa_3_protein.genbank
> Loading N_crassa_3_protein.genbank ...
> DBD::mysql::st execute failed: Duplicate entry 'unknown-1-0' for key 2
> at /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BaseDriver.pm line 922,
> <GEN0> line 1.
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
> ("LOCUS","LOCUS","unknown","gnl|NCSU_FGL|NCU10032.1 31
> aa linear
> UNK ","0","") FKs (1,<NULL>)
> Duplicate entry 'unknown-1-0' for key 2
> ---------------------------------------------------
This means (unknown,1,0) for (accession,namespace,version) was already
in the database, and a look-up for "LOCUS" as the identifier failed.
The values printed in between the parentheses map to the following
attributes, in the same order:
"display_id", "primary_id", "accession_number", "description",
"version", "division"
(The FKs mean namespace id 1 and no taxon.) This means that all your
identifier/accession information ended up in the description, instead
of accession_number, which is how it came out of the SeqIO genbank
parser with your file as input. Could you send me one entry of that
file? It'd be interesting to see what made the parser produce such
results.
Here is what you want to do.
- If your FASTA-formatted files contain a token after the '>'
character, you should be able to load them right away. Supply --format
fasta and send the error output if there is an error.
- I'd guess your files look like the following:
    >gnl|NCSU_FGL|NCU10032.1
    DFGDHGDYLILPA ...
I assume you don't really want 'gnl|NCSU_FGL|NCU10032.1' as your
accession number, but rather 'NCU10032', with version being 1. You
could a) give up on the version and reformat the file using sed (or
perl for that matter):
    sed -e 's/^>.*|\([^[:space:]]*\)/>\1/' < file.fasta > new.fasta
or b) write a sequence processor and supply it to
load_seqdatabase.pl. Check out the POD of Bio::Seq::BaseSeqProcessor,
and the --pipeline commandline option of load_seqdatabase.pl. You could
then also automatically set the namespace to NCSU_FGL, and attach a
species object. Since the BaseSeqProcessor implements the whole
framework, you don't need to implement much in your module, just
process_seq().
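To illustrate what such a processor (or a smarter reformatting step)
would have to extract, here is a minimal shell sketch that splits an
identifier of the form gnl|<namespace>|<accession>.<version> into its
three parts. parse_gnl_id is a hypothetical helper, not part of bioperl
or bioperl-db, and falling back to version 0 when there is no
'.<version>' suffix is an assumption.

```shell
# Hypothetical helper: split "gnl|<namespace>|<accession>.<version>"
# and print "<namespace> <accession> <version>" on one line.
# Runs in a subshell (note the parentheses) so no variables leak out.
parse_gnl_id() (
    id=$1
    case $id in
        'gnl|'*'|'*) ;;  # looks like gnl|namespace|something
        *) echo "not a gnl|namespace|accession identifier: $id" >&2
           exit 1 ;;
    esac
    rest=${id#"gnl|"}          # e.g. NCSU_FGL|NCU10032.1
    namespace=${rest%%"|"*}    # text before the first '|'
    acc_ver=${rest#*"|"}       # text after the first '|'
    case $acc_ver in
        *.*) accession=${acc_ver%.*}    # strip the last ".<version>"
             version=${acc_ver##*.} ;;
        *)   accession=$acc_ver
             version=0 ;;               # assumption: default version
    esac
    printf '%s %s %s\n' "$namespace" "$accession" "$version"
)
```

For the header guessed above, parse_gnl_id 'gnl|NCSU_FGL|NCU10032.1'
prints "NCSU_FGL NCU10032 1". A process_seq() implementation would
assign the same three pieces to the namespace, accession number and
version of each sequence object.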
Hope this helps you get further.
-hilmar
> Could not store unknown:
> ------------- EXCEPTION -------------
> MSG: create: object (Bio::Seq) failed to insert or to be found by unique key
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:249
> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.0/Bio/DB/Persistent/PersistentObject.pm:266
> STACK (eval) /home/mthon/bin/load_seqdatabase.pl:437
> STACK toplevel /home/mthon/bin/load_seqdatabase.pl:421
>
> --------------------------------------
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------