[Bioperl-l] basic problems with bioperl-db/biosql
Hilmar Lapp
hlapp at gmx.net
Sun Oct 17 00:19:43 EDT 2004
On Thursday, October 14, 2004, at 08:08 AM, Mikko Arvas wrote:
> Hi,
>
> I am trying to get started using bioperl-db, but I am failing
> miserably.
> I got bioperl 1.4 and the latest bioperl-db and biosql tarballs from
> CVS on
> SuSe 8.1.
There is a fix to the GO-format ontology parser that is on the 1.4
branch but not in the 1.4.0 release. If you intend to load GO or
SO/SOFA, you will want that fix, so upgrade from the 1.4 CVS
branch.
>
> Installation gave one error:
> t/simpleseq.....ok 6/59gzip: t/data/Titin.fasta.gz: No such file or
> directory
> Can't call method "namespace" on an undefined value at t/simpleseq.t
> line
> 48.
Sorry about that, I forgot to add the file. It's in CVS now.
>
> I tried:
>> perl load_seqdatabase.pl --dbname biosql --format fasta test.fasta
> Loading test.fasta ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
> ("NCRA-XX3-01-000002","NCRA-XX3-01-000002","unknown","NCU10033.1
> predicted protein (1437 - 1101)","0","") FKs (9,<NULL>)
> Duplicate entry 'unknown-9-0' for key 2
> ---------------------------------------------------
Technically, this is because a previous entry also had 'unknown' as
accession number and version 0.
Practically, this is not your fault. It will inevitably happen if you
load fasta format as-is, without fixing up the IDs. The reason is that
$seq->accession_number is not set by the bioperl fasta parser (and,
worse yet, defaults to 'unknown'), and $seq->primary_id as well as
$seq->display_id are set to the ID part from the fasta description
line. Biosql mandates through a unique key constraint that the
combination of accession number, version, and namespace be unique,
which it can't be if all accession numbers have the same value.
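The collision can be reproduced without a database. What follows is a minimal Perl sketch, not BioSQL code: a hash stands in for the unique key, the helper name try_insert is hypothetical, and the namespace id 9 is simply taken from the FKs shown in the warning above (the key order accession-namespace-version is read off the 'unknown-9-0' in the error message).

```perl
use strict;
use warnings;

# Stand-in for the unique key on (accession, namespace, version).
my %unique_key;

# Hypothetical helper: returns 1 if the "row" was inserted, 0 on a duplicate.
sub try_insert {
    my ($accession, $namespace_id, $version) = @_;
    my $key = join('-', $accession, $namespace_id, $version);
    return 0 if exists $unique_key{$key};
    $unique_key{$key} = 1;
    return 1;
}

# Two different fasta records; the parser leaves both accessions at 'unknown'.
my @results = map { try_insert('unknown', 9, 0) } 1 .. 2;
print "first insert: $results[0], second insert: $results[1]\n";
```

The second insert is rejected even though the two records are different sequences, because nothing in the key distinguishes them.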
If your sequences come in fasta format, you almost certainly want to
write your own SequenceProcessor (see POD for
Bio::Factory::SequenceProcessorI) to set the IDs straight. With
Bio::Seq::BaseSeqProcessor as a base class, this is a relatively simple
task. As an example:
package My::FastaSeqProcessor;
use vars qw(@ISA);
use strict;
use Bio::Seq::BaseSeqProcessor;

@ISA = qw(Bio::Seq::BaseSeqProcessor);

sub process_seq {
    my ($self, $seq) = @_;
    # I don't think the fasta ID qualifies as primary_id
    $seq->primary_id(undef);
    # we do want to have an accession number
    $seq->accession_number($seq->display_id);
    # there's many more things you could do here ...
    return ($seq); # make sure you return an array!
}

1;
You then pipeline your sequence objects through your module using
--pipeline "My::FastaSeqProcessor".
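To see what the process_seq logic above does to the IDs without needing bioperl or a database, here is a small sketch. My::MockSeq is a stand-in written for this illustration, not Bio::Seq, and fix_ids is a hypothetical plain-function copy of the two ID assignments; the starting values mimic what the fasta parser is described as producing.

```perl
use strict;
use warnings;

# Stand-in for a sequence object; NOT Bio::Seq, just three accessors.
package My::MockSeq;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

sub display_id {
    my $self = shift;
    $self->{display_id} = shift if @_;
    return $self->{display_id};
}

sub primary_id {
    my $self = shift;
    $self->{primary_id} = shift if @_;
    return $self->{primary_id};
}

sub accession_number {
    my $self = shift;
    $self->{accession_number} = shift if @_;
    return $self->{accession_number};
}

package main;

# Same ID logic as process_seq in My::FastaSeqProcessor, as a plain function.
sub fix_ids {
    my ($seq) = @_;
    $seq->primary_id(undef);
    $seq->accession_number($seq->display_id);
    return ($seq);
}

# Mimic what the fasta parser hands over, per the explanation above.
my $seq = My::MockSeq->new(
    display_id       => 'NCRA-XX3-01-000002',
    primary_id       => 'NCRA-XX3-01-000002',
    accession_number => 'unknown',
);
my ($fixed) = fix_ids($seq);
print $fixed->accession_number, "\n";
```

After the call, every record carries its own fasta ID as accession number, so the accession/version/namespace combination can be unique again.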
> Could not store unknown:
> ------------- EXCEPTION -------------
> MSG: You're trying to lie about the length: is 56 but you say 1161
This is caused by the previous problem. When the INSERT fails,
bioperl-db tries to find the entry in the database whose existence
caused the unique key violation. In your case, the entry it finds is
totally unrelated, and the sequence length gets set to something that
has nothing to do with the sequence you are loading.
> etc....
>
> It reads always just one sequence to bioentry/biosequence tables in
> regardless of the number in file and there are no duplicates in the
> file.
Any exception will cause the script to die, unless you supply --safe on
the command line. (Exceptions thrown by the SeqIO parser will still
cause the script to die, though.) In your case that wouldn't have
helped, because the problem isn't isolated to a few entries in the
file. Generally speaking, though, --safe is preferable once you know
that your file is mostly OK.
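The difference between the two modes can be pictured with a plain Perl loop. This is only an illustration of the behaviour described above, not the actual code of load_seqdatabase.pl; store_entry and load_all are hypothetical names, and the sketch covers store errors only, not parser errors.

```perl
use strict;
use warnings;

# Hypothetical stand-in for storing one entry; dies on a "bad" entry.
sub store_entry {
    my ($entry) = @_;
    die "could not store $entry\n" if $entry =~ /^bad/;
    return 1;
}

# Returns the number of entries stored; rethrows the first error unless $safe.
sub load_all {
    my ($safe, @entries) = @_;
    my $stored = 0;
    for my $entry (@entries) {
        my $ok = eval { store_entry($entry); 1 };
        if ($ok) {
            $stored++;
            next;
        }
        die $@ unless $safe;    # default: the first exception ends the run
        warn "skipping: $@";    # safe mode: report the error and continue
    }
    return $stored;
}

my @entries = ('seq1', 'bad2', 'seq3');
my $stored_safe = load_all(1, @entries);
print "stored with safe mode: $stored_safe\n";
```

If every entry fails for the same reason, as with the 'unknown' accession numbers, safe mode just turns one fatal error into a long list of warnings.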
>
> I tried:
>
> #!/usr/bin/perl -w
> use strict;
> use warnings;
> use Bio::DB::BioDB;
> use Bio::SeqIO;
> my $db = Bio::DB::BioDB->new(
> -database => 'biosql',
> -user => 'root',
> -dbname => 'biosql',
> -host => 'localhost',
> -driver => 'mysql');
> my $in = Bio::SeqIO->new(-format => 'fasta',
> -file => 'just_one_seq.fasta');
> my $seq = $in->next_seq();
> my $pseq = $db->create_persistent($seq);
> $pseq->namespace('bioperl');
> $pseq->create();
>
> No error messages, but nothing goes into the database.
I suspect it does go into the database, but since you don't commit, it
gets rolled back when the script terminates. Try adding $pseq->commit.
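For reference, this is the quoted script with the commit added at the end. The connection parameters and file name are copied from the original post and need adjusting to your setup; running it requires bioperl, bioperl-db, and a reachable biosql database.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::BioDB;
use Bio::SeqIO;

# Connection parameters as in the quoted script.
my $db = Bio::DB::BioDB->new(
    -database => 'biosql',
    -user     => 'root',
    -dbname   => 'biosql',
    -host     => 'localhost',
    -driver   => 'mysql',
);

my $in = Bio::SeqIO->new(
    -format => 'fasta',
    -file   => 'just_one_seq.fasta',
);

my $seq  = $in->next_seq();
my $pseq = $db->create_persistent($seq);
$pseq->namespace('bioperl');
$pseq->create();

# Without this, the insert is rolled back when the script terminates.
$pseq->commit();
```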
> Basic DBI
> connections outside bioperl work, the OBDA howto example of getting
> sequences from EMBL works, but I can't figure out a way to get the
> single
> sequence put to biosql by load_seqdatabase.pl out of there by
> Bio::DB::Registry.
I believe the latest update on OBDA/Bio::DB::Registry working off of
Biosql was that it doesn't. I haven't had a chance to track the
reported issue down. Bioperl-db etc should definitely work though.
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------