[Bioperl-l] basic problems with bioperl-db/biosql

Sun Oct 17 00:19:43 EDT 2004

On Thursday, October 14, 2004, at 08:08  AM, Mikko Arvas wrote:

> Hi,
>
> I am trying to get started using bioperl-db, but I am failing 
> miserably.
> I got bioperl 1.4 and the latest bioperl-db and biosql tarballs from 
> CVS on
> SuSe 8.1.

There is a fix to the GO-format ontology parser that is on the 1.4 
branch but not in the 1.4.0 release. If you intend to load GO or 
SO/SOFA, you will want that fix, so update from the 1.4 CVS 
branch.

>
> Installation gave one error:
> t/simpleseq.....ok 6/59gzip: t/data/Titin.fasta.gz: No such file or
> directory
> Can't call method "namespace" on an undefined value at t/simpleseq.t 
> line
> 48.

Sorry about that, I forgot to add the file. It's in CVS now.

>
> I tried:
>> perl load_seqdatabase.pl  --dbname biosql --format fasta  test.fasta
> Loading test.fasta ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
> ("NCRA-XX3-01-000002","NCRA-XX3-01-000002","unknown","NCU10033.1
> predicted protein (1437 - 1101)","0","") FKs (9,<NULL>)
> Duplicate entry 'unknown-9-0' for key 2
> ---------------------------------------------------

Technically, this is because a previous entry also had 'unknown' as 
accession number and version 0.

Practically, this is not your fault. It will inevitably happen if you 
load fasta format as is. The reason is that $seq->accession_number is 
not set by the bioperl fasta parser (and, worse yet, defaults to 
'unknown'), whereas $seq->primary_id and $seq->display_id are both set 
to the ID part of the fasta description line. Biosql mandates through a 
unique key constraint that the combination of accession number, 
version, and namespace be unique, which it can't be if all accession 
numbers have the same value (and version and namespace are the same 
too).
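
To make the collision concrete, here is a minimal sketch in plain Perl 
(no bioperl or database needed) that mimics the unique key with a hash. 
The insert_entry function and the %seen hash are hypothetical stand-ins 
for illustration only, not part of bioperl-db; the key layout merely 
follows the 'unknown-9-0' shown in the error message.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the unique key on the entry table:
# the combination of accession, namespace id and version must be unique.
my %seen;

sub insert_entry {
    my ($accession, $namespace_id, $version) = @_;
    my $key = join '-', $accession, $namespace_id, $version;
    return 0 if $seen{$key}++;    # key already present: the "INSERT" fails
    return 1;                     # first time we see this key: accepted
}

# Every fasta record arrives with accession 'unknown' and version 0,
# so only the first one can be inserted.
my @results = map { insert_entry('unknown', 9, 0) } 1 .. 3;
print "@results\n";
```

Only the first call succeeds; every later record produces the same key 
and is rejected, which is the situation the warning reports.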

If your sequences come in fasta format, you almost certainly want to 
write your own SequenceProcessor (see the POD for 
Bio::Factory::SequenceProcessorI) to set the IDs straight. With 
Bio::Seq::BaseSeqProcessor as a base class, this is a relatively simple 
task. As an example:

package My::FastaSeqProcessor;
use vars qw(@ISA);
use strict;
use Bio::Seq::BaseSeqProcessor;
@ISA = qw(Bio::Seq::BaseSeqProcessor);

sub process_seq{
     my ($self,$seq) = @_;
     # I don't think the fasta ID qualifies as primary_id
     $seq->primary_id(undef);
     # we do want to have an accession number
     $seq->accession_number($seq->display_id);
     # there are many more things you could do here ...
     return ($seq); # make sure you return a list (array), not a scalar!
}
1;

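You then pipeline your sequence objects through your module using 
--pipeline "My::FastaSeqProcessor". The module needs to be saved as 
My/FastaSeqProcessor.pm in a directory that perl searches (e.g., one 
listed in PERL5LIB) so that the script can load it.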
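
Before loading anything, you can check what the processor does to your 
records without touching the database. The following is a minimal 
sketch, assuming bioperl is installed; it inlines the processor so that 
it fits in one file, and the two fasta records are made up for 
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The processor from above, inlined so the example is a single file.
package My::FastaSeqProcessor;
use base 'Bio::Seq::BaseSeqProcessor';

sub process_seq {
    my ($self, $seq) = @_;
    $seq->primary_id(undef);                     # fasta ID is not a primary_id
    $seq->accession_number($seq->display_id);    # use the fasta ID as accession
    return ($seq);
}

package main;
use Bio::SeqIO;

# Two made-up records; the IDs are illustrative only.
my $fasta = ">seqA first test record\nACGTACGT\n"
          . ">seqB second test record\nTTGGCCAA\n";
open(my $fh, '<', \$fasta) or die "cannot open in-memory fasta: $!";

my $in   = Bio::SeqIO->new(-format => 'fasta', -fh => $fh);
my $proc = My::FastaSeqProcessor->new();
$proc->source_stream($in);    # the processor pulls sequences from the parser

my @processed;
while (my $seq = $proc->next_seq) {
    push @processed, $seq;
    printf "%s\t%s\n", $seq->display_id, $seq->accession_number;
}
```

If the accession column matches the display ID for every record and no 
two records share an ID, the unique key should no longer be violated.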

> Could not store unknown:
> ------------- EXCEPTION  -------------
> MSG: You're trying to lie about the length: is 56 but you say 1161

This is a consequence of the previous problem. When the INSERT fails, 
bioperl-db tries to find the entry in the database whose existence 
caused the unique key violation. In your case that entry is totally 
unrelated, so the sequence length gets set to something that has 
nothing to do with the sequence you are trying to store.

> etc....
>
> It reads always just one sequence to bioentry/biosequence tables in
> regardless of the number in file and there are no duplicates in the 
> file.

Any exception will cause the script to die unless you supply --safe on 
the command line. (Exceptions thrown by the SeqIO parser will still 
cause the script to die, though.) In your case that wouldn't have 
helped, because the problem isn't isolated to a few entries in the 
file. Still, using --safe is usually preferable once you know that 
your file is generally OK.
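
Conceptually, the difference between the two modes looks like the 
sketch below. This is not the actual code of load_seqdatabase.pl; 
store_all and the toy storage routine are hypothetical names used only 
to illustrate trapping an exception per entry versus letting it 
propagate.

```perl
use strict;
use warnings;

# Hypothetical helper: store each record, optionally trapping failures.
sub store_all {
    my ($records, $store, $safe) = @_;
    my $stored = 0;
    for my $rec (@$records) {
        if ($safe) {
            # trap the exception, report it, and move on to the next record
            if (eval { $store->($rec); 1 }) {
                $stored++;
            } else {
                warn "Could not store $rec: $@";
            }
        } else {
            $store->($rec);    # any exception propagates and ends the run
            $stored++;
        }
    }
    return $stored;
}

# Toy storage routine that rejects exactly one record.
my $store = sub { die "bad record\n" if $_[0] eq 'seq2'; return 1 };

my $n = store_all([qw(seq1 seq2 seq3)], $store, 1);
print "stored $n of 3 in safe mode\n";
```

In safe mode the one bad record is reported and skipped; without it, 
the first failure ends the whole run.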

>
> I tried:
>
> #!/usr/bin/perl -w
> use strict;
> use warnings;
> use Bio::DB::BioDB;
> use Bio::SeqIO;
> my  $db = Bio::DB::BioDB->new(
>         -database => 'biosql',
>         -user   => 'root',
>         -dbname => 'biosql',
>         -host   => 'localhost',
>         -driver => 'mysql');
> my $in = Bio::SeqIO->new(-format => 'fasta',
> 			-file => 'just_one_seq.fasta');
> my $seq = $in->next_seq();
> my $pseq = $db->create_persistent($seq);
> $pseq->namespace('bioperl');
> $pseq->create();
>
> No error messages, but nothing goes into the database.

I suspect it does go into the database, but since you don't commit, it 
gets rolled back when the script terminates. Try adding $pseq->commit 
after $pseq->create.
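
For reference, a sketch of your script with the commit added. It assumes 
the same connection parameters and file name as in your version and 
needs a live biosql database to run; the line that sets the accession 
number is my addition, to avoid the 'unknown' default discussed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::BioDB;
use Bio::SeqIO;

# Connection parameters copied from the script quoted above.
my $db = Bio::DB::BioDB->new(
    -database => 'biosql',
    -user     => 'root',
    -dbname   => 'biosql',
    -host     => 'localhost',
    -driver   => 'mysql',
);

my $in  = Bio::SeqIO->new(-format => 'fasta', -file => 'just_one_seq.fasta');
my $seq = $in->next_seq() or die "no sequence found in just_one_seq.fasta\n";

# Give the entry a real accession instead of the 'unknown' default.
$seq->accession_number($seq->display_id);

my $pseq = $db->create_persistent($seq);
$pseq->namespace('bioperl');
$pseq->create();
$pseq->commit();    # without this the insert is rolled back at exit
```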

>  Basic DBI
> connections outside bioperl work, the OBDA howto example of getting
> sequences from EMBL works, but I can't figure out a way to get the 
> single
> sequence put to biosql by load_seqdatabase.pl out of there by
> Bio::DB::Registry.

The last I heard, OBDA/Bio::DB::Registry does not currently work off of 
Biosql. I haven't had a chance to track the reported issue down. 
Bioperl-db etc. should definitely work, though.

	-hilmar

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------