[Bioperl-l] basic problems with bioperl-db/biosql

Hilmar Lapp hlapp at gnf.org
Mon Oct 18 15:41:16 EDT 2004


On Oct 18, 2004, at 7:11 AM, Mikko Arvas wrote:

>
> However if I have in BioSQL something like this in bioentry and 
> correspondingly
> in biosequence:
> +-------------+----------------+----------+-----------+------------+
> | bioentry_id | biodatabase_id | name     | accession | identifier |
> +-------------+----------------+----------+-----------+------------+
> |           9 |              9 | YAL001C  | unknown   | YAL001C    |
> |          12 |              9 | XX0115.2 | ma00001   | XX0115.2   |
> +-------------+----------------+----------+-----------+------------+
>
> and I do this:
>
> #!/usr/bin/perl -w
> use strict;
> use warnings;
> use Bio::Seq;
> use Bio::Seq::SeqFactory;
> use Bio::DB::BioDB;
>
> # Connect to the BioSQL database.
> my $db = Bio::DB::BioDB->new(-database => 'biosql', -user => 'root',
>                              -dbname   => 'biosql', -host => 'localhost',
>                              -driver   => 'mysql');
> # Template object: only primary_id and namespace are set.
> my $seq = Bio::Seq->new(-primary_id => "XX0115.2",
>                         -namespace  => "bioperl");
> my $seqfact = Bio::Seq::SeqFactory->new(-type => "Bio::Seq");
> # Look the template up through the adaptor's unique-key search.
> my $adp   = $db->get_object_adaptor($seq);
> my $dbseq = $adp->find_by_unique_key($seq, -obj_factory => $seqfact);
>
> I get the XX0115.2, but if that doesn't exist in the database I get 
> YAL001C instead,
> which is a little bit funny.

It is due to the (stupid IMO) rule in bioperl that the default value 
for the accession number is 'unknown'. Also, there are multiple unique 
keys on bioentry, and the adaptor will try each of them in turn until 
it finds a match. So, if the lookup by the identifier (primary_id) you 
set fails, it will try the accession number ('unknown') and version 
(default 0), which will match the YAL001C entry.
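
To make the fallback concrete, here is a minimal, self-contained sketch 
in plain Perl that imitates the behaviour described above on the two 
rows from the example. It does not use or reproduce the real bioperl-db 
adaptor: the helper find_by_unique_key below is a hypothetical stand-in, 
the key order (identifier first, then accession plus version) follows 
the description above, version 0 for both rows is an assumption, and 
the namespace part of each key is left out for brevity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the bioentry table from the example.
# The version column is not shown in the original listing; 0 is assumed.
my @bioentry = (
    { bioentry_id => 9,  name => 'YAL001C',  accession => 'unknown',
      identifier  => 'YAL001C',  version => 0 },
    { bioentry_id => 12, name => 'XX0115.2', accession => 'ma00001',
      identifier  => 'XX0115.2', version => 0 },
);

# Hypothetical helper (not the bioperl-db API): try each unique key in
# turn and return the first row that matches on all columns of that key.
sub find_by_unique_key {
    my (%query) = @_;
    my @unique_keys = (
        [qw(identifier)],           # tried first
        [qw(accession version)],    # fallback
    );
    for my $key (@unique_keys) {
        # Skip a key if the query has no value for one of its columns.
        next if grep { !defined $query{$_} } @$key;
        for my $row (@bioentry) {
            my $mismatches = grep { $row->{$_} ne $query{$_} } @$key;
            return $row unless $mismatches;
        }
    }
    return;    # nothing matched on any key
}

# Only the identifier was set by the caller; accession and version carry
# the defaults ('unknown' and 0), so the fallback key can still match.
my $hit = find_by_unique_key(
    identifier => 'NOT_IN_DB',
    accession  => 'unknown',
    version    => 0,
);
print $hit ? "matched $hit->{name}\n" : "no match\n";
```

With an identifier that is not in the table, the first key finds 
nothing and the second key matches the row whose accession was left at 
'unknown', which is the surprising result reported above.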

That's why it is almost never a good idea to let sequences with 
accession number 'unknown' into your database ... (apart from the fact 
that you can have only one per namespace anyway unless you increment 
versions ...).

>  If I use instead:
> my $seq =Bio::Seq->new(-accession => "XX0115.2", -namespace => 
> "bioperl");
> it works as it should.
>
> Is this due to the fact that YAL001C doesn't have an accession, so I 
> should make sure that there is always an accession

Right. Accession is required (NOT NULL) in biosql, and it's not a good 
idea to leave it at a non-meaningful default.
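
One way to act on this is to check the accession before storing 
anything. The following is only a sketch: has_meaningful_accession is a 
hypothetical helper, and ToySeq is a stand-in class so that the example 
runs without BioPerl installed. The check itself relies only on an 
accession_number() method, which Bio::Seq objects provide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical guard: returns 1 only if the object carries a real
# accession, i.e. one that is defined, non-empty and not the
# placeholder default 'unknown'.
sub has_meaningful_accession {
    my ($seq) = @_;
    my $acc = $seq->accession_number;
    return (defined($acc) && length($acc) && $acc ne 'unknown') ? 1 : 0;
}

# Stand-in for a sequence object (assumption for illustration only);
# it mimics the 'unknown' default discussed in this thread.
package ToySeq {
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
    sub accession_number { $_[0]{accession_number} // 'unknown' }
}

for my $seq (ToySeq->new(accession_number => 'ma00001'), ToySeq->new()) {
    printf "%s: %s\n", $seq->accession_number,
        has_meaningful_accession($seq) ? 'ok to store' : 'fix before storing';
}
```

A check like this would sit in the loading script, before the object is 
handed to the database layer.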

Primary_id (or Identifier in biosql) is rather meant for 'internal' 
identifiers. E.g., NCBI's GI number, or the source primary key if you 
imported seqs from somewhere. Almost all 'identifiers' you encounter 
will be accessions, not primary_ids in the sense of bioperl.

>  or is the -primary_id search somehow problematic and should be
> avoided?

No, not at all, it's perfectly fine ...

Great if you find the software useful. That's the goal :-)

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


