[BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
Hilmar Lapp
hlapp at gnf.org
Mon Jun 7 19:52:26 EDT 2004
On Jun 3, 2004, at 1:49 AM, jochen wrote:
> Hi,
>
> I have a similar problem, namely I want to modify some sequences and
> store them back in the database, without overwriting any of the
> original
> sequences, basically this:
>
> # retrieve an existing sequence
> my $seq = Bio::Seq::RichSeq->new( -display_id => 'something' );
Note that display_id (bioentry.name) is not constrained by a unique
index and therefore you may easily get duplicate records (which will
cause an exception if searching by unique key).
> $seq = $seqadaptor->find_by_unique_key($seq);
>
> # make sure $seq isn't persistent anymore
> my $buffer = IO::String->new;
> my $out = Bio::SeqIO->new(-fh => $buffer, -format => 'embl');
> $out->write_seq($seq);
> $buffer->setpos(0);
> my $in = Bio::SeqIO->new(-fh => $buffer, -format => 'embl');
> $seq = $in->next_seq;
>
> # modify it a little
> $seq->primary_id('NEW001');
>
> # create a new copy (fails, just overwrites the old one)
> $seq->create()
With the above code, this line should throw a Perl error for calling a
non-existent method on an object: a sequence stream will never give
you a persistent object.
Should I assume that between the lines you created a persistent object
from the object that the SeqIO stream returned to you?
> A little debugging revealed that there are several unique constraints
> on the bioentry (using postgresql here), which prevent me from
> creating two objects, if they have
>
> o the same primary_id and/or
> o the same (accession_number,version,namespace)
>
> Isn't this an unnecessary restriction? Especially, why is primary_id a
> unique constraint, and not (primary_id,namespace)?
>
This was suggested before, and in fact you can change that constraint
to include the namespace. I thought it was in the schema as a
commented-out option, but apparently it is not (yet).
Bioperl-db will use, but not mandate, the namespace as an additional
constraint when doing a lookup by primary_id.
(accession_number,version,namespace) is a well-established uniqueness
constraint on sequences in order to guarantee a minimal amount of
sanity.
> Even worse, $seq->create in most cases doesn't give an error if there
> is already a similar sequence, but just writes over the existing
> sequence:
It doesn't write over an existing sequence. It will update the
attributes of the object you wanted to create to match those of the
existing object in the database, unless you pass in an object factory
(-obj_factory => $myseqfactory).
>
> In Bio/DB/BioSQL/BasePersistenceAdaptor.pm, lines 196-213, you try to
> insert the new object. If this fails, you conclude this object
> already exists and retrieve it from the DB. Now this behaviour is OK
> for creating the possibly missing foreign key objects. However, if I
> invoke create() on a sequence object, I'd expect this object to be
> newly created or to receive an error.
>
If that's what you expect, then run find_by_unique_key() first to make
sure it's not already present. (Note that this is still no guarantee,
because between the time you get the negative result and the time you
commit the create() transaction somebody else may have inserted the
same sequence.)

Note that the method is named create(), not insert_or_fail(). The
purpose is that after the call returns successfully, the object on which
you invoked create() has an equivalent entry in the database. It is not
an error if the row that you wanted to be present in the database is
already there.

If it were, you'd force the user to run, in almost all cases, the logic
you found there whenever an exception occurs. I.e., you'd require the
user to worry about a lot of absence/presence/concurrency/transactional
possibilities when all that he/she wanted was to make sure the sequence
(as identified by its unique key) is in the database.

Bioperl-db is not a SQL interface. It's an OR mapper. You use it if you
want to live and navigate in object land, not when you want to be close
to the RDBMS vibe. At least that's the goal ...
> What do you think about this? Did I miss something there?
>
> I'd suggest fixing that by introducing two different create functions
> (or a parameter) that control whether it's OK to retrieve a
> possibly existing object (i.e. when creating the foreign key
> objects) or whether the whole method should fail if there is
> already an existing object.
It's easily achievable on the client end by running
find_by_unique_key() first.
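A minimal sketch of that client-side check, under some assumptions:
create_or_fail() is a hypothetical helper name (it is not part of
bioperl-db), $db is taken to be the adaptor factory returned by
Bio::DB::BioDB->new(-database => 'biosql', ...), and $seq is a Bio::SeqI
object with accession_number, version and namespace set so that the
lookup has a real unique key to work with.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of bioperl-db: insert $seq only if no
# entry with the same unique key is found, and die otherwise.
sub create_or_fail {
    my ($db, $seq) = @_;
    my $adaptor = $db->get_object_adaptor('Bio::SeqI');

    # Look up by unique key first; undef means no matching entry.
    my $found = $adaptor->find_by_unique_key($seq);
    die "sequence already exists in the database\n" if $found;

    # Not found: wrap the plain object in a persistent one and insert it.
    # This is still not a guarantee. Another client may insert the same
    # key before we commit, in which case create() falls back to its
    # usual lookup of the existing row.
    my $pseq = $db->create_persistent($seq);
    $pseq->create();
    $pseq->commit();
    return $pseq;
}
```

Wrapping the call in eval { ... } lets the caller decide what to do
when the sequence is already there, e.g. skip it or report it.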
>
>> ...
>> # trigger insert by making the object forget
>> # its primary key
>> $pseq->primary_key(undef);
>> # we need to duplicate dependent objects
>> # (children) too, like features
>> foreach my $pfea ($pseq->get_SeqFeatures) {
>>     $pfea->primary_key(undef)
>>         if $pfea->isa("Bio::DB::PersistentObjectI");
>>     # features have locations
>>     $pfea->location->primary_key(undef)
>>         if $pfea->location->isa("Bio::DB::PersistentObjectI");
>> }
>> # do the insert
>> $pseq->create();
>
> assuming you just changed the namespace, this code example won't work,
> because you didn't change the primary_id, thus violating the unique
> constraint
Right. It wasn't meant as bullet-proof code. (Note that primary_id is
optional.)
I'm inclined to make the tuple of (identifier,namespace) the default
for the future; there seem to be too many subtle issues otherwise if
you're unsuspecting.
-hilmar
>
> kind regards
> -- jochen
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------