[Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id

Cymon Cox cy at cymon.org
Tue Jun 2 20:29:06 UTC 2009


2009/6/2 Cymon Cox <cy at cymon.org>

> 2009/6/2 <bugzilla-daemon at portal.open-bio.org>
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2833
>>
>> ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:00 EST -------
>> (In reply to comment #18)
>> > (In reply to comment #17)
>> > > How do you feel about this simplistic solution?: if the rules are present,
>> > > before loading a new record, do a query to check to make sure there isn't a
>> > > duplicate already present, and if there is raise an IntegrityError.
>> >
>> > Now that's a much better solution than the way I've been trying to go...
>> >
>> > This does the trick:
>> > ...
>> > +            if self.postgres_rules_present:
>> > +                self.adaptor.execute("SELECT bioentry_id FROM bioentry "
>> > +                                     "WHERE identifier = '%s'" % cur_record.id)
>> > +                if self.adaptor.cursor.fetchone():
>> > +                    raise self.adaptor.conn.IntegrityError("Duplicate record "
>> > +                        "detected: record has not been inserted")
>>
>> While the above code looks sensible, I don't think it covers all the cases yet.
>> Essentially the two bioentry rules relate to these two uniqueness rules in the
>> default schema:
>>
>> UNIQUE ( identifier , biodatabase_id )
>> UNIQUE ( accession , biodatabase_id , version )
>>
>> According to rule_bioentry_i1 (or the equivalent rule) we should allow the same
>> bioentry.identifier to appear in different namespaces (i.e. as long as
>> bioentry.biodatabase_id differs). i.e. something like this in your code:
>>
>> "SELECT bioentry_id FROM bioentry WHERE identifier = '%s' AND biodatabase_id = %s"
>> % (cur_record.id, self.dbid)
>>
>> Then for rule_bioentry_i2 we also need to check the accession, version and
>> biodatabase_id have not been used before.
>
>
> In principle, we should only have to check for the second case (accession,
> biodatabase_id, version) because the GenBank "gi numbers" (i.e. the
> identifier number) parallel the accession.version scheme. When a record
> changes, both the gi number changes and the version number is incremented.
> Hence, a unique accession.version implies a unique identifier. In the
> schema, the identifier can be NULL, presumably so that non-GenBank data can
> be stored provided it has a unique accession.version. If we were only to
> check case 2 (accession, biodatabase_id, version) the only way I can see to
> trigger the RULES bug would be to manually assign two different
> accession.version to two records but assign the same (presumably artificial)
> identifier number to both record.annotations["gi"].
>

Whoa, I see now that in Loader._load_bioentry_table, if
record.annotations["gi"] is missing, the identifier gets filled with
record.id (the accession.version):

        if "gi" in record.annotations :
            identifier = record.annotations["gi"]
        else :
            identifier = record.id

So Biopython's BioSQL identifiers are not equivalent to GenBank identifiers.
I wonder why this is done and the identifier is not just left NULL, with the
unique constraint maintained by accession/version...
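
For what it's worth, here is a rough sketch of a pre-insert check that covers
both uniqueness rules at once. It is only an illustration: find_duplicate and
the cut-down bioentry table are made up for this example (they are not
Biopython code), and it uses sqlite3 purely so that it runs standalone. With
psycopg2 the placeholders would be "%s" rather than "?".

```python
import sqlite3


def find_duplicate(cursor, biodatabase_id, identifier, accession, version):
    """Return the bioentry_id of a clashing row, or None if there is none.

    Hypothetical helper mirroring the two uniqueness rules:
      UNIQUE (identifier, biodatabase_id)
      UNIQUE (accession, biodatabase_id, version)
    """
    # Bound parameters instead of "%s" string formatting, so quoting is
    # left to the database driver.
    cursor.execute(
        "SELECT bioentry_id FROM bioentry "
        "WHERE biodatabase_id = ? "
        "AND (identifier = ? OR (accession = ? AND version = ?))",
        (biodatabase_id, identifier, accession, version),
    )
    row = cursor.fetchone()
    return row[0] if row else None


# Toy stand-in for the bioentry table (only the columns involved here).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE bioentry ("
    "bioentry_id INTEGER PRIMARY KEY, biodatabase_id INTEGER NOT NULL, "
    "identifier TEXT, accession TEXT NOT NULL, version INTEGER NOT NULL)"
)
cur.execute(
    "INSERT INTO bioentry (biodatabase_id, identifier, accession, version) "
    "VALUES (1, '12345', 'AB000001', 1)"
)

# Same identifier, same namespace, different accession: reported as a clash.
print(find_duplicate(cur, 1, "12345", "ZZ999999", 1))
# Same identifier and accession.version, different namespace: allowed.
print(find_duplicate(cur, 2, "12345", "AB000001", 1))
```

In the loader, a non-None result would be the point at which to raise the
IntegrityError. Note that a NULL identifier never matches in the first branch,
so such records would be checked on accession/version alone.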

Cheers, C.