[Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id

Cymon Cox cy at cymon.org
Tue Jun 2 19:39:18 UTC 2009


2009/6/2 <bugzilla-daemon at portal.open-bio.org>

> http://bugzilla.open-bio.org/show_bug.cgi?id=2833
>
> ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:00 EST -------
> (In reply to comment #18)
> > (In reply to comment #17)
> > > How do you feel about this simplistic solution?: if the rules are present,
> > > before loading a new record, do a query to check to make sure there isn't a
> > > duplicate already present, and if there is raise an IntegrityError.
> >
> > Now that's a much better solution than the way I've been trying to go...
> >
> > This does the trick:
> > ...
> > +            if self.postgres_rules_present:
> > +                self.adaptor.execute("SELECT bioentry_id FROM bioentry "
> > +                                     "WHERE identifier = '%s'" % cur_record.id)
> > +                if self.adaptor.cursor.fetchone():
> > +                    raise self.adaptor.conn.IntegrityError("Duplicate record "
> > +                        "detected: record has not been inserted")
>
> While the above code looks sensible, I don't think it covers all the cases yet.
> Essentially the two bioentry rules relate to these two uniqueness rules in the
> default schema:
>
> UNIQUE ( identifier , biodatabase_id )
> UNIQUE ( accession , biodatabase_id , version )
>
> According to rule_bioentry_i1 (or the equivalent rule) we should allow the same
> bioentry.identifier to appear in different namespaces (i.e. as long as
> bioentry.biodatabase_id differs). i.e. something like this in your code:
>
> "SELECT bioentry_id FROM bioentry WHERE identifier = '%s' "
> "AND biodatabase_id = %s" % (cur_record.id, self.dbid)
>
> Then for rule_bioentry_i2 we also need to check the accession, version and
> biodatabase_id have not been used before.


In principle, we should only have to check for the second case (accession,
biodatabase_id, version) because the GenBank "gi numbers" (i.e. the
identifier number) parallel the accession.version scheme. When a record
changes, the gi number changes and the version number is incremented.
Hence, a unique accession.version implies a unique identifier. In the
schema, the identifier can be NULL, presumably so that non-GenBank data can
be stored provided it has a unique accession.version. If we were only to
check case 2 (accession, biodatabase_id, version), the only way I can see to
trigger the RULES bug would be to manually assign two different
accession.version values to two records but assign the same (presumably
artificial) identifier number to both via record.annotations["gi"].

So, how likely is that? Well, not very, but perhaps we need to check both ;)
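
As a rough sketch of what checking both cases might look like, here is a
standalone example. It is not the BioSQL loader code: find_duplicate and
make_demo_db are hypothetical names, the table is a cut-down stand-in for
bioentry, and it runs against an in-memory SQLite database so it can be tried
without a PostgreSQL server. It also passes the values as query parameters
rather than building the SQL with string formatting (sqlite3 uses "?"
placeholders; psycopg2 would use "%s").

```python
import sqlite3


def find_duplicate(conn, biodatabase_id, identifier, accession, version):
    """Return the bioentry_id of a clashing row, or None (hypothetical helper)."""
    cur = conn.cursor()
    # Case 1: UNIQUE (identifier, biodatabase_id).
    # A NULL identifier never clashes, so skip the check in that case.
    if identifier is not None:
        cur.execute(
            "SELECT bioentry_id FROM bioentry "
            "WHERE identifier = ? AND biodatabase_id = ?",
            (identifier, biodatabase_id),
        )
        row = cur.fetchone()
        if row:
            return row[0]
    # Case 2: UNIQUE (accession, biodatabase_id, version).
    cur.execute(
        "SELECT bioentry_id FROM bioentry "
        "WHERE accession = ? AND biodatabase_id = ? AND version = ?",
        (accession, biodatabase_id, version),
    )
    row = cur.fetchone()
    return row[0] if row else None


def make_demo_db():
    """Build a simplified bioentry table holding one record (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bioentry ("
        "bioentry_id INTEGER PRIMARY KEY, "
        "biodatabase_id INTEGER NOT NULL, "
        "identifier TEXT, "
        "accession TEXT NOT NULL, "
        "version INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT INTO bioentry (biodatabase_id, identifier, accession, version) "
        "VALUES (1, '12345', 'AB000001', 1)"
    )
    return conn
```

In the loader, a non-None result would be the point at which to raise the
IntegrityError, before any insert is attempted.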

Perhaps we need to first define some unit tests of all the permutations,
because the code I submitted doesn't trigger any errors in the current suite.
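
Something along these lines is what I have in mind for the permutations. This
is only a sketch under assumptions: it uses an in-memory SQLite table with the
two UNIQUE constraints declared directly (so the database itself raises
IntegrityError), not the PostgreSQL schema with RULES, and the names SCHEMA,
CASES, clashes and DuplicatePermutationTests are made up for illustration. The
expected outcomes simply restate the two uniqueness rules.

```python
import sqlite3
import unittest

# Cut-down stand-in for the bioentry table, with both uniqueness rules.
SCHEMA = """
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL,
    identifier TEXT,
    accession TEXT NOT NULL,
    version INTEGER NOT NULL,
    UNIQUE (identifier, biodatabase_id),
    UNIQUE (accession, biodatabase_id, version)
)
"""
INSERT = (
    "INSERT INTO bioentry (biodatabase_id, identifier, accession, version) "
    "VALUES (?, ?, ?, ?)"
)

# Row loaded first: (biodatabase_id, identifier, accession, version).
FIRST = (1, "12345", "AB000001", 1)

# (description, second row to load, whether it should be rejected)
CASES = [
    ("exact duplicate", (1, "12345", "AB000001", 1), True),
    ("same identifier, new accession", (1, "12345", "AB000002", 1), True),
    ("same accession.version, new identifier", (1, "99999", "AB000001", 1), True),
    ("new version and new identifier", (1, "99999", "AB000001", 2), False),
    ("same record in another namespace", (2, "12345", "AB000001", 1), False),
]


def clashes(second):
    """Load FIRST then `second` into a fresh database; True if rejected."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute(INSERT, FIRST)
    try:
        conn.execute(INSERT, second)
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()
    return False


class DuplicatePermutationTests(unittest.TestCase):
    def test_permutations(self):
        for description, second, expected in CASES:
            with self.subTest(description):
                self.assertEqual(clashes(second), expected)
```

The real tests would need to run against a PostgreSQL database with the rules
installed, since that is where the silent non-insert happens.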

Cheers, C.


