[Bioperl-l] Error in loading into biosql database
Hilmar Lapp
hlapp at gnf.org
Tue Mar 25 18:00:51 EST 2003
On Tuesday, March 25, 2003, at 05:34 PM, Aaron J Mackey wrote:
>
> Does the CRC calculation include the identifier?
No. Only authors, location, and title.
> It seems that would solve your problem cleanly, but leave you with
> ugly data (two references that represent the same literature, with
> PubMed or Medline IDs).
Well, ugly is not really 'clean', is it? I.e., it would stop the unique
key (UK) violation from being thrown, but in a rather hacky way, not
cleanly IMHO. Also, it doesn't solve the partially populated object
problem. I.e., doing something like
$ref = Bio::Annotation::Reference->new(-medline => 2652651);
$dbref = $refadp->find_by_unique_key($ref);
will not work if the look-up goes by CRC, because the CRC of this
partially populated object will not be the same as the one in the
database, which was computed from a fully populated object. You want it
to search by medline ID in this case (or, more generally, try CRC, then
try medline, etc). You could still force it to by constructing an
explicit BioQuery object, but that's not being very nice to the user,
and it also requires special case code in the adaptor.
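To make the "try CRC, then try medline, etc" idea concrete, here is a
minimal sketch in plain Perl. It does not use the bioperl-db API:
ref_key(), find_reference(), and %index are hypothetical names, an MD5
digest stands in for the real CRC calculation, and the authors,
location, and title values are placeholders. It only illustrates that a
partially populated object yields a different checksum, so the look-up
has to fall through to the next key whose value is defined.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for the stored CRC: a digest over authors,
# location, and title only (the identifier is not included).
sub ref_key {
    my ($ref) = @_;
    return md5_hex(join "\0",
                   map { $ref->{$_} // '' } qw(authors location title));
}

# Toy in-memory "database": one fully populated reference, indexed by
# each of its alternative keys. All field values are placeholders.
my $stored = {
    authors  => 'Doe J., Roe R.',
    location => 'J. Example 1:1-10(1989)',
    title    => 'A placeholder title',
    medline  => 2652651,
};
my %index = (
    crc     => { ref_key($stored)   => $stored },
    medline => { $stored->{medline} => $stored },
    pubmed  => {},
);

# Try each alternative key in order, skipping keys whose value is
# undefined. Returns the hit and the name of the key that found it,
# or an empty list on a miss.
sub find_reference {
    my ($ref) = @_;
    my @probes = (
        [ crc     => ref_key($ref)   ],
        [ medline => $ref->{medline} ],
        [ pubmed  => $ref->{pubmed}  ],
    );
    for my $probe (@probes) {
        my ($key, $value) = @$probe;
        next unless defined $value;
        my $hit = $index{$key}{$value};
        return ($hit, $key) if $hit;
    }
    return;
}

# A partially populated object: only the medline ID is set.
my $partial = { medline => 2652651 };
my ($found, $via) = find_reference($partial);
print "found via $via\n" if $found;
```

With the placeholder data above the CRC probe misses and the medline
probe hits, so this prints "found via medline".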
>
> I was never in on this conversation, but what exactly is being stored by
> bioperl-db, and where? If I have both a medline and pubmed ID, which goes
> into the dbxref accession?
Right now the medline ID (medline()). pubmed() is ignored. (This can be
changed, but it won't solve the problem at hand.)
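A toy illustration of why that change alone would not help, again in
plain Perl and not using bioperl-db: store_reference(), ref_key(), and
the two hashes are hypothetical names, MD5 stands in for the CRC, the
PubMed ID and the authors, location, and title values are made-up
placeholders, and I am assuming the second occurrence carries both IDs.
The first occurrence has only a PubMed ID; the second is looked up by
its medline ID, misses, and the insert then collides on the checksum,
with or without the PubMed fallback.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for the CRC over authors, location, and title.
sub ref_key {
    my ($ref) = @_;
    return md5_hex(join "\0",
                   map { $ref->{$_} // '' } qw(authors location title));
}

my (%by_accession, %by_crc);   # toy tables; %by_crc plays the unique key

# Look up by the dbxref accession, insert on a miss, and refuse
# duplicate checksum values. $use_pubmed_fallback switches on the
# "use PubMed if Medline is undef" idea.
sub store_reference {
    my ($ref, $use_pubmed_fallback) = @_;
    my $acc = $ref->{medline};
    $acc = $ref->{pubmed} if !defined $acc && $use_pubmed_fallback;
    return 'found' if defined $acc && $by_accession{$acc};
    my $crc = ref_key($ref);
    return 'unique key violation' if $by_crc{$crc};
    $by_crc{$crc} = $ref;
    $by_accession{$acc} = $ref if defined $acc;
    return 'inserted';
}

# Placeholder bibliographic data shared by both occurrences.
my %common = (
    authors  => 'Doe J.',
    location => 'J. Example 1:1-10(2000)',
    title    => 'A placeholder title',
);
my $first  = { %common, pubmed => 11111111 };                       # PubMed ID only
my $second = { %common, pubmed => 11111111, medline => 20489853 };  # both IDs

for my $fallback (0, 1) {
    %by_accession = ();
    %by_crc       = ();
    my @results = map { store_reference($_, $fallback) } $first, $second;
    print "fallback=$fallback: ", join(', ', @results), "\n";
}
```

In both runs the first reference is inserted and the second ends in the
simulated unique key violation.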
> And does the other go into a dbxref_qualifier_value?
No. Could be done, but wouldn't help, unless you join in
dbxref_qualifier_value and term for the dbxref look-up query. Lots of
special case SQL construction code, doesn't apply in most cases (i.e.,
'regular' dbxref lookups), and is not going to be a fast lookup. As an
aside, the database cannot guarantee a unique return for this lookup
anymore (as there is no constraint enforcing it).
> Maybe both should go in qualifier_values that you
> can later test?
And what do you put into dbxref.accession then? Apart from that
question, the above still applies. I'm not sure why this would help.
-hilmar
>
> -Aaron
>
> On Tue, 25 Mar 2003, Hilmar Lapp wrote:
>
>> Siddhartha,
>>
>> according to the --debug log you sent and unlike you originally
>> reported there is no problem with accession# P42655, nor is there a
>> problem with the species insertion or look-up. Apparently one of my
>> instructions did fix the original problem.
>>
>> Instead, the exception that terminates the upload is due to a reference
>> entry with medline ID 20489853 in accession# P45954. This reference is
>> first encountered in accession# Q9UKU7, but without a medline ID
>> (swissprot only has the PubMed ID there). When it is encountered again
>> in accession# P45954, there is a Medline ID specified. The look-up for
>> the Medline ID 20489853 fails, triggering an insert, which in turn
>> fails because the computed CRC must be unique.
>>
>> This needs to be fixed because the situation is not purely artificial.
>> The options are as I see it:
>>
>> 1) use PubMed instead of Medline if the latter is undef
>>
>> 2) look-up references by CRC rather than dbxref (medline or pubmed)
>>
>> 3) look-up, in this or any other order, by CRC; if not found, by
>> medline; if not found, by pubmed; omitting those look-ups where the key
>> value is undefined.
>>
>> Option 1) doesn't really solve the problem, because e.g. even though in
>> the concrete case at hand the first occurrence of the reference did
>> come with a PubMed ID, you still don't know at the second occurrence
>> that you have to look-up by PubMed now instead of Medline (which is
>> defined for the second).
>>
>> Option 2) relies on certain assumptions in order to work, namely a) all
>> instances of a reference are fully populated (wrt authors, location,
>> title), because otherwise you arrive at different CRC values, and b)
>> everyone inserting into the database uses the same CRC calculation
>> algorithm (no problem if you only use bioperl).
>>
>> Option 3) is the most robust (I actually don't quite see when it would
>> not work), but potentially costly, and creates a headache for
>> implementation because it violates the definition of alternative keys
>> (to locate an object it otherwise suffices to locate by any alternative
>> key whose value is defined, not by all of them).
>>
>> Does anyone have opinions, comments, or alternative suggestions?
>>
>> Siddhartha, in the meantime you can bypass failing entries by supplying
>> --safe on the command line.
>>
>> -hilmar
>>
>> On Tuesday, March 25, 2003, at 02:21 PM, Siddhartha Basu wrote:
>>
>>> Hi Hilmar,
>>>
>>>
>>> Hilmar Lapp wrote:
>>>> If you dropped it and re-created it that should have taken care of
>>>> records erroneously without NCBI taxon ID.
>>>>
>>>> To verify you can query before an upload:
>>>>
>>>> mysql> SELECT binomial, variant, ncbi_taxon_id FROM taxon WHERE
>>>> ncbi_taxon_id IS NULL;
>>>>
>>>> To confirm for Homo sapiens:
>>>>
>>>> mysql> SELECT * FROM taxon WHERE binomial = 'Homo sapiens' AND
>>>> (ncbi_taxon_id IS NULL OR ncbi_taxon_id != 9606);
>>>>
>>>> Neither of the 2 queries should return any rows.
>>>>
>>> Done that, no rows returned.
>>>
>>>
>>>> If they don't and you still get this error then look in your input
>>>> file for the first occurrence of Homo sapiens as species for the
>>>> sequence. Does it come with NCBI taxon ID?
>>> Yes it does.
>>>
>>>> If yes, look for the second sequence of Homo sapiens. Does it have
>>>> accession# P42655? If yes (*), truncate the taxon table and create
>>>> a new input file with only the first Homo sapiens sequence entry
>>>> (which supposedly has a taxon ID). Try to load the single-entry
>>>> file.
>>> Followed the instruction and it's loaded properly.
>>>
>>>> After that, check your taxon table. There should be Homo sapiens.
>>>> If it lacks the NCBI taxon ID (*), the problem is with the parser
>>>> not parsing the taxon ID out of the input.
>>> It has the NCBI taxon ID.
>>>
>>>
>>>>
>>>> (*) if you have to answer 'no' here, there's possibly something
>>>> weird going on that would need to be fully debugged. You can try to
>>>> run load_seqdatabase.pl with --debug and send me the output.
>>> Executed load_seqdatabase.pl with --debug and the output is included
>>> in the attachment.
>>>
>>>
>>> Siddhartha
>>>
>>>
>>>>
>>>> -hilmar
>>>>
>>> <debuginfo.tar.gz>
>>
>
> --
> Aaron J Mackey
> Pearson Laboratory
> University of Virginia
> (434) 924-2821
> amackey at virginia.edu
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------