[Bioperl-l] Error in loading into biosql database
Aaron J Mackey
ajm6q at virginia.edu
Tue Mar 25 20:34:58 EST 2003
Does the CRC calculation include the identifier? Including it would
solve your problem cleanly, but leave you with ugly data (two references
that represent the same literature, one with a PubMed ID and the other
with a Medline ID).
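
To make the trade-off concrete, here is a minimal sketch in Python. It
does not reproduce what bioperl-db actually computes: the function
reference_crc, the field separator, the use of zlib.crc32, and all field
values other than the Medline ID are assumptions for illustration only.

```python
import zlib

def reference_crc(authors, title, location, identifier=None):
    """Hypothetical CRC over a reference's fields (not bioperl-db's algorithm)."""
    parts = [authors, title, location]
    if identifier is not None:
        parts.append(identifier)
    return zlib.crc32("|".join(parts).encode("utf-8"))

# The same paper as seen in two entries (placeholder field values).
fields = ("Author A., Author B.", "Example title", "J. Example 1:1-10 (2000)")

# Identifier excluded: both occurrences yield the same CRC, so a second
# insert collides with a unique constraint on the CRC column.
same_without_id = reference_crc(*fields) == reference_crc(*fields)

# Identifier included: the inserts no longer collide, but the table now
# holds two rows for one paper.
crc_pubmed = reference_crc(*fields, identifier="PUBMED:PLACEHOLDER")
crc_medline = reference_crc(*fields, identifier="MEDLINE:20489853")
differ_with_id = crc_pubmed != crc_medline
```

With the identifier left out, the two occurrences are indistinguishable
by CRC; with it included they become distinct rows, which is the "ugly
data" case.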
I was never in on this conversation, but what exactly is being stored by
bioperl-db, and where? If I have both a medline and pubmed ID, which goes
into the dbxref accession? And does the other go into a
dbxref_qualifier_value? Maybe both should go in qualifier_values that you
can later test?
-Aaron
On Tue, 25 Mar 2003, Hilmar Lapp wrote:
> Siddhartha,
>
> according to the --debug log you sent, and unlike what you originally
> reported, there is no problem with accession# P42655, nor is there a
> problem with the species insertion or look-up. Apparently one of my
> instructions did fix the original problem.
>
> Instead, the exception that terminates the upload is due to a reference
> entry with medline ID 20489853 in accession# P45954. This reference is
> first encountered in accession# Q9UKU7, but without a medline ID
> (swissprot only has the PubMed ID there). When it is encountered again
> in accession# P45954, there is a Medline ID specified. The look-up for
> the Medline ID 20489853 fails, triggering an insert, which in turn
> fails because the computed CRC must be unique.
>
> This needs to be fixed because the situation is not purely artificial.
> The options are as I see it:
>
> 1) use PubMed instead of Medline if the latter is undef
>
> 2) look-up references by CRC rather than dbxref (medline or pubmed)
>
> 3) look up by each key in turn, in this or any other order: by CRC; if
> not found, by Medline ID; if not found, by PubMed ID; omitting those
> look-ups where the key value is undefined.
>
> Option 1) doesn't really solve the problem, because e.g. even though in
> the concrete case at hand the first occurrence of the reference did
> come with a PubMed ID, you still don't know at the second occurrence
> that you have to look-up by PubMed now instead of Medline (which is
> defined for the second).
>
> Option 2) relies on certain assumptions in order to work, namely a) all
> instances of a reference are fully populated (wrt authors, location,
> title), because otherwise you arrive at different CRC values, and b)
> everyone inserting into the database uses the same CRC calculation
> algorithm (no problem if you only use bioperl).
>
> Option 3) is the most robust (I actually don't quite see when it would
> not work), but potentially costly, and creates a headache for
> implementation because it violates the definition of alternative keys
> (to locate an object it otherwise suffices to locate by any alternative
> key whose value is defined, not by all of them).
>
> Does anyone have opinions, comments, or alternative suggestions?
>
> Siddhartha, in the meantime you can bypass failing entries by supplying
> --safe on the command line.
>
> -hilmar
>
> On Tuesday, March 25, 2003, at 02:21 PM, Siddhartha Basu wrote:
>
> > Hi Hilmar,
> >
> >
> > Hilmar Lapp wrote:
> >> If you dropped it and re-created it that should have taken care of
> >> records erroneously without NCBI taxon ID.
> >>
> >> To verify you can query before an upload:
> >>
> >> mysql> SELECT binomial, variant, ncbi_taxon_id FROM taxon WHERE
> >> ncbi_taxon_id IS NULL;
> >>
> >> To confirm for Homo sapiens:
> >>
> >> mysql> SELECT * FROM taxon WHERE binomial = 'Homo sapiens' AND
> >> (ncbi_taxon_id IS NULL OR ncbi_taxon_id != 9606);
> >>
> >> Neither of the 2 queries should return any rows.
> >>
> > Done that, no rows returned.
> >
> >
> >> If they don't and you still get this error, then look in your input
> >> file for the first occurrence of Homo sapiens as species for the
> >> sequence. Does it come with NCBI taxon ID?
> > Yes it does.
> >
> >> if yes,
> >> look for the second sequence of Homo sapiens. Does it have accession#
> >> P42655? If yes (*), truncate the taxon table and create a new input
> >> file with only the first Homo sapiens sequence entry (which supposedly
> >> has a taxon ID). Try to load the single-entry file.
> > Followed the instruction and it's loaded properly.
> >
> >> After that, check your
> >> taxon table. There should be Homo sapiens. If it lacks the NCBI taxon
> >> ID (*), the problem is with the parser not parsing the taxon ID out of
> >> the input.
> > It has the NCBI taxon ID.
> >
> >
> >>
> >> (*) if you have to answer 'no' here, there's possibly something weird
> >> going on that would need to be fully debugged. You can try to run
> >> load_seqdatabase.pl with --debug and send me the output.
> > Executed load_seqdatabase.pl with --debug and the output is included in
> > the attachment.
> >
> >
> > Siddhartha
> >
> >
> >>
> >> -hilmar
> >>
> > <debuginfo.tar.gz>
>
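
Option 3 in Hilmar's message above (try every alternative key whose
value is defined) can be sketched roughly as follows. This is an
in-memory illustration in Python, not bioperl-db code: find_reference,
store_reference, the dict-based store, the CRC value, and the PubMed
placeholder are all hypothetical. Back-filling a missing identifier on a
match is an added assumption that the thread does not discuss.

```python
def find_reference(store, crc=None, medline=None, pubmed=None):
    """Try each defined key in turn; return the first match or None."""
    for index, key in (("crc", crc), ("medline", medline), ("pubmed", pubmed)):
        if key is None:
            continue  # omit look-ups where the key value is undefined
        record = store[index].get(key)
        if record is not None:
            return record
    return None

def store_reference(store, ref):
    """Insert ref unless an existing record matches on any defined key."""
    existing = find_reference(
        store, ref.get("crc"), ref.get("medline"), ref.get("pubmed"))
    if existing is not None:
        # Assumption: fill in identifiers the stored record was missing.
        for index in ("medline", "pubmed"):
            if existing.get(index) is None and ref.get(index) is not None:
                existing[index] = ref[index]
                store[index][ref[index]] = existing
        return existing
    for index in ("crc", "medline", "pubmed"):
        if ref.get(index) is not None:
            store[index][ref[index]] = ref
    return ref

store = {"crc": {}, "medline": {}, "pubmed": {}}

# First occurrence (as in Q9UKU7): PubMed ID only. The CRC and PubMed
# values are placeholders.
first = store_reference(
    store, {"crc": 1234, "pubmed": "PUBMED-PLACEHOLDER", "medline": None})

# Second occurrence (as in P45954): same CRC, Medline ID now present.
second = store_reference(
    store, {"crc": 1234, "pubmed": "PUBMED-PLACEHOLDER", "medline": "20489853"})
```

Here the second call finds the existing record by CRC instead of
attempting a duplicate insert, and a later look-up by Medline ID alone
also resolves to the same record. The cost Hilmar mentions shows up as
up to three look-ups per reference.
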
--
Aaron J Mackey
Pearson Laboratory
University of Virginia
(434) 924-2821
amackey at virginia.edu